[00:00:04] Deploy window WMF Holiday (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191128T0000) [00:00:52] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2242.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911280000_dzahn_213079_mw22... [00:01:55] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2218.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911280001_dzahn_213234_mw22... [00:02:00] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2218.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2218.codfw.wmnet'] ` [00:04:15] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10Jclark-ctr) [00:04:26] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom silver/WMF3434 - https://phabricator.wikimedia.org/T191357 (10Jclark-ctr) [00:05:27] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: decom radium - https://phabricator.wikimedia.org/T203861 (10Jclark-ctr) [00:06:56] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10Jclark-ctr) [00:07:37] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Remove labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T209642 (10Jclark-ctr) [00:08:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission oxygen.eqiad.wmnet - https://phabricator.wikimedia.org/T211826 
(10Jclark-ctr) [00:09:51] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10decommission: Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (10Jclark-ctr) [00:10:49] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1069 - https://phabricator.wikimedia.org/T227166 (10Jclark-ctr) [00:11:27] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956 (10Jclark-ctr) [00:12:07] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (10Jclark-ctr) [00:12:49] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission dbproxy1005.eqiad.wmnet - https://phabricator.wikimedia.org/T231967 (10Jclark-ctr) [00:14:35] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2244.codfw.wmnet'] ` and were **ALL** successful. [00:22:09] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [00:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:08] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [00:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:15] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [00:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:26:21] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [00:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:32:37] 10Operations, 10ops-eqiad: Degraded RAID on an-worker1089 - https://phabricator.wikimedia.org/T239365 (10Jclark-ctr) Confirmed: Service Request 1004524493 was successfully submitted. 
[00:54:59] (03PS1) 10DannyS712: Enable partial blocks on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553431 (https://phabricator.wikimedia.org/T239370) [00:55:41] (03PS2) 10DannyS712: Enable partial blocks on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553431 (https://phabricator.wikimedia.org/T239370) [00:57:55] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2236.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911280057_dzahn_224625_mw22... [00:59:02] !log mw2244 restart php-fpm and apache which somehow are returning 5xx after reimage [00:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:15] (03PS1) 10Aklapper: phabricator weekly project changes email: Add more info about new assignees [puppet] - 10https://gerrit.wikimedia.org/r/553432 (https://phabricator.wikimedia.org/T227388) [01:05:55] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2227.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911280105_dzahn_228153_mw22... [01:07:29] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2242.codfw.wmnet'] ` and were **ALL** successful. [01:11:18] (03CR) 10Dzahn: [C: 03+2] "tested (as aklapper which is easiest now since the puppetized mysql access). works!" 
[puppet] - 10https://gerrit.wikimedia.org/r/553432 (https://phabricator.wikimedia.org/T227388) (owner: 10Aklapper) [01:12:40] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2218.codfw.wmnet'] ` and were **ALL** successful. [01:15:31] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2219.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911280115_dzahn_230669_mw22... [01:16:04] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2241.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911280115_dzahn_230758_mw22... [01:18:24] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10Dzahn) [01:19:09] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [01:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:21:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [01:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:09] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [01:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:29:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [01:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:34:00] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Krinkle) 10Operations, 10Traffic, 
10Patch-For-Review, 10Performance-Team (Radar): 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Krinkle) [01:34:22] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Krinkle) [01:36:47] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [01:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:37:19] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [01:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:38:52] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [01:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:41:05] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [01:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:52:34] RECOVERY - Wikitech and wt-static content in sync on labweb1002 is OK: wikitech-static OK - wikitech and wikitech-static in sync (99084 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [01:52:34] RECOVERY - Wikitech and wt-static content in sync on labweb1001 is OK: wikitech-static OK - wikitech and wikitech-static in sync (99084 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [01:55:48] 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (10tstarling) >>! In T236963#5696069, @Legoktm wrote: > @tstarling is the gpg key that you used to sign that release available anywhere?... 
[02:03:42] RECOVERY - Wikitech and wt-static content in sync on cloudweb2001-dev is OK: wikitech-static OK - wikitech and wikitech-static in sync (99084 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [02:05:58] 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (10tstarling) keys.txt only has my 2008 and 2009 keys, since that's when I was doing MediaWiki releases. [02:09:17] (03PS1) 10Dzahn: assign IPs for gerrit1002 and gerrit-test [dns] - 10https://gerrit.wikimedia.org/r/553437 (https://phabricator.wikimedia.org/T239151) [02:09:40] (03CR) 10jerkins-bot: [V: 04-1] assign IPs for gerrit1002 and gerrit-test [dns] - 10https://gerrit.wikimedia.org/r/553437 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [02:09:54] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2236.codfw.wmnet'] ` and were **ALL** successful. [02:13:39] 10Operations, 10Gerrit, 10vm-requests, 10Patch-For-Review: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10Dzahn) >>! In T239151#5698138, @thcipriani wrote: > Lowered the memory request as that seems to be out-of-line with the usage of most Ganeti VMs, hopefully 16G would wo... [02:14:01] (03PS2) 10Dzahn: assign IPs for gerrit1002 and gerrit-test [dns] - 10https://gerrit.wikimedia.org/r/553437 (https://phabricator.wikimedia.org/T239151) [02:14:24] (03CR) 10jerkins-bot: [V: 04-1] assign IPs for gerrit1002 and gerrit-test [dns] - 10https://gerrit.wikimedia.org/r/553437 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [02:17:18] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2227.codfw.wmnet'] ` and were **ALL** successful. 
[02:17:35] (03PS1) 10Dzahn: installserver: add gerrit1002 with flat/VM partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/553438 (https://phabricator.wikimedia.org/T239151) [02:22:01] (03PS3) 10Dzahn: assign IPs for gerrit1002 and gerrit-test [dns] - 10https://gerrit.wikimedia.org/r/553437 (https://phabricator.wikimedia.org/T239151) [02:22:24] (03CR) 10jerkins-bot: [V: 04-1] assign IPs for gerrit1002 and gerrit-test [dns] - 10https://gerrit.wikimedia.org/r/553437 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [02:23:57] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2219.codfw.wmnet'] ` and were **ALL** successful. [02:25:38] (03PS4) 10Dzahn: assign IPs for gerrit1002 and gerrit-test [dns] - 10https://gerrit.wikimedia.org/r/553437 (https://phabricator.wikimedia.org/T239151) [02:27:28] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2241.codfw.wmnet'] ` and were **ALL** successful. 
[02:33:10] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10Dzahn) [02:50:22] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:51:02] PROBLEM - puppet last run on an-tool1007 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [02:51:20] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:03:43] !log restarting keyholder on acmechief[12]001 [03:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:04:53] (03PS1) 10Krinkle: mediawiki: Avoid unsafe ob_start inside php7-fatal-error.php [puppet] - 10https://gerrit.wikimedia.org/r/553439 (https://phabricator.wikimedia.org/T236832) [03:18:58] (03PS2) 10Krinkle: mediawiki: Avoid unsafe ob_start inside php7-fatal-error.php [puppet] - 10https://gerrit.wikimedia.org/r/553439 (https://phabricator.wikimedia.org/T236832) [03:23:11] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar), 10Performance-Team-publish: Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315 (10Krinkle) >>! In T181315#4116607, @gerritbot wrote: > Change 425045 at [[https:... 
[03:25:38] (03PS3) 10Krinkle: mediawiki: Avoid unsafe ob_start inside php7-fatal-error.php [puppet] - 10https://gerrit.wikimedia.org/r/553439 (https://phabricator.wikimedia.org/T236832) [03:36:18] (03CR) 10Tim Starling: [C: 03+1] mediawiki: Avoid unsafe ob_start inside php7-fatal-error.php [puppet] - 10https://gerrit.wikimedia.org/r/553439 (https://phabricator.wikimedia.org/T236832) (owner: 10Krinkle) [03:36:28] 10Operations, 10Traffic, 10Performance-Team (Radar): User traffic sometimes gets HTTP 502 from ATS - https://phabricator.wikimedia.org/T239382 (10Krinkle) [04:00:08] (03CR) 10Krinkle: [C: 04-1] "I'm trying to verify that XHGui renders on that host, but doesn't seem to be where I think it is. Tried various versions of krinkle@xhgui1" [puppet] - 10https://gerrit.wikimedia.org/r/552357 (https://phabricator.wikimedia.org/T158837) (owner: 10Dzahn) [04:08:58] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:13:13] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) - https://phabricator.wikimedia.org/T224585 (10Phamhi) [04:14:07] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) - https://phabricator.wikimedia.org/T224585 (10Phamhi) [04:20:59] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) 
- https://phabricator.wikimedia.org/T224585 (10Phamhi) [04:28:32] PROBLEM - PHP opcache health on mw2219 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [04:28:58] PROBLEM - PHP opcache health on mw2236 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [04:36:54] PROBLEM - PHP opcache health on mw2227 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [04:36:58] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:48:16] RECOVERY - PHP opcache health on mw2236 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [04:48:25] (03PS1) 10Phamhi: cloudvps: rename+reimage labmon1002 as cloudmetrics1002 [puppet] - 10https://gerrit.wikimedia.org/r/553441 (https://phabricator.wikimedia.org/T224585) [04:57:58] RECOVERY - PHP opcache health on mw2227 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:00:06] RECOVERY - PHP opcache health on mw2219 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:50:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1119 after schema change', diff saved to https://phabricator.wikimedia.org/P9775 and previous config saved to /var/cache/conftool/dbconfig/20191128-055025-marostegui.json [05:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1118 for schema change', diff 
saved to https://phabricator.wikimedia.org/P9776 and previous config saved to /var/cache/conftool/dbconfig/20191128-055212-marostegui.json [05:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:43] 10Operations, 10Traffic, 10Inuka-Team (Kanban), 10Patch-For-Review, 10Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (10Nuria) I think the collection that is heavy on cookies and tracking should have been reviewed by our privacy engineer @JFis... [05:58:31] (03PS1) 10Marostegui: mariadb: Set db2067 to spare [puppet] - 10https://gerrit.wikimedia.org/r/553444 (https://phabricator.wikimedia.org/T233185) [06:01:13] (03CR) 10Marostegui: [C: 03+2] mariadb: Set db2067 to spare [puppet] - 10https://gerrit.wikimedia.org/r/553444 (https://phabricator.wikimedia.org/T233185) (owner: 10Marostegui) [06:02:25] !log Remove db2067 from tendril and zarcillo T233185 [06:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:29] T233185: Decommission db2067.codfw.wmnet - https://phabricator.wikimedia.org/T233185 [06:13:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [06:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [06:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:12] !log Remove db1061 from tendril and zarcillo - T238624 [06:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:16] T238624: Decommission db1061.eqiad.wmnet - https://phabricator.wikimedia.org/T238624 [06:15:36] (03PS1) 10Marostegui: mariadb: Remove references for db1061 [puppet] - 10https://gerrit.wikimedia.org/r/553445 (https://phabricator.wikimedia.org/T238624) [06:16:31] (03PS1) 10Marostegui: wmnet: Remove production entries for db1061 [dns] - 
10https://gerrit.wikimedia.org/r/553446 (https://phabricator.wikimedia.org/T238624) [06:16:33] (03CR) 10Marostegui: [C: 03+2] mariadb: Remove references for db1061 [puppet] - 10https://gerrit.wikimedia.org/r/553445 (https://phabricator.wikimedia.org/T238624) (owner: 10Marostegui) [06:18:03] (03CR) 10Marostegui: [C: 03+2] wmnet: Remove production entries for db1061 [dns] - 10https://gerrit.wikimedia.org/r/553446 (https://phabricator.wikimedia.org/T238624) (owner: 10Marostegui) [06:20:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission db1061.eqiad.wmnet - https://phabricator.wikimedia.org/T238624 (10Marostegui) a:05Marostegui→03Jclark-ctr [06:20:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission db1061.eqiad.wmnet - https://phabricator.wikimedia.org/T238624 (10Marostegui) Host ready for #dc-ops [06:21:52] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [06:54:39] RECOVERY - Disk space on an-tool1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-tool1007&var-datasource=eqiad+prometheus/ops [06:56:40] !log remove log files on an-tool1007 to free root partition space [06:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:24] (03PS1) 10Elukey: turnilo: limit logs to daemon.log/syslog [puppet] - 10https://gerrit.wikimedia.org/r/553448 [07:03:30] vgutierrez: --^ [07:05:08] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/19674/an-tool1007.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/553448 (owner: 10Elukey) [07:09:27] RECOVERY - puppet last run on an-tool1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:09:56] elukey: nice :D [07:11:07] RECOVERY - Check systemd state on an-tool1007 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:12:13] 10Operations, 10Prod-Kubernetes, 10Pybal, 10Traffic, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10Joe) Also to clarify further: Pybal does **none** of the meaningful load-balancing. Load-balancing between pods is done... [07:12:55] apparently turnilo was missing a restart, and it was spamming in the logs "no datasource etc.." [07:13:05] that was amplified to syslog and daemon.log [07:35:47] 10Operations, 10Prod-Kubernetes, 10Pybal, 10Traffic, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10Joe) [07:40:35] (03PS1) 10Urbanecm: Add sewikimedia to wikidataclient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553449 [07:41:30] (03CR) 10jerkins-bot: [V: 04-1] Add sewikimedia to wikidataclient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553449 (owner: 10Urbanecm) [07:46:10] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on an-worker1089 - https://phabricator.wikimedia.org/T239365 (10elukey) [07:46:12] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on an-worker1089 - https://phabricator.wikimedia.org/T239365 (10elukey) Thanks a lot! [07:47:11] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks! 
/me merging" [puppet] - 10https://gerrit.wikimedia.org/r/552947 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [07:48:22] (03CR) 10Alexandros Kosiaris: [C: 03+2] otrs: Switch from X-Real-IP to X-Client-IP [puppet] - 10https://gerrit.wikimedia.org/r/552514 (https://phabricator.wikimedia.org/T239340) (owner: 10Alexandros Kosiaris) [07:52:49] (03CR) 10Alexandros Kosiaris: [C: 03+2] SSL: add certificate for OTRS/ticket.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/553421 (owner: 10Dzahn) [07:54:45] (03PS2) 10Urbanecm: Add sewikimedia to wikidataclient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553449 (https://phabricator.wikimedia.org/T239318) [07:56:42] !log reimage mw1345, mw1335, mw1325 [07:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:59] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1345.eqiad.wmnet', 'mw1335.eqiad.wmnet', 'mw1325.eqiad.wmnet'] ` The log can be found in `/var/log/... [08:06:15] !log Compress labsdb1012 [08:06:17] 10Operations, 10Acme-chief, 10Traffic: memory leak on keyholder-proxy on buster/python 3.7 - https://phabricator.wikimedia.org/T239386 (10Vgutierrez) [08:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:10] 10Operations, 10Toolforge, 10observability, 10User-fgiunchedi, 10cloud-services-team (Kanban): Deprecate Diamond collectors in Tool Labs / Tool Forge - https://phabricator.wikimedia.org/T210991 (10MoritzMuehlenhoff) JFTR, there's no immediate hurry, it was removed from Debian unstable, i.e. Diamond will... [08:11:13] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review, 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) 
- https://phabricator.wikimedia.org/T224585 (10MoritzMuehlenhoff) Two nits: When reimaging the servers (or when it's done), please also update the C... [08:15:40] !log reimage mw2280, mw2281, mw2282 [08:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:57] !log Remove m4 from tendril and zarcillo - T159170 [08:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:02] T159170: Sunset MySQL data store for eventlogging - https://phabricator.wikimedia.org/T159170 [08:22:48] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [08:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:07] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw2280.codfw.wmnet', 'mw2281.codfw.wmnet', 'mw2282.codfw.wmnet'] ` The log can be found in `/var/log/... [08:24:55] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:10] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [08:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:18] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:53] (03CR) 10Alexandros Kosiaris: varnish/ATS: rename director for OTRS from mendelevium to otrs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553423 (owner: 10Dzahn) [08:43:40] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I guess we need to have ticket.discovery.wmnet DNS RR first?" 
[puppet] - 10https://gerrit.wikimedia.org/r/553424 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [08:47:27] (03CR) 10Alexandros Kosiaris: [C: 03+1] prometheus::k8s: drop envoy metrics about the admin interface [puppet] - 10https://gerrit.wikimedia.org/r/553246 (owner: 10Giuseppe Lavagetto) [08:53:20] (03CR) 10Alexandros Kosiaris: [C: 03+2] mediawiki: Avoid unsafe ob_start inside php7-fatal-error.php [puppet] - 10https://gerrit.wikimedia.org/r/553439 (https://phabricator.wikimedia.org/T236832) (owner: 10Krinkle) [08:56:44] !log Compress labsdb1011 [08:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:11] (03CR) 10Alexandros Kosiaris: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/553369 (https://phabricator.wikimedia.org/T236017) (owner: 10Giuseppe Lavagetto) [09:03:54] (03CR) 10Alexandros Kosiaris: [C: 03+1] "The diff between the 2 certs is essentially the serial number and the validity. I think this should do fine." [puppet] - 10https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T237362) (owner: 10Jbond) [09:11:17] 10Operations, 10Wikimedia-Logstash, 10observability, 10service-runner, 10Core Platform Team (Needs Cleaning - Services Operations): Move graphoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219923 (10akosiaris) >>! In T219923#5685137, @Pchelolo wrote: > Graphoid is likely going a... 
[09:17:11] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [09:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:37] !log reimage mw1266, mw1276 [09:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:14] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1266.eqiad.wmnet', 'mw1276.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20191... [09:19:18] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:36] (03CR) 10Muehlenhoff: install_server: standard recipe and raid1/raid10 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553363 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [09:35:14] (03CR) 10Masumrezarock100: [C: 03+1] Enable partial blocks on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553431 (https://phabricator.wikimedia.org/T239370) (owner: 10DannyS712) [09:37:11] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [09:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:20] 10Operations, 10Acme-chief, 10Traffic: memory leak on keyholder-proxy on buster/python 3.7 - https://phabricator.wikimedia.org/T239386 (10Volans) I'm doing a quick debug attempt on `acmechief-test2001` [09:38:25] (03CR) 10Muehlenhoff: "Looks good, two comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553367 (owner: 10Jbond) [09:39:17] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:17] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [09:40:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:25] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:26] 10Operations, 10Gerrit-Privilege-Requests, 10Release-Engineering-Team, 10Wikidata, 10Wikidata-Query-Service: Add dcausse to wikidata-query-deploy - https://phabricator.wikimedia.org/T239341 (10dcausse) [09:48:55] RECOVERY - Check systemd state on idp2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:51:23] !log swift eqiad-prod: more weight to ms-be105[7-9] - T237438 [09:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:28] T237438: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 [09:52:35] (03PS3) 10Jbond: apereo_cas: ensure we log the correct client ip address and not nginx's [puppet] - 10https://gerrit.wikimedia.org/r/553367 [09:52:52] (03CR) 10Jbond: "thanks" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553367 (owner: 10Jbond) [09:54:01] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [09:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:08] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:35] 10Operations, 10Pybal, 10SRE-tools, 10Traffic, 10serviceops: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. 
- https://phabricator.wikimedia.org/T239392 (10Joe) [09:57:22] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/553367 (owner: 10Jbond) [09:59:20] 10Operations, 10Pybal, 10SRE-tools, 10Traffic, 10serviceops: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. - https://phabricator.wikimedia.org/T239392 (10Joe) [10:02:56] (03PS2) 10Filippo Giunchedi: prometheus: alert on exporter's 'up' metrics [puppet] - 10https://gerrit.wikimedia.org/r/553335 (https://phabricator.wikimedia.org/T187708) [10:03:57] (03CR) 10Filippo Giunchedi: prometheus: alert on exporter's 'up' metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553335 (https://phabricator.wikimedia.org/T187708) (owner: 10Filippo Giunchedi) [10:13:54] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2280.codfw.wmnet', 'mw2281.codfw.wmnet', 'mw2282.codfw.wmnet'] ` and were **ALL** successful. [10:19:10] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: alert on exporter's 'up' metrics [puppet] - 10https://gerrit.wikimedia.org/r/553335 (https://phabricator.wikimedia.org/T187708) (owner: 10Filippo Giunchedi) [10:24:51] ^ might generate some alerts, I'll keep an eye out [10:29:28] 10Operations, 10Pybal, 10SRE-tools, 10Traffic, 10serviceops: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. 
- https://phabricator.wikimedia.org/T239392 (10ema) p:05Triage→03Normal [10:33:13] (03CR) 10Filippo Giunchedi: install_server: standard recipe and raid1/raid10 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553363 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [10:39:14] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Gilles) [10:39:25] !log Compress labsdb1009 [10:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "This patch looks right. Please don't merge it until you are actually ready for the reimage. Also, this patch depends on one for operations" [puppet] - 10https://gerrit.wikimedia.org/r/553441 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [10:43:39] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [10:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:41] (03CR) 10Ladsgroup: Add sewikimedia to wikidataclient (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553449 (https://phabricator.wikimedia.org/T239318) (owner: 10Urbanecm) [10:45:47] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:41] (03CR) 10Urbanecm: Add sewikimedia to wikidataclient (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553449 (https://phabricator.wikimedia.org/T239318) (owner: 10Urbanecm) [10:51:27] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [10:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:20] (03PS1) 10Volans: keyholder: fix memory leak in Python 3.7+ [puppet] - 10https://gerrit.wikimedia.org/r/553460 (https://phabricator.wikimedia.org/T239386) [10:53:32] !log 
jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:33] !log reimage mw2279 mw2278 mw2277 mw2276 mw2275 [10:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:13] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw2279.codfw.wmnet', 'mw2278.codfw.wmnet', 'mw2277.codfw.wmnet', 'mw2276.codfw.wmnet', 'mw2275.codfw.... [10:57:36] 10Operations, 10Traffic, 10Patch-For-Review: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (10ema) 05Open→03Resolved Reloading global Lua scripts now works. Closing. [11:02:09] 10Operations, 10Acme-chief, 10Traffic, 10Patch-For-Review: memory leak on keyholder-proxy on buster/python 3.7 - https://phabricator.wikimedia.org/T239386 (10Volans) I was able to debug the issue using `tracemalloc`: ` # at the top of the file import tracemalloc tracemalloc.start(5) # in the SshAgentProx... [11:02:15] (03CR) 10Volans: "More details available in the task, in particular in https://phabricator.wikimedia.org/T239386#5699449" [puppet] - 10https://gerrit.wikimedia.org/r/553460 (https://phabricator.wikimedia.org/T239386) (owner: 10Volans) [11:08:14] 10Operations, 10Wikimedia-Mailing-lists: Request for creation: Wiki Loves Africa Mailing List - https://phabricator.wikimedia.org/T239240 (10jbond) >>! In T239240#5697952, @Johan wrote: > I'm not really sure what you're looking for from us, @jbond but if you just want a sanity check: makes sense to me and the... 
[11:13:16] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [11:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:27] (03PS5) 10Jbond: tlsproxy::localssl: add parameter type checking [puppet] - 10https://gerrit.wikimedia.org/r/553365 [11:14:55] (03CR) 10Volans: "FYI some related documentation is available here:" [puppet] - 10https://gerrit.wikimedia.org/r/553441 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [11:15:26] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:42] PROBLEM - Nginx local proxy to jobrunner on mw2281 is CRITICAL: connect to address 10.192.48.103 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Jobrunner [11:19:56] ^ checking [11:20:00] PROBLEM - Nginx local proxy to videoscaler on mw2281 is CRITICAL: connect to address 10.192.48.103 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Jobrunner [11:24:32] RECOVERY - Nginx local proxy to jobrunner on mw2281 is OK: HTTP OK: HTTP/1.1 200 OK - 339 bytes in 0.154 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [11:24:54] RECOVERY - Nginx local proxy to videoscaler on mw2281 is OK: HTTP OK: HTTP/1.1 200 OK - 338 bytes in 0.152 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [11:26:21] 10Operations, 10Traffic: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10Seb35) For information, due to [[https://bugzilla.mozilla.org/show_bug.cgi?id=1002724|this bug in Firefox]], when the user type the URL without the "https://" prefix F... [11:30:11] 10Operations, 10Traffic: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10Vgutierrez) Hmm with HSTS the browser shouldn't even try port 80. 
[11:32:46] (03PS1) 10ArielGlenn: force start of statd on dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/553462 [11:33:13] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1345.eqiad.wmnet', 'mw1335.eqiad.wmnet', 'mw1325.eqiad.wmnet'] ` and were **ALL** successful. [11:33:36] (03CR) 10ArielGlenn: [C: 03+2] force start of statd on dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/553462 (owner: 10ArielGlenn) [11:35:05] (03CR) 10Hashar: "The code never has been deployed and beta is still broken T239399 :D" [puppet] - 10https://gerrit.wikimedia.org/r/553200 (owner: 10Jforrester) [11:36:48] !log reimage mw1344.eqiad.wmnet mw1334.eqiad.wmnet mw1324.eqiad.wmnet [11:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:15] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1344.eqiad.wmnet', 'mw1334.eqiad.wmnet', 'mw1324.eqiad.wmnet'] ` The log can be found in `/var/log/... [11:37:58] (03PS1) 10Muehlenhoff: Remove obsolete partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/553463 [11:39:34] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1266.eqiad.wmnet', 'mw1276.eqiad.wmnet'] ` and were **ALL** successful. 
[11:50:25] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [11:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:33] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:15] (03CR) 10Ema: tlsproxy::localssl: allow users to specify an upstream ip address (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553366 (owner: 10Jbond) [11:54:39] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 59.7 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [11:56:41] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 72.09 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [11:56:43] 10Operations, 10ops-esams, 10netops, 10procurement: mr1-esams RMA - https://phabricator.wikimedia.org/T238174 (10mark) [11:59:23] !log reimage mw1267.eqiad.wmnet mw1277.eqiad.wmnet [11:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:42] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1267.eqiad.wmnet', 'mw1277.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20191... 
[12:00:23] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [12:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:29] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:28] (03PS3) 10Jbond: tlsproxy::localssl: allow users to specify an upstream ip address [puppet] - 10https://gerrit.wikimedia.org/r/553366 [12:06:32] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (10DannyS712) Just came across this at https://en.wikipedia.org/wiki/Template_talk:; when I couldn't access the pa... [12:18:40] (03PS1) 10Phamhi: cloudvps: rename+reimage labmon1002 as cloudmetrics1002 [dns] - 10https://gerrit.wikimedia.org/r/553467 (https://phabricator.wikimedia.org/T224585) [12:21:48] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [12:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:53] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "We need to split this in two. In the first stage we add the new prod FQDNs and new mgmt FQDNs (while leaving old one as well). 
A later pat" [dns] - 10https://gerrit.wikimedia.org/r/553467 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [12:23:56] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:40] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [12:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:48] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:29] (03PS2) 10Ema: ATS: pass uncacheable requests [puppet] - 10https://gerrit.wikimedia.org/r/553132 (https://phabricator.wikimedia.org/T238494) [12:45:59] (03PS1) 10Alexandros Kosiaris: blubberoid: Harmonize eqiad limits/requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/553469 [12:50:14] (03PS3) 10Effie Mouzeli: Remove unused "multi" thumbor handler [puppet] - 10https://gerrit.wikimedia.org/r/505837 (https://phabricator.wikimedia.org/T221562) (owner: 10Gilles) [12:51:06] !log disable puppet on thumbor* [12:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge-k8s: simplify calico upgrades and distribute calicoctl [puppet] - 10https://gerrit.wikimedia.org/r/553418 (owner: 10Bstorm) [12:55:59] (03CR) 10Effie Mouzeli: [C: 03+2] Remove unused "multi" thumbor handler [puppet] - 10https://gerrit.wikimedia.org/r/505837 (https://phabricator.wikimedia.org/T221562) (owner: 10Gilles) [12:56:31] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] dnsrecursor: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/543803 (owner: 10Muehlenhoff) [13:08:52] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [13:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:01] !log jiji@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) [13:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:18] !log enable puppet on thumbor* [13:15:30] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [13:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:31] (03PS2) 10Phamhi: cloudvps: rename+reimage labmon1002 as cloudmetrics1002 [dns] - 10https://gerrit.wikimedia.org/r/553467 (https://phabricator.wikimedia.org/T224585) [13:15:34] (03PS1) 10Giuseppe Lavagetto: wmflib: add inject_secret [puppet] - 10https://gerrit.wikimedia.org/r/553473 [13:15:36] (03PS1) 10Giuseppe Lavagetto: deployment_server::helmfile: allow injecting secrets [puppet] - 10https://gerrit.wikimedia.org/r/553474 [13:15:50] (03CR) 10jerkins-bot: [V: 04-1] cloudvps: rename+reimage labmon1002 as cloudmetrics1002 [dns] - 10https://gerrit.wikimedia.org/r/553467 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [13:18:58] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:31] !log cumin 'netmon*' 'rm -v /var/spool/cron/crontabs/postgres' T238919 [13:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:36] T238919: Cleanup Netbox stuff from netmon hosts - https://phabricator.wikimedia.org/T238919 [13:24:29] (03PS3) 10Phamhi: cloudvps: rename+reimage labmon1002 as cloudmetrics1002 [dns] - 10https://gerrit.wikimedia.org/r/553467 (https://phabricator.wikimedia.org/T224585) [13:26:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1118 after schema change', diff saved to https://phabricator.wikimedia.org/P9777 and previous config saved to /var/cache/conftool/dbconfig/20191128-132647-marostegui.json [13:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:28:22] !log cleanup root's crontab entries on netmon hosts from netbox/postres stuff - T238919 [13:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:26] T238919: Cleanup Netbox stuff from netmon hosts - https://phabricator.wikimedia.org/T238919 [13:30:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106 for schema change, temporarily pool db1080 as vslow,dump', diff saved to https://phabricator.wikimedia.org/P9778 and previous config saved to /var/cache/conftool/dbconfig/20191128-133013-marostegui.json [13:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:38] !log Remove ar_comment triggers from db1124:3311 for enwiki.archive - T234704 [13:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:46] T234704: Remove ar_comment from sanitarium triggers - https://phabricator.wikimedia.org/T234704 [13:33:00] !log phamhi@cumin1001 START - Cookbook sre.hosts.downtime [13:33:00] !log phamhi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:39] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [13:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:36] 10Operations, 10netops: Librenms sessions are stored inside the deployment directory - https://phabricator.wikimedia.org/T239412 (10Volans) p:05Triage→03Normal [13:37:03] !log Recreate views for enwiki_p.protected_titles for all labsdb hosts - T233135 [13:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:08] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 [13:37:26] !log Deploy schema change on db1106 with replication (lag will appear on s1 on labs) - T234066 T233135 
[13:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:32] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066 [13:37:43] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:42] 10Operations, 10ops-esams, 10netops: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637 (10faidon) The nl-ams-as14907 anchor is now fully online and has ID #6671. [13:38:49] (03PS1) 10Faidon Liambotis: Add ripe-atlas-esams to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/553476 (https://phabricator.wikimedia.org/T174637) [13:38:57] godog: want to review ^ ? [13:44:03] paravoid: for sure! [13:44:35] thanks :)_ [13:46:14] (03PS4) 10Jbond: tlsproxy::localssl: allow users to specify an upstream ip address [puppet] - 10https://gerrit.wikimedia.org/r/553366 [13:46:41] (03CR) 10Jbond: "updated thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553366 (owner: 10Jbond) [13:46:43] (03CR) 10Filippo Giunchedi: [C: 03+1] Add ripe-atlas-esams to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/553476 (https://phabricator.wikimedia.org/T174637) (owner: 10Faidon Liambotis) [13:46:50] (03PS4) 10Jbond: apereo_cas: ensure we log the correct client ip address and not nginx's [puppet] - 10https://gerrit.wikimedia.org/r/553367 [13:47:29] unrelated but does the ripe atlas measurement map work for you ? 
I was checking the related measurements and the map takes forever to load and stalls [13:47:36] I didn't check [13:47:38] (03CR) 10Faidon Liambotis: [C: 03+2] Add ripe-atlas-esams to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/553476 (https://phabricator.wikimedia.org/T174637) (owner: 10Faidon Liambotis) [13:47:49] (03CR) 10Phamhi: [C: 03+2] cloudvps: rename+reimage labmon1002 as cloudmetrics1002 [puppet] - 10https://gerrit.wikimedia.org/r/553441 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [13:48:10] (03PS2) 10Phamhi: cloudvps: rename+reimage labmon1002 as cloudmetrics1002 [puppet] - 10https://gerrit.wikimedia.org/r/553441 (https://phabricator.wikimedia.org/T224585) [13:48:46] curious, anyways \o/ for all anchors in place [13:48:51] I'm sure there's a kubernetes joke in there [13:49:49] 10Operations, 10ops-esams, 10netops, 10Patch-For-Review: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637 (10faidon) [13:49:59] 10Operations, 10ops-esams, 10Epic: SRE 2017-18 Q3 goal Cleanup esams and refresh servers and infrastructure (tracking) - https://phabricator.wikimedia.org/T184061 (10faidon) [13:50:02] 10Operations, 10ops-esams, 10netops, 10Patch-For-Review: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637 (10faidon) 05Open→03Resolved a:03faidon All done! 
[13:50:05] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [13:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cloudvps: rename+reimage labmon1002 as cloudmetrics1002 [dns] - 10https://gerrit.wikimedia.org/r/553467 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [13:51:23] (03CR) 10Phamhi: [C: 03+2] cloudvps: rename+reimage labmon1002 as cloudmetrics1002 [dns] - 10https://gerrit.wikimedia.org/r/553467 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [13:52:17] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:08] !log Deploy schema change on s4 codfw master with replication - T234066 [13:57:09] !log start of mwscript extensions/Wikibase/repo/maintenance/rebuildPropertyTerms.php --wiki=wikidatawiki --batch-size 5 (T237984) [13:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:12] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066 [13:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:18] T237984: Some property labels are not displayed on Item pages - https://phabricator.wikimedia.org/T237984 [14:01:35] (03PS1) 10Volans: daemon: fix memory leak in Python 3.7+ [software/keyholder] - 10https://gerrit.wikimedia.org/r/553481 (https://phabricator.wikimedia.org/T239386) [14:03:13] (03CR) 10jerkins-bot: [V: 04-1] daemon: fix memory leak in Python 3.7+ [software/keyholder] - 10https://gerrit.wikimedia.org/r/553481 (https://phabricator.wikimedia.org/T239386) (owner: 10Volans) [14:08:19] (03CR) 10Jbond: [C: 04-1] "great improvement but the lookup signature is currently wrong" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553384 (https://phabricator.wikimedia.org/T236180) (owner: 10Dzahn) [14:11:45] 
(03PS2) 10Volans: daemon: fix memory leak in Python 3.7+ [software/keyholder] - 10https://gerrit.wikimedia.org/r/553481 (https://phabricator.wikimedia.org/T239386) [14:12:09] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2279.codfw.wmnet', 'mw2278.codfw.wmnet', 'mw2277.codfw.wmnet', 'mw2276.codfw.wmnet', 'mw2275.codfw.wmnet'] ` and were **ALL** successful. [14:12:39] (03PS2) 10Volans: keyholder: fix memory leak in Python 3.7+ [puppet] - 10https://gerrit.wikimedia.org/r/553460 (https://phabricator.wikimedia.org/T239386) [14:12:47] jouncebot next [14:12:47] In 9 hour(s) and 47 minute(s): No deploys! (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191129T0000) [14:12:50] jouncebot now [14:12:50] For the next 9 hour(s) and 47 minute(s): WMF Holiday (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191128T0000) [14:14:01] (03PS3) 10Volans: daemon: fix memory leak in Python 3.7+ [software/keyholder] - 10https://gerrit.wikimedia.org/r/553481 (https://phabricator.wikimedia.org/T239386) [14:15:37] (03CR) 10jerkins-bot: [V: 04-1] daemon: fix memory leak in Python 3.7+ [software/keyholder] - 10https://gerrit.wikimedia.org/r/553481 (https://phabricator.wikimedia.org/T239386) (owner: 10Volans) [14:16:49] (03PS4) 10Volans: daemon: fix memory leak in Python 3.7+ [software/keyholder] - 10https://gerrit.wikimedia.org/r/553481 (https://phabricator.wikimedia.org/T239386) [14:20:18] !log Deploy schema change on s3 codfw on the master, lag will appear on s3 codfw (T234066) [14:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:23] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066 [14:20:45] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [14:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:53] !log 
jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:52] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1267.eqiad.wmnet', 'mw1277.eqiad.wmnet'] ` and were **ALL** successful. [14:25:36] 10Operations, 10Wikimedia-Mailing-lists: Request for creation: Wiki Loves Africa Mailing List - https://phabricator.wikimedia.org/T239240 (10jbond) 05Open→03Resolved a:03jbond Hello Isaac, I have now created the wikilovesafrica mailing list. you should now be able to access the [[ https://lists.wikimed... [14:29:03] !log reimage mw1343.eqiad.wmnet mw1342.eqiad.wmnet mw1341.eqiad.wmnet [14:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:36] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1343.eqiad.wmnet', 'mw1342.eqiad.wmnet', 'mw1341.eqiad.wmnet'] ` The log can be found in `/var/log/... 
[14:29:39] (03PS2) 10Filippo Giunchedi: install_server: standard recipe and raid1/raid10 [puppet] - 10https://gerrit.wikimedia.org/r/553363 (https://phabricator.wikimedia.org/T156955) [14:29:41] (03PS1) 10Filippo Giunchedi: install_server: move custom partman recipes to partman/custom [puppet] - 10https://gerrit.wikimedia.org/r/553482 (https://phabricator.wikimedia.org/T156955) [14:29:43] (03PS1) 10Filippo Giunchedi: install_server: apply standard partman recipes, take #1 [puppet] - 10https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) [14:36:44] (03Abandoned) 10Filippo Giunchedi: install_server: standard partman example [puppet] - 10https://gerrit.wikimedia.org/r/553364 (owner: 10Filippo Giunchedi) [14:47:15] (03PS1) 10Jbond: CI - taskgen: add black tests for python2 and python3 files [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) [14:47:17] (03PS1) 10Jbond: ci - taskgen: test black ci on individual files [puppet] - 10https://gerrit.wikimedia.org/r/553488 (https://phabricator.wikimedia.org/T239334) [14:49:35] (03CR) 10jerkins-bot: [V: 04-1] ci - taskgen: test black ci on individual files [puppet] - 10https://gerrit.wikimedia.org/r/553488 (https://phabricator.wikimedia.org/T239334) (owner: 10Jbond) [14:50:05] (03CR) 10jerkins-bot: [V: 04-1] CI - taskgen: add black tests for python2 and python3 files [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) (owner: 10Jbond) [14:51:44] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [14:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:30] PROBLEM - mediawiki-installation DSH group on mw1325 is CRITICAL: Host mw1325 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [14:53:53] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:53:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:57] 10Operations, 10Traffic: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10Seb35) Yes, indeed, I have to precise my test was with a non-HSTS site, and it seems there is no issue with HSTS-preloaded sites according to [[https://bugzilla.mozill... [15:00:19] (03PS1) 10Ema: ATS: re-use origin server connections for matching IPs [puppet] - 10https://gerrit.wikimedia.org/r/553490 (https://phabricator.wikimedia.org/T238494) [15:01:01] (03PS2) 10Jbond: CI - taskgen: add black tests for python2 and python3 files [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) [15:01:15] (03PS2) 10Jbond: ci - taskgen: test black ci on individual files [puppet] - 10https://gerrit.wikimedia.org/r/553488 (https://phabricator.wikimedia.org/T239334) [15:03:22] (03CR) 10jerkins-bot: [V: 04-1] ci - taskgen: test black ci on individual files [puppet] - 10https://gerrit.wikimedia.org/r/553488 (https://phabricator.wikimedia.org/T239334) (owner: 10Jbond) [15:03:55] (03CR) 10jerkins-bot: [V: 04-1] CI - taskgen: add black tests for python2 and python3 files [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) (owner: 10Jbond) [15:04:52] !log reimage mw1333.eqiad.wmnet mw1332.eqiad.wmnet mw1331.eqiad.wmnet [15:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:33] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1333.eqiad.wmnet', 'mw1332.eqiad.wmnet', 'mw1331.eqiad.wmnet'] ` The log can be found in `/var/log/... 
[15:08:15] (03PS3) 10Jbond: CI - taskgen: add black tests for python2 and python3 files [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) [15:09:35] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1344.eqiad.wmnet', 'mw1334.eqiad.wmnet', 'mw1324.eqiad.wmnet'] ` and were **ALL** successful. [15:10:41] (03CR) 10jerkins-bot: [V: 04-1] CI - taskgen: add black tests for python2 and python3 files [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) (owner: 10Jbond) [15:10:54] (03PS3) 10Jbond: ci - taskgen: test black ci on individual files [puppet] - 10https://gerrit.wikimedia.org/r/553488 (https://phabricator.wikimedia.org/T239334) [15:12:28] (03PS4) 10Jbond: CI - taskgen: add black tests for python2 and python3 files [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) [15:12:58] (03CR) 10jerkins-bot: [V: 04-1] ci - taskgen: test black ci on individual files [puppet] - 10https://gerrit.wikimedia.org/r/553488 (https://phabricator.wikimedia.org/T239334) (owner: 10Jbond) [15:13:31] (03PS4) 10Jbond: ci - taskgen: test black ci on individual files [puppet] - 10https://gerrit.wikimedia.org/r/553488 (https://phabricator.wikimedia.org/T239334) [15:14:48] (03CR) 10jerkins-bot: [V: 04-1] CI - taskgen: add black tests for python2 and python3 files [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) (owner: 10Jbond) [15:15:36] (03CR) 10jerkins-bot: [V: 04-1] ci - taskgen: test black ci on individual files [puppet] - 10https://gerrit.wikimedia.org/r/553488 (https://phabricator.wikimedia.org/T239334) (owner: 10Jbond) [15:18:21] 10Operations, 10Puppet, 10Patch-For-Review, 10User-ArielGlenn, 10User-jbond: Python3 style guide - https://phabricator.wikimedia.org/T239334 (10jbond) The `.git` directory is 
not versioned, therefore a git hook is by nature, opt-in. If we go down the route of using black i would suggest creating a git h... [15:26:01] 10Operations, 10Puppet, 10Patch-For-Review, 10User-ArielGlenn, 10User-jbond: Python3 style guide - https://phabricator.wikimedia.org/T239334 (10Volans) @jbond on the CI instances you have 3.4, 3.5, 3.6 and 3.7 available although the system one is 3.5. Faidon did the packaging a while ago and if you see t... [15:27:34] jbond42, volans: https://people.debian.org/~paravoid/python-all/ is the latest of ^^^ [15:28:43] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [15:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:40] thanks [15:29:57] that's not to say that CI shouldn't probably rely on something more recent :P [15:30:51] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:57] PROBLEM - OSPF status on cr1-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:43:49] RECOVERY - OSPF status on cr1-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:46:19] mmmm [15:46:52] (03PS5) 10Jbond: tlsproxy::localssl: allow users to specify an upstream ip address [puppet] - 10https://gerrit.wikimedia.org/r/553366 [15:48:25] 08Warning Alert for device cr2-eqsin.wikimedia.org - Traffic on tunnel link [15:49:18] the neighborg down was cr1-codfw (103.102.166.139) [15:50:09] that is the Telia link [15:50:33] so I guess that the warning is related to a temp switch to the tunnel to ulsfo? 
[15:52:47] (03PS1) 10Muehlenhoff: Switch log level back to DEBUG [puppet] - 10https://gerrit.wikimedia.org/r/553496 [15:53:25] yes https://librenms.wikimedia.org/graphs/to=1574956200/id=17856/type=port_bits/from=1574869800/ [15:54:16] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/553496 (owner: 10Muehlenhoff) [15:55:02] 10Operations, 10Pybal, 10SRE-tools, 10Traffic, 10serviceops: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. - https://phabricator.wikimedia.org/T239392 (10ema) We could also think of writing a sort of HTTP router that returns a list of PyBal AP... [15:55:10] (03CR) 10Muehlenhoff: [C: 03+2] Switch log level back to DEBUG [puppet] - 10https://gerrit.wikimedia.org/r/553496 (owner: 10Muehlenhoff) [15:55:31] ok it is zero now, all stable [15:55:46] no idea exactly what happened, but let's keep an eye [15:56:38] (03CR) 10Ema: [C: 03+1] tlsproxy::localssl: allow users to specify an upstream ip address [puppet] - 10https://gerrit.wikimedia.org/r/553366 (owner: 10Jbond) [15:57:28] 08̶W̶a̶r̶n̶i̶n̶g Device cr2-eqsin.wikimedia.org recovered from Traffic on tunnel link [15:58:18] (03CR) 10Ema: [C: 03+1] tlsproxy::localssl: add parameter type checking [puppet] - 10https://gerrit.wikimedia.org/r/553365 (owner: 10Jbond) [15:58:34] !log reimage mw1311.eqiad.wmnet [15:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:58] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1311.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201911281558_jiji_182152.log`. [15:59:16] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) 
- https://phabricator.wikimedia.org/T224585 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by phamhi on cumin1001.eqiad.wmnet for hosts: ` labmon1002.eqiad.w... [16:01:15] 10Operations, 10Pybal, 10SRE-tools, 10Traffic, 10serviceops: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. - https://phabricator.wikimedia.org/T239392 (10Joe) >>! In T239392#5700283, @ema wrote: > We could also think of writing a sort of HTTP r... [16:02:43] 10Operations, 10Wikimedia-Mailing-lists: Request for creation: Wiki Loves Africa Mailing List - https://phabricator.wikimedia.org/T239240 (10Wikicology) Thank you jbond. Your help is much appreciated. Isaac [16:05:14] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [16:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:23] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:06] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1333.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201911281608_jiji_185447.log`. 
[16:12:45] (03PS5) 10Jbond: apereo_cas: ensure we log the correct client ip address and not nginx's [puppet] - 10https://gerrit.wikimedia.org/r/553367 [16:13:02] (03CR) 10Jbond: [C: 03+2] apereo_cas: ensure we log the correct client ip address and not nginx's [puppet] - 10https://gerrit.wikimedia.org/r/553367 (owner: 10Jbond) [16:13:07] (03CR) 10Jbond: [C: 03+2] tlsproxy::localssl: allow users to specify an upstream ip address [puppet] - 10https://gerrit.wikimedia.org/r/553366 (owner: 10Jbond) [16:13:11] (03CR) 10Jbond: [C: 03+2] tlsproxy::localssl: add parameter type checking [puppet] - 10https://gerrit.wikimedia.org/r/553365 (owner: 10Jbond) [16:15:39] !log phamhi@cumin1001 START - Cookbook sre.hosts.downtime [16:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:49] !log phamhi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:18:53] (03PS1) 10Jbond: profile::idp: ensure upstream_port is a stdlib::port not string [puppet] - 10https://gerrit.wikimedia.org/r/553504 [16:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:01] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [16:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:06] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/553504 (owner: 10Jbond) [16:21:15] (03CR) 10Jbond: [C: 03+2] profile::idp: ensure upstream_port is a stdlib::port not string [puppet] - 10https://gerrit.wikimedia.org/r/553504 (owner: 10Jbond) [16:23:09] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:38] (03PS1) 10Jbond: profile::idp: proxy to localhost not the real ip address [puppet] - 10https://gerrit.wikimedia.org/r/553505 [16:25:00] (03PS1) 10Filippo Giunchedi: prometheus: fix wording and link for exporters availability [puppet] - 
10https://gerrit.wikimedia.org/r/553506 [16:26:06] 10Operations, 10Puppet, 10Patch-For-Review, 10User-ArielGlenn, 10User-jbond: Python3 style guide - https://phabricator.wikimedia.org/T239334 (10jbond) >>! In T239334#5700195, @Volans wrote: > @jbond on the CI instances you have 3.4, 3.5, 3.6 and 3.7 available although the system one is 3.5. Faidon did th... [16:26:12] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) - https://phabricator.wikimedia.org/T224585 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudmetrics1002.eqiad.wmnet'] ` and were **ALL** successful. [16:26:32] (03CR) 10Jbond: [C: 03+2] profile::idp: proxy to localhost not the real ip address [puppet] - 10https://gerrit.wikimedia.org/r/553505 (owner: 10Jbond) [16:27:37] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: fix wording and link for exporters availability [puppet] - 10https://gerrit.wikimedia.org/r/553506 (owner: 10Filippo Giunchedi) [16:30:07] 10Operations, 10Traffic, 10Performance-Team (Radar): User traffic sometimes gets HTTP 502 from ATS - https://phabricator.wikimedia.org/T239382 (10jbond) p:05Triage→03Normal [16:31:01] 10Operations, 10Acme-chief, 10Traffic, 10Patch-For-Review: memory leak on keyholder-proxy on buster/python 3.7 - https://phabricator.wikimedia.org/T239386 (10jbond) p:05Triage→03Normal [16:32:12] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [16:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:24] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:30] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) 
- https://phabricator.wikimedia.org/T224585 (10Phamhi) [16:36:52] (03PS2) 10Giuseppe Lavagetto: wmflib: add inject_secret [puppet] - 10https://gerrit.wikimedia.org/r/553473 [16:36:54] (03PS2) 10Giuseppe Lavagetto: deployment_server::helmfile: allow injecting secrets [puppet] - 10https://gerrit.wikimedia.org/r/553474 [16:36:56] (03PS1) 10Giuseppe Lavagetto: profile::etcd::v3: parametrize adv_client_port [puppet] - 10https://gerrit.wikimedia.org/r/553508 [16:39:45] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) - https://phabricator.wikimedia.org/T224585 (10Phamhi) [16:59:00] (03PS1) 10Cparle: Turn off redirect on exact search match for commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553510 (https://phabricator.wikimedia.org/T235263) [16:59:04] (03CR) 10Alexandros Kosiaris: "I guess needs to be rebased on top of I2b3d19df5a5c544fac0201a20c804f6022663c5e" [puppet] - 10https://gerrit.wikimedia.org/r/539203 (https://phabricator.wikimedia.org/T113114) (owner: 10Ladsgroup) [16:59:41] (03CR) 10Cparle: [C: 04-2] "Do not merge until the week after thanksgiving" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553510 (https://phabricator.wikimedia.org/T235263) (owner: 10Cparle) [17:00:04] (03CR) 10jerkins-bot: [V: 04-1] Turn off redirect on exact search match for commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553510 (https://phabricator.wikimedia.org/T235263) (owner: 10Cparle) [17:03:03] (03PS2) 10Cparle: Turn off redirect on exact search match for commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553510 (https://phabricator.wikimedia.org/T235263) [17:04:54] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1311.eqiad.wmnet'] ` and were **ALL** successful. 
[17:12:55] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [17:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:00] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:55] !log reimage mw1340.eqiad.wmnet mw1339.eqiad.wmnet [17:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:48] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1340.eqiad.wmnet', 'mw1339.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20191... [17:22:36] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1333.eqiad.wmnet'] ` and were **ALL** successful. [17:27:28] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki) [17:27:51] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove obsolete partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/553463 (owner: 10Muehlenhoff) [17:28:23] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki) [17:30:01] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1332.eqiad.wmnet', 'mw1331.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20191... 
[17:36:59] (03PS2) 10Giuseppe Lavagetto: profile::etcd::v3: parametrize adv_client_port [puppet] - 10https://gerrit.wikimedia.org/r/553508 [17:37:01] (03PS1) 10Giuseppe Lavagetto: role::etcd::v3::kubernetes: Add role for etcd3 backing k8s [puppet] - 10https://gerrit.wikimedia.org/r/553512 [17:37:56] 10Operations, 10MediaWiki-Cache, 10Performance-Team (Radar), 10User-Elukey: mcrouter does not remove a memcached shard from consistent hashing when timeouts happen - https://phabricator.wikimedia.org/T208934 (10elukey) We have ordered the hosts, 3 for eqiad and 3 for codfw (see related subtasks). The next... [17:38:15] effie: --^ let me know what you think about it when you have a minute [17:38:27] could be interesting with the current reimages ongoing [17:42:51] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [17:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:59] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:12] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [17:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:18] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:10] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1343.eqiad.wmnet', 'mw1342.eqiad.wmnet', 'mw1341.eqiad.wmnet'] ` and were **ALL** successful. 
[18:05:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1106 after schema change', diff saved to https://phabricator.wikimedia.org/P9779 and previous config saved to /var/cache/conftool/dbconfig/20191128-180517-marostegui.json [18:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:40] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1332.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201911281825_jiji_214589.log`. [18:28:57] !log reimage mw1319.eqiad.wmnet mw1318.eqiad.wmnet [18:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:02] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1319.eqiad.wmnet', 'mw1318.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20191...
[18:39:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1134 for schema change', diff saved to https://phabricator.wikimedia.org/P9780 and previous config saved to /var/cache/conftool/dbconfig/20191128-183918-marostegui.json [18:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:17] !log Deploy schema change on db1134 [18:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:44] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [18:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:51] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:14] PROBLEM - mediawiki-installation DSH group on mw1333 is CRITICAL: Host mw1333 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [18:52:13] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [18:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:18] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:00] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [18:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:05] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:22] PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:11:30] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:11:40] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:12:46] RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:12:54] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:13:04] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:32:54] RECOVERY - mediawiki-installation DSH group on mw1325 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:38:00] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1332.eqiad.wmnet'] ` and were **ALL** successful. [19:45:15] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1340.eqiad.wmnet', 'mw1339.eqiad.wmnet'] ` and were **ALL** successful. [19:48:17] !log reimage mw1331.eqiad.wmnet mw1330.eqiad.wmnet mw1310.eqiad.wmnet [19:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:56] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1331.eqiad.wmnet', 'mw1330.eqiad.wmnet', 'mw1310.eqiad.wmnet'] ` The log can be found in `/var/log/... 
[20:02:38] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [20:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:56] !log reimage mw1313.eqiad.wmnet [20:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:14] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1313.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201911282004_jiji_236588.log`. [20:04:45] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:59] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [20:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:13] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:02] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) - https://phabricator.wikimedia.org/T224585 (10Phamhi) I am looking into a recently discovered issue with uwsgi-graphite-web.service ` Nov 28 20:14:12 phamhi-labmon uwsg... [20:26:19] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [20:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:28] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:41] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1319.eqiad.wmnet', 'mw1318.eqiad.wmnet'] ` and were **ALL** successful. 
[21:11:14] !log reimage mw1316.eqiad.wmnet mw1315.eqiad.wmnet [21:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:39] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1316.eqiad.wmnet', 'mw1315.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20191... [21:16:54] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1313.eqiad.wmnet'] ` and were **ALL** successful. [21:19:04] !log reimage mw1323.eqiad.wmnet [21:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:51] !log reimage mw1309.eqiad.wmnet [21:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:18] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1313.eqiad.wmnet', 'mw1309.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20191... 
[21:23:57] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [21:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:08] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:20] RECOVERY - mediawiki-installation DSH group on mw1333 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:33:43] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [21:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:50] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:27] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [21:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:37] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:44] PROBLEM - Host cp1087 is DOWN: PING CRITICAL - Packet loss = 100% [22:34:16] PROBLEM - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance=cp2023:9536 site=codfw tunnel={cp1087_v4,cp1087_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [22:35:19] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [22:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:28] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:00] PROBLEM - Aggregate IPsec Tunnel Status esams on icinga1001 is CRITICAL: instance=cp3064:9536 site=esams tunnel={cp1087_v4,cp1087_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan 
https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [22:47:12] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [22:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:23] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:08] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [22:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:16] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:05] !log restart cp1087 [23:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:12] RECOVERY - Host cp1087 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [23:07:40] RECOVERY - Aggregate IPsec Tunnel Status codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [23:10:29] 10Operations, 10Traffic: cp1087 reboot - https://phabricator.wikimedia.org/T239449 (10jijiki) [23:12:58] RECOVERY - Aggregate IPsec Tunnel Status esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [23:14:58] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1331.eqiad.wmnet', 'mw1330.eqiad.wmnet', 'mw1310.eqiad.wmnet'] ` and were **ALL** successful. 
[23:18:09] 10Operations, 10Traffic: cp1087 reboot - https://phabricator.wikimedia.org/T239449 (10Volans) It might be another occurrence of T238305 (model matches) [23:21:14] !log reimage mw1329.eqiad.wmnet [23:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:51] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1329.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201911282321_jiji_18834.log`. [23:38:47] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1313.eqiad.wmnet', 'mw1309.eqiad.wmnet'] ` and were **ALL** successful. [23:42:32] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1316.eqiad.wmnet', 'mw1315.eqiad.wmnet'] ` and were **ALL** successful. [23:43:54] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [23:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:02] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log