[00:04:19] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01127 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [00:29:37] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1021011952 and 210 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:29:37] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 43934504 and 210 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:29:37] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 313025656 and 210 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:32:29] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 274850136 and 381 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:33:13] PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3005713888 and 426 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:33:13] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 963033032 and 426 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:33:49] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1599119760 and 461 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:34:37] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 50784056 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:39:57] 
PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 250497040 and 20 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:42:05] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1778280 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:42:25] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1416768 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:42:25] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 828488 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:45:41] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 369926520 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:47:41] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 520213624 and 488 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:47:45] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 41258176 and 492 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:48:09] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 23528 and 72 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:49:19] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 174940824 and 586 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:49:33] RECOVERY - Postgres Replication Lag on maps1006 is OK: 
POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 11088 and 155 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:50:39] RECOVERY - Postgres Replication Lag on maps1009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 488 and 220 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:51:11] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 58816 and 254 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:51:59] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 67440 and 302 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:54:17] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 209052792 and 15 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:55:07] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 21562024 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:55:11] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 36939632 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:57:03] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 58553888 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:58:43] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 451303744 and 28 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:59:13] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 
43520928 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:59:31] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 832376 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:00:09] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1550000 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:02:33] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 966872 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:03:39] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1493024 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:04:11] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1082672 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:06:59] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 288636512 and 15 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:07:37] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 43811856 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:09:37] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 524437488 and 22 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:10:03] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 856530272 and 46 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring 
[01:10:39] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 119452152 and 41 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:11:59] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 99584 and 114 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:12:07] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 50264 and 121 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:12:31] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 15136 and 146 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:12:37] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 37368 and 151 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:13:09] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 45752 and 183 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [06:45:16] Operations, ops-codfw: Degraded RAID on logstash2022 - https://phabricator.wikimedia.org/T269552 (ops-monitoring-bot) [07:14:00] (PS3) Ammarpad: Remove unsupported arg in MediaWiki::doPostOutputShutdown() call [mediawiki-config] - https://gerrit.wikimedia.org/r/645222 [08:00:47] (CR) Muehlenhoff: "If we used a null-routed email address for all those years, I think that means it's safe to not set this at all in the Apache config? 
noc@" [puppet] - https://gerrit.wikimedia.org/r/645431 (https://phabricator.wikimedia.org/T251005) (owner: Dzahn) [08:02:52] (CR) Muehlenhoff: [C: +2] Add IDP service definition for RT [puppet] - https://gerrit.wikimedia.org/r/645306 (owner: Muehlenhoff) [08:04:44] !log swift eqiad-prod: add weight to ms-be106[0-3] - T268435 [08:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:53] T268435: Add ms-be106[0-3] to swift - https://phabricator.wikimedia.org/T268435 [08:21:41] Operations, ops-eqiad: sdg1 failed on ms-be1054 - https://phabricator.wikimedia.org/T269556 (fgiunchedi) [08:47:04] !log add 300G to prometheus global (eqiad) [08:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:13] (CR) Ladsgroup: "The language looks RTL to me, we should add it there otherwise we will have explosions like the last time (and its wikipedia too)" [mediawiki-config] - https://gerrit.wikimedia.org/r/643526 (https://phabricator.wikimedia.org/T268448) (owner: Urbanecm) [08:50:11] (PS1) DCausse: [wdqs] proper selector for machines running the streaming-updater (take 2) [puppet] - https://gerrit.wikimedia.org/r/646621 (https://phabricator.wikimedia.org/T266986) [08:52:09] PROBLEM - HP RAID on ms-be1054 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:2, 2I:2:3, 2I:2:4, 3I:3:1, 3I:3:2, 3I:3:3, 3I:3:4, 4I:5:1, 4I:5:2 - Failed: 2I:2:1 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [08:55:02] known ^ [09:02:31] !log bounce apache2 on prometheus1003 [09:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:01] Operations: Integrate Buster 10.7 point update - https://phabricator.wikimedia.org/T269558 (MoritzMuehlenhoff) [09:18:59] (CR) DCausse: "pcc output: 
https://puppet-compiler.wmflabs.org/compiler1003/26983/" [puppet] - https://gerrit.wikimedia.org/r/646621 (https://phabricator.wikimedia.org/T266986) (owner: DCausse) [09:27:03] Operations, observability: Increased icinga check latency since 05/12 - https://phabricator.wikimedia.org/T269560 (fgiunchedi) [09:30:07] (CR) Muehlenhoff: [C: +2] Enable CAS for RT [puppet] - https://gerrit.wikimedia.org/r/645334 (owner: Muehlenhoff) [09:30:20] the ms-be2 hosts down alerts are me, not yet in production [09:32:59] PROBLEM - Host ms-be2060 is DOWN: PING CRITICAL - Packet loss = 100% [09:33:19] RECOVERY - Host ms-be2060 is UP: PING OK - Packet loss = 0%, RTA = 31.81 ms [09:38:43] Operations, ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T269409 (fgiunchedi) [09:38:46] Operations, ops-eqiad, SRE-swift-storage: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870 (fgiunchedi) [09:39:12] (PS1) DCausse: [wdqs] re-enable polling kafka for updates on wdqs1010 [puppet] - https://gerrit.wikimedia.org/r/646631 (https://phabricator.wikimedia.org/T267175) [09:39:14] (PS1) DCausse: Revert "wdqs: use RecentChanges API for updates on all WDQS servers" [puppet] - https://gerrit.wikimedia.org/r/646632 (https://phabricator.wikimedia.org/T267175) [09:39:39] PROBLEM - Host ms-be2060 is DOWN: PING CRITICAL - Packet loss = 100% [09:40:43] RECOVERY - Host ms-be2060 is UP: PING OK - Packet loss = 0%, RTA = 31.88 ms [09:44:20] Operations, SRE-tools, observability: HP RAID failed on ms-be1054 didn't open a task - https://phabricator.wikimedia.org/T269563 (fgiunchedi) [09:52:14] (CR) Filippo Giunchedi: "LGTM modulo what Arzhel said." 
[puppet] - https://gerrit.wikimedia.org/r/645181 (https://phabricator.wikimedia.org/T268806) (owner: Cwhite) [09:52:46] (CR) Filippo Giunchedi: [C: +1] profile: add dot_expander filter script [puppet] - https://gerrit.wikimedia.org/r/645459 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite) [09:53:30] (CR) Filippo Giunchedi: [C: +1] profile: make a logstash templates directory and relocate existing templates [puppet] - https://gerrit.wikimedia.org/r/645200 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite) [09:56:27] (PS1) Kormat: int_cont [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/646634 [09:57:13] Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (LSobanski) [10:04:11] Operations, observability: Increased icinga check latency since 05/12 - https://phabricator.wikimedia.org/T269560 (fgiunchedi) [10:16:38] (PS3) DCausse: [cirrus] cleanup mediasearch commons A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/634991 [10:16:40] (PS2) DCausse: [cirrus] flip activation of MLR rescore window using supported_syntax [mediawiki-config] - https://gerrit.wikimedia.org/r/634992 [10:16:42] (PS2) DCausse: [cirrus] A/B test perfield build on spaceless languages [mediawiki-config] - https://gerrit.wikimedia.org/r/635313 (https://phabricator.wikimedia.org/T266027) [10:45:32] PROBLEM - Host db1139 is DOWN: PING CRITICAL - Packet loss = 100% [11:05:58] (PS1) Effie Mouzeli: WIP: define redis version on buster [puppet] - https://gerrit.wikimedia.org/r/646638 [11:06:25] (CR) jerkins-bot: [V: -1] WIP: define redis version on buster [puppet] - https://gerrit.wikimedia.org/r/646638 (owner: Effie Mouzeli) [11:15:10] (CR) Gmodena: [C: +1] "LGTM." 
[deployment-charts] - https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: Hnowlan) [11:25:22] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/646642 (https://phabricator.wikimedia.org/T128546) [11:30:04] jan_drewniak: (Dis)respected human, time to deploy Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201207T1130). Please do the needful. [11:30:51] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/646642 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [11:31:35] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/646642 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [11:34:45] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:646642| Bumping portals to master (T128546)]] (duration: 01m 40s) [11:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:55] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [11:35:52] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:646642| Bumping portals to master (T128546)]] (duration: 01m 06s) [11:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:35] (PS1) Alexandros Kosiaris: profile::kubernetes::node: Remove old redundant code [puppet] - https://gerrit.wikimedia.org/r/646645 [11:44:37] (PS1) Alexandros Kosiaris: k8s::node: Split staging cluster hieras [puppet] - https://gerrit.wikimedia.org/r/646646 [11:52:05] (PS1) Kosta Harlan: linkrecommendation: Add helmfile.d config [deployment-charts] - https://gerrit.wikimedia.org/r/646649 [11:52:24] (PS2) Kosta Harlan: linkrecommendation: Add helmfile.d 
config [deployment-charts] - https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) [11:53:49] (CR) jerkins-bot: [V: -1] linkrecommendation: Add helmfile.d config [deployment-charts] - https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan) [11:56:20] (CR) Kosta Harlan: "Notes/open questions:" [deployment-charts] - https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan) [11:57:05] (PS2) Alexandros Kosiaris: profile::kubernetes::node: Remove old redundant code [puppet] - https://gerrit.wikimedia.org/r/646645 [11:57:12] (PS2) Alexandros Kosiaris: k8s::node: Split staging cluster hieras [puppet] - https://gerrit.wikimedia.org/r/646646 [12:00:44] (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26985/console" [puppet] - https://gerrit.wikimedia.org/r/646646 (owner: Alexandros Kosiaris) [12:03:10] no ping from jouncebot? [12:03:21] jouncebot: now [12:03:21] For the next 0 hour(s) and 56 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201207T1200) [12:03:35] Operations, Analytics, SRE-Access-Requests: Kerberos Password - https://phabricator.wikimedia.org/T269472 (ssingh) a:ssingh [12:03:55] o/ [12:04:02] Operations, Analytics, SRE-Access-Requests: Kerberos Password - https://phabricator.wikimedia.org/T269472 (ssingh) (Additional context: T267314). 
[12:04:15] (PS1) Ssingh: admin: enable kerberos for swagoel [puppet] - https://gerrit.wikimedia.org/r/646652 (https://phabricator.wikimedia.org/T269472) [12:04:19] !log installing Linux 4.19.160 updates from Buster point release (initially only package updates, no reboots yet) [12:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:19] (CR) Muehlenhoff: [C: +1] admin: enable kerberos for swagoel [puppet] - https://gerrit.wikimedia.org/r/646652 (https://phabricator.wikimedia.org/T269472) (owner: Ssingh) [12:05:31] (CR) Kosta Harlan: "I've looked at the lint error but I'm not sure what to do; do I need to update charts/linkrecommendation/templates/service.yaml ?" [deployment-charts] - https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan) [12:05:39] Ammarpad: are you here? [12:06:25] (CR) Ssingh: [C: +2] admin: enable kerberos for swagoel [puppet] - https://gerrit.wikimedia.org/r/646652 (https://phabricator.wikimedia.org/T269472) (owner: Ssingh) [12:07:12] Lucas_WMDE: dcausse: according to my clock, it's B.&C time, but I don't recall any ping [12:07:16] is there sth for me to do? :D [12:07:25] Urbanecm: yes, I was also confused [12:07:40] I can start deploying dcausse’s changes if Ammarpad isn’t around yet [12:07:43] or do you want to do it yourself dcausse? [12:08:22] Lucas_WMDE: either way :) [12:08:46] then it’s probably easier if you do it? 
:) [12:08:57] sure, deploying then :) [12:08:58] and you can test it as well [12:09:21] dcausse: please ping me once you're done :) [12:09:39] (CR) DCausse: [C: +2] [cirrus] cleanup mediasearch commons A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/634991 (owner: DCausse) [12:09:49] Operations, Analytics, SRE-Access-Requests, Patch-For-Review: Kerberos Password - https://phabricator.wikimedia.org/T269472 (ssingh) Open→Resolved Hi @Swagoel: You should have received an email with the Kerberos password. Please let us know if there are any issues, thanks! [12:10:16] (PS2) Effie Mouzeli: redis: define redis version on buster [puppet] - https://gerrit.wikimedia.org/r/646638 (https://phabricator.wikimedia.org/T265643) [12:10:53] (CR) jerkins-bot: [V: -1] redis: define redis version on buster [puppet] - https://gerrit.wikimedia.org/r/646638 (https://phabricator.wikimedia.org/T265643) (owner: Effie Mouzeli) [12:11:02] (Merged) jenkins-bot: [cirrus] cleanup mediasearch commons A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/634991 (owner: DCausse) [12:11:38] PROBLEM - Check systemd state on ms-be1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:12:08] (CR) Lucas Werkmeister (WMDE): [C: +1] Assign urlshortener-create-url permission [mediawiki-config] - https://gerrit.wikimedia.org/r/645309 (https://phabricator.wikimedia.org/T229633) (owner: Ammarpad) [12:12:34] (PS1) Muehlenhoff: Stop installing apt-transport-https on Buster [puppet] - https://gerrit.wikimedia.org/r/646654 [12:15:06] (CR) Lucas Werkmeister (WMDE): [C: +1] Remove unsupported arg in MediaWiki::doPostOutputShutdown() call [mediawiki-config] - https://gerrit.wikimedia.org/r/645222 (owner: Ammarpad) [12:15:34] (PS3) Effie Mouzeli: redis: define redis version on buster [puppet] - https://gerrit.wikimedia.org/r/646638 (https://phabricator.wikimedia.org/T265643) [12:15:48] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [cirrus] cleanup mediasearch commons A/B test (duration: 01m 06s) [12:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:39] (CR) DCausse: [C: +2] [cirrus] flip activation of MLR rescore window using supported_syntax [mediawiki-config] - https://gerrit.wikimedia.org/r/634992 (owner: DCausse) [12:17:47] (Merged) jenkins-bot: [cirrus] flip activation of MLR rescore window using supported_syntax [mediawiki-config] - https://gerrit.wikimedia.org/r/634992 (owner: DCausse) [12:18:33] (PS1) Kosta Harlan: linkrecommendation: Add private config for DB write user [deployment-charts] - https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T265893) [12:20:55] (PS2) Kosta Harlan: linkrecommendation: Add private config for DB write user [deployment-charts] - https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) [12:21:12] (PS4) Effie Mouzeli: redis: define redis version on buster [puppet] - https://gerrit.wikimedia.org/r/646638 (https://phabricator.wikimedia.org/T265643) [12:21:28] !log 
dcausse@deploy1001 Synchronized wmf-config/CirrusSearch-common.php: [cirrus] flip activation of MLR rescore window using supported_syntax (duration: 01m 06s) [12:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:02] (PS3) DCausse: [cirrus] A/B test perfield build on spaceless languages [mediawiki-config] - https://gerrit.wikimedia.org/r/635313 (https://phabricator.wikimedia.org/T266027) [12:24:28] (CR) DCausse: [C: +2] [cirrus] A/B test perfield build on spaceless languages [mediawiki-config] - https://gerrit.wikimedia.org/r/635313 (https://phabricator.wikimedia.org/T266027) (owner: DCausse) [12:25:17] (Merged) jenkins-bot: [cirrus] A/B test perfield build on spaceless languages [mediawiki-config] - https://gerrit.wikimedia.org/r/635313 (https://phabricator.wikimedia.org/T266027) (owner: DCausse) [12:25:59] jouncebot: no_justification [12:26:01] grr [12:26:03] sorry [12:26:05] jouncebot: now [12:26:05] For the next 0 hour(s) and 33 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201207T1200) [12:26:11] jouncebot: next [12:26:11] In 5 hour(s) and 33 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201207T1800) [12:26:42] (PS2) Alexandros Kosiaris: recommendation-api: Switch to service_setup [puppet] - https://gerrit.wikimedia.org/r/641938 (https://phabricator.wikimedia.org/T241230) [12:26:44] (PS2) Alexandros Kosiaris: recommendation-api: Cleanups [puppet] - https://gerrit.wikimedia.org/r/641939 (https://phabricator.wikimedia.org/T241230) [12:26:52] !log rollout scap 3.16.0-1 to canaries [12:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:02] !log rollout scap 3.16.0-1 to canaries - T268634 [12:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:10] T268634: Deploy Scap version 3.16.0-1 - 
https://phabricator.wikimedia.org/T268634 [12:27:23] effie: ehm...I'm sure you know what you're doing, but is it a good idea to deploy scap when it's being actively used? :-) [12:28:15] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T266027: [cirrus] A/B test perfield build on spaceless languages (duration: 01m 07s) [12:28:21] Urbanecm: the process is to deploy scap to canaries (usually on a monday) [12:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:23] T266027: Test perfield_builder on spaceless languages - https://phabricator.wikimedia.org/T266027 [12:29:26] (PS3) Alexandros Kosiaris: recommendation-api: Cleanups [puppet] - https://gerrit.wikimedia.org/r/641939 (https://phabricator.wikimedia.org/T241230) [12:29:26] Urbanecm: I would wait for dcauss.e to finish up :p [12:29:52] Urbanecm, Lucas_WMDE, effie I'm done :) [12:30:04] effie: I just noticed your !_log statements, and thought that means scap changes at the real mw hosts :-) [12:30:12] effie: would you mind me scapping something now? :-) [12:30:23] still no news from Ammarpad apparently [12:30:34] would you like to use the super new version of scap though Urbanecm ? [12:30:51] effie: well I can try it if you wish :) [12:30:52] if you give me 10', you will [12:30:55] sure [12:30:59] (CR) JMeybohm: "> 1. 
I copied the files from https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/645076 and then references values.yaml for li" [deployment-charts] - https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan) [12:31:02] that would be lovely, thank you [12:31:03] effie: ping me once ready then :) [12:31:06] sure tx [12:32:04] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 831681344 and 345 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:33:08] PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 51314816 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:33:08] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 100861200 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:33:16] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 18008128 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:33:18] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 257212944 and 15 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:33:28] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 20039384 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:35:40] RECOVERY - Postgres Replication Lag on maps1009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1714376 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:35:50] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB 
template1 (host:localhost) 2042064 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:37:18] (CR) JMeybohm: "> Patch Set 2:" [deployment-charts] - https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan) [12:38:16] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 576570712 and 45 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:38:24] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 253994320 and 19 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:38:36] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 75016928 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:39:06] Urbanecm: cool, go ahead [12:39:16] let me know if something is odd [12:39:31] sure [12:39:43] (PS2) Urbanecm: Assign urlshortener-create-url permission [mediawiki-config] - https://gerrit.wikimedia.org/r/645309 (https://phabricator.wikimedia.org/T229633) (owner: Ammarpad) [12:40:01] (CR) Urbanecm: [C: +2] Assign urlshortener-create-url permission [mediawiki-config] - https://gerrit.wikimedia.org/r/645309 (https://phabricator.wikimedia.org/T229633) (owner: Ammarpad) [12:40:16] I'll deploy it anyway, I feel confident enough [12:40:42] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1091321904 and 54 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:40:57] (Merged) jenkins-bot: Assign urlshortener-create-url permission [mediawiki-config] - https://gerrit.wikimedia.org/r/645309 (https://phabricator.wikimedia.org/T229633) (owner: Ammarpad) [12:41:14] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: 
POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 36108400 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:41:14] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 90171080 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:41:26] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 67550320 and 27 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:41:50] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 544343168 and 37 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:42:08] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1368540504 and 441 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:42:16] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 5240 and 73 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:42:40] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1208687112 and 74 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:43:12] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 83136 and 129 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:43:20] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 286680240 and 18 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:43:22] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 
(host:localhost) 39384 and 138 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:43:30] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 74616 and 146 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:43:40] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 307216 and 157 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:43:58] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 38512 and 175 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:44:36] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 88170680 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:45:46] okay, syncing the patch, but it needs a follow-up. Doing both [12:46:22] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 431375520 and 23 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:47:16] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: ee1e40061ac4d52a90f0d44c08f1665aed83a618: Assign urlshortener-create-url permission (T229633) (duration: 01m 06s) [12:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:25] T229633: Convert use of $wgUrlShortenerReadOnly to urlshortener-create-url in production - https://phabricator.wikimedia.org/T229633 [12:48:15] (03CR) 10Ayounsi: [C: 03+2] Netbox scripts, set commit_default = False [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/645131 (owner: 10Ayounsi) [12:49:39] (03CR) 10Alexandros Kosiaris: "> Patch Set 5:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: 
10Mstyles) [12:49:47] (03CR) 10JMeybohm: [C: 04-1] "PCC fails with" [puppet] - 10https://gerrit.wikimedia.org/r/644545 (owner: 10Hnowlan) [12:49:53] (03PS1) 10Urbanecm: Revoke urlshortener-create-url from sysops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646661 (https://phabricator.wikimedia.org/T229633) [12:50:03] (03CR) 10Urbanecm: [C: 03+2] Revoke urlshortener-create-url from sysops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646661 (https://phabricator.wikimedia.org/T229633) (owner: 10Urbanecm) [12:50:22] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 22272 and 39 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:50:54] (03Merged) 10jenkins-bot: Revoke urlshortener-create-url from sysops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646661 (https://phabricator.wikimedia.org/T229633) (owner: 10Urbanecm) [12:51:30] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 36440 and 107 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:51:30] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 36440 and 107 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:51:36] good, now it works [12:52:08] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 15648 and 144 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:52:20] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 36088 and 157 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:52:26] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 4224 and 162 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring 
[12:53:02] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 5691a397a9de05deddea94318dc6fa6c59c44833: Revoke urlshortener-create-url from sysops (T229633) (duration: 01m 06s) [12:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:10] T229633: Convert use of $wgUrlShortenerReadOnly to urlshortener-create-url in production - https://phabricator.wikimedia.org/T229633 [12:53:36] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 37920 and 234 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:56:04] (03PS1) 10Matthias Mullie: Add global to indicate that elastic LTR features are available [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646663 [12:56:55] * Urbanecm is done [12:57:01] effie: it seems it worked as expected - thanks! [12:57:10] !log EU B&C window done [12:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:54] RECOVERY - Check systemd state on ms-be1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:58:54] cheers [13:02:01] (03PS2) 10Hnowlan: maps: fix typo in postgres command, retry 5 times before alerting [puppet] - 10https://gerrit.wikimedia.org/r/644545 [13:04:23] (03CR) 10Alexandros Kosiaris: [C: 03+1] "This LGTM, I 've left a couple of very minor comments around though." 
(033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: 10Mstyles) [13:06:00] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26993/console" [puppet] - 10https://gerrit.wikimedia.org/r/644545 (owner: 10Hnowlan) [13:07:09] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10JMeybohm) [13:11:15] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/644545 (owner: 10Hnowlan) [13:11:26] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10JMeybohm) [13:12:12] jouncebot: now [13:12:12] No deployments scheduled for the next 4 hour(s) and 47 minute(s) [13:12:13] !log Upgrading Jenkins 2.252 > 2.263.1 on contint2001 / contint1001 [13:12:14] 10Operations, 10serviceops, 10Kubernetes, 10Patch-For-Review, 10Release Pipeline (Blubber): Move blubberoid to use TLS only. 
- https://phabricator.wikimedia.org/T236017 (10JMeybohm) 05Open→03Resolved [13:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:33] (03CR) 10Effie Mouzeli: [C: 04-1] "This is not working as it should" [puppet] - 10https://gerrit.wikimedia.org/r/646638 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli) [13:12:40] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10JMeybohm) [13:12:48] (03CR) 10Filippo Giunchedi: [C: 03+1] Stop installing apt-transport-https on Buster [puppet] - 10https://gerrit.wikimedia.org/r/646654 (owner: 10Muehlenhoff) [13:13:00] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10JMeybohm) [13:13:29] 10Operations, 10serviceops, 10Kubernetes, 10Patch-For-Review, 10Release Pipeline (Blubber): Move blubberoid to use TLS only. - https://phabricator.wikimedia.org/T236017 (10JMeybohm) 05Resolved→03Open [13:15:46] (03CR) 10JMeybohm: [C: 03+2] Remove calico::builder [puppet] - 10https://gerrit.wikimedia.org/r/645078 (https://phabricator.wikimedia.org/T266893) (owner: 10JMeybohm) [13:17:34] (03CR) 10JMeybohm: [C: 03+2] Don't ship any default config files with the packages [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/642011 (owner: 10JMeybohm) [13:18:23] (03PS3) 10Kosta Harlan: linkrecommendation: Add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) [13:18:31] !log kartik@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'apertium' for release 'production' . [13:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:15] ^ That's me. No effect on Production (since traffic isn't switched yet). 
[13:20:43] !log kartik@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'production' . [13:20:47] (03PS2) 10Kormat: int_cont [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634 [13:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:22] (03PS4) 10JMeybohm: admin_ng: Generalization, prod values anf fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/644787 (https://phabricator.wikimedia.org/T268434) [13:22:31] (03CR) 10jerkins-bot: [V: 04-1] int_cont [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634 (owner: 10Kormat) [13:23:38] !log Stopping CI Jenkins for upgrade [13:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:06] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:26] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 65365680 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:35:58] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 792 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:37:49] !Deployed apertium service to eqiad and codfw (T255672) [13:37:50] T255672: Migrate apertium to the deployment pipeline - https://phabricator.wikimedia.org/T255672 [13:38:24] !log Deployed apertium service to eqiad and codfw (T255672) [13:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:32] (oops!) 
[13:40:14] (03PS2) 10Ppchelko: Remove wgParserCacheUseJson setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644317 (https://phabricator.wikimedia.org/T263579) [13:48:12] Pchelolo: should we get the ParserOutput revert in 1.36.0-wmf.20 ? ( https://gerrit.wikimedia.org/r/c/mediawiki/core/+/645157 ) [13:48:27] hashar: that will fix wmf.20, yes [13:48:46] Pchelolo: I am cherry picking it / +2ing and pulling on deployment server [13:49:01] hashar: there's a cherry-pick already [13:49:07] ah great [13:49:12] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:12] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/645312 [13:49:34] (03CR) 10Hashar: [C: 03+2] Revert "Hard-deprecate all public property access on CacheTime and ParserOutput." [core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/645312 (https://phabricator.wikimedia.org/T269396) (owner: 10Daniel Kinzler) [13:49:57] I should have done those preparation tasks on friday [13:50:14] sorry for this mess.. the more tests we add for this serialization stuff the deeper the problems with it go.. Will wait till fully migrated to JSON to continue messing with it [13:50:23] 2 UBNs is 2-too-many [13:54:15] Pchelolo: well as we say in french, "you can not cook an omelette without breaking eggs" [13:54:20] 10Operations: Integrate Buster 10.7 point update - https://phabricator.wikimedia.org/T269558 (10MoritzMuehlenhoff) [13:54:27] 10Operations: Integrate Buster 10.7 point update - https://phabricator.wikimedia.org/T269558 (10MoritzMuehlenhoff) p:05Triage→03Medium [13:54:32] and my understanding is that switching to json serialization aims at addressing that exact problem [13:54:53] which has hit us more than a few times in the last few years ;] so I really welcome the refactoring effort on that front! 
[13:55:55] (03PS3) 10Alexandros Kosiaris: recommendation-api: Switch to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/641938 (https://phabricator.wikimedia.org/T241230) [13:55:57] (03PS4) 10Alexandros Kosiaris: recommendation-api: Cleanups [puppet] - 10https://gerrit.wikimedia.org/r/641939 (https://phabricator.wikimedia.org/T241230) [13:55:59] (03PS1) 10Alexandros Kosiaris: apertium: Add kubernetes as backend for traffic [puppet] - 10https://gerrit.wikimedia.org/r/646673 [13:57:19] hashar: yeah, everything is being written in JSON now, just need to wait out the remaining php serialized items [13:57:36] \o/ [13:58:23] (03PS3) 10Kormat: int_cont [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634 [14:01:38] (03CR) 10jerkins-bot: [V: 04-1] int_cont [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634 (owner: 10Kormat) [14:04:30] (03PS4) 10Kormat: int_cont [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634 [14:05:20] 10Operations: Integrate Buster 10.7 point update - https://phabricator.wikimedia.org/T269558 (10MoritzMuehlenhoff) [14:16:39] (03PS1) 10Ppchelko: Enable OldRevisionParserCache on labs and group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646679 (https://phabricator.wikimedia.org/T268075) [14:17:59] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/645444 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [14:18:37] (03PS1) 10Filippo Giunchedi: alertmanager: add custom email template [puppet] - 10https://gerrit.wikimedia.org/r/646680 (https://phabricator.wikimedia.org/T267018) [14:19:23] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/645445 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [14:20:55] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/645446 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [14:21:28] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" 
[puppet] - 10https://gerrit.wikimedia.org/r/645443 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [14:22:32] (03Merged) 10jenkins-bot: Revert "Hard-deprecate all public property access on CacheTime and ParserOutput." [core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/645312 (https://phabricator.wikimedia.org/T269396) (owner: 10Daniel Kinzler) [14:22:55] (03PS4) 10Urbanecm: Initial configuration for skrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643524 (https://phabricator.wikimedia.org/T268410) [14:23:51] (03PS1) 10Vlad.shapik: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into T269152-oauth-2-0-refresh-tokens-expire-after-1-minute Change-Id: I5b09898062babb919245973f7a77b2e51b76e684 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646682 [14:24:42] (03Abandoned) 10Vlad.shapik: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into T269152-oauth-2-0-refresh-tokens-expire-after-1-minute Change-Id: I5b09898062babb919245973f7a77b2e51b76e684 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646682 (owner: 10Vlad.shapik) [14:25:13] Pchelolo: it merged. 
I am deploying it on upgrading the cluster [14:25:54] (03PS5) 10Urbanecm: Initial configuration for skrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643524 (https://phabricator.wikimedia.org/T268410) [14:25:58] (03CR) 10Ottomata: kafka: make alerts not critical for test cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645398 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [14:26:14] (03PS4) 10Urbanecm: Initial configuration for skrwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643526 (https://phabricator.wikimedia.org/T268448) [14:26:21] hashar: no way to force-test it in prod unfortunately, all new renders I create will be serialized with JSON and the problem will not occur [14:27:13] (03PS2) 10Vlad.shapik: CommonSettings: OAuth 2.0 refresh tokens expire after 1 minute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645308 (https://phabricator.wikimedia.org/T269152) [14:27:19] (03CR) 10Urbanecm: "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643526 (https://phabricator.wikimedia.org/T268448) (owner: 10Urbanecm) [14:27:59] (03PS5) 10Urbanecm: Initial configuration for skrwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643526 (https://phabricator.wikimedia.org/T268448) [14:29:11] (03PS6) 10Urbanecm: Initial configuration for skrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643524 (https://phabricator.wikimedia.org/T268410) [14:29:21] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for skrwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643526 (https://phabricator.wikimedia.org/T268448) (owner: 10Urbanecm) [14:30:36] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for skrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643524 (https://phabricator.wikimedia.org/T268410) (owner: 10Urbanecm) [14:31:06] (03CR) 10Vlad.shapik: "Have a look at new settings, please." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/645308 (https://phabricator.wikimedia.org/T269152) (owner: 10Vlad.shapik) [14:33:30] Pchelolo: thanks ;) [14:33:56] !log hashar@deploy1001 Synchronized php-1.36.0-wmf.20/includes: Applying https://gerrit.wikimedia.org/r/c/mediawiki/core/+/645312 T2569396 (duration: 01m 15s) [14:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:46] (03PS1) 10Hashar: group1 wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646686 [14:34:48] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646686 (owner: 10Hashar) [14:34:57] promoting promoting [14:36:04] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646686 (owner: 10Hashar) [14:37:51] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.20 [14:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:14] (03PS1) 10Alexandros Kosiaris: mobileapps: Remove the nontls release [deployment-charts] - 10https://gerrit.wikimedia.org/r/646687 [14:38:16] (03PS1) 10Alexandros Kosiaris: apertium: Add a TLS enabled release [deployment-charts] - 10https://gerrit.wikimedia.org/r/646688 (https://phabricator.wikimedia.org/T255672) [14:38:58] !log hashar@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.20 (duration: 01m 06s) [14:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:49] (03CR) 10jerkins-bot: [V: 04-1] apertium: Add a TLS enabled release [deployment-charts] - 10https://gerrit.wikimedia.org/r/646688 (https://phabricator.wikimedia.org/T255672) (owner: 10Alexandros Kosiaris) [14:40:24] (03CR) 10Alexandros Kosiaris: [C: 03+2] mobileapps: Remove the nontls release [deployment-charts] - 10https://gerrit.wikimedia.org/r/646687 (owner: 10Alexandros Kosiaris) [14:41:51] (03Merged) 
10jenkins-bot: mobileapps: Remove the nontls release [deployment-charts] - 10https://gerrit.wikimedia.org/r/646687 (owner: 10Alexandros Kosiaris) [14:42:47] (03PS1) 10Matthias Mullie: Remove license map from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646690 (https://phabricator.wikimedia.org/T257938) [14:43:04] (03CR) 10Matthias Mullie: [C: 04-1] "DNM until after I496b67785f766fe72088b16461448653d2a6ffa7 is deployed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646690 (https://phabricator.wikimedia.org/T257938) (owner: 10Matthias Mullie) [14:43:44] (03PS1) 10Hashar: group0 wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646692 [14:43:46] (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646692 (owner: 10Hashar) [14:43:51] stupid script ... [14:43:54] :/ [14:44:40] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646692 (owner: 10Hashar) [14:45:36] (03CR) 10Urbanecm: [V: 03+2] ""expected" failure, the patch is correct, see T269589 for a long term fix." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/643526 (https://phabricator.wikimedia.org/T268448) (owner: 10Urbanecm) [14:46:29] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.20 [14:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:13] (03PS1) 10Hashar: all wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646693 [14:47:15] (03CR) 10Hashar: [C: 03+2] all wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646693 (owner: 10Hashar) [14:48:28] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646693 (owner: 10Hashar) [14:50:15] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.20 [14:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:12] PROBLEM - puppet last run on parse2012 is CRITICAL: CRITICAL: Puppet last ran 4 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:02:22] RECOVERY - puppet last run on parse2012 is OK: OK: Puppet is currently enabled, last run 8 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:02:46] (03PS7) 10Urbanecm: Initial configuration for skrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643524 (https://phabricator.wikimedia.org/T268410) [15:03:11] (03PS8) 10Urbanecm: Initial configuration for skrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643524 (https://phabricator.wikimedia.org/T268410) [15:06:23] (03PS6) 10Urbanecm: Initial configuration for skrwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643526 (https://phabricator.wikimedia.org/T268448) [15:08:58] (03PS1) 10Urbanecm: Add skrwiki and skrwiktionary to rtl.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646732 (https://phabricator.wikimedia.org/T268448) [15:10:21] (03CR) 
10jerkins-bot: [V: 04-1] Add skrwiki and skrwiktionary to rtl.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646732 (https://phabricator.wikimedia.org/T268448) (owner: 10Urbanecm) [15:10:52] (03CR) 10Urbanecm: [V: 03+2] "this failure is expected, and caused by T269589" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646732 (https://phabricator.wikimedia.org/T268448) (owner: 10Urbanecm) [15:18:27] (03PS1) 10Urbanecm: Initial configuration for eowikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646734 (https://phabricator.wikimedia.org/T269426) [15:22:38] (03CR) 10Urbanecm: [C: 04-1] "disable local uploads" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646734 (https://phabricator.wikimedia.org/T269426) (owner: 10Urbanecm) [15:25:55] (03PS1) 10Urbanecm: Initial configuration for wawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646736 (https://phabricator.wikimedia.org/T269431) [15:27:14] (03PS2) 10Urbanecm: Initial configuration for eowikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646734 (https://phabricator.wikimedia.org/T269426) [15:36:39] (03CR) 10Ottomata: [C: 03+1] admin: add jakob to analytics groups (wmde) [puppet] - 10https://gerrit.wikimedia.org/r/645372 (https://phabricator.wikimedia.org/T269444) (owner: 10Ssingh) [15:37:02] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] maps: fix typo in postgres command, retry 5 times before alerting [puppet] - 10https://gerrit.wikimedia.org/r/644545 (owner: 10Hnowlan) [15:37:43] (03PS5) 10JMeybohm: calico: Add support for calico 3.x with kubernetes datastore [puppet] - 10https://gerrit.wikimedia.org/r/645417 (https://phabricator.wikimedia.org/T267653) [15:37:45] (03PS1) 10JMeybohm: calico: Remove calico/data [puppet] - 10https://gerrit.wikimedia.org/r/646740 (https://phabricator.wikimedia.org/T267653) [15:38:50] (03PS2) 10Ssingh: admin: add jakob to analytics groups (wmde) [puppet] - 10https://gerrit.wikimedia.org/r/645372 
(https://phabricator.wikimedia.org/T269444) [15:39:10] PROBLEM - Check systemd state on ms-be1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:34] (03CR) 10Ssingh: [C: 03+2] admin: add jakob to analytics groups (wmde) [puppet] - 10https://gerrit.wikimedia.org/r/645372 (https://phabricator.wikimedia.org/T269444) (owner: 10Ssingh) [15:41:35] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Data for Jakob_WMDE - https://phabricator.wikimedia.org/T269444 (10ssingh) 05Open→03Resolved Hi @Jakob_WMDE: This request has been merged and you should have received an email with the Kerberos password. Please let us... [15:41:53] !log installing vips security updates [15:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:17] (03PS6) 10Hnowlan: similarusers: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) [15:54:19] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:57] 10Operations: Integrate Buster 10.7 point update - https://phabricator.wikimedia.org/T269558 (10MoritzMuehlenhoff) [15:59:20] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) [15:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:22] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [16:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:19] RECOVERY - Check systemd state on ms-be1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:16] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime [16:04:17] !log sukhe@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) [16:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:21] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) [16:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:43] !log updated buster installation image to 10.7 T269558 [16:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:50] T269558: Integrate Buster 10.7 point update - https://phabricator.wikimedia.org/T269558 [16:16:52] 10Operations, 10observability: Increased icinga check latency since 05/12 - https://phabricator.wikimedia.org/T269560 (10lmata) a:03colewhite [16:20:52] 10Operations, 10Discovery-Search, 10Elasticsearch: Port elasticsearch support scripts to cookbooks - https://phabricator.wikimedia.org/T269218 (10CBogen) p:05Triage→03Medium [16:21:43] 10Operations, 10Discovery-Search, 10Elasticsearch: Port elasticsearch support scripts to cookbooks - https://phabricator.wikimedia.org/T269218 (10Gehel) Note that cookbooks can only be run by SREs, while es-tool can be run by anyone with access to the elastisearch servers. At least functionalities that might... [16:28:47] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (10Gehel) @Cmjohnson did you receive any news from Dell. 
[16:33:43] 10Operations, 10Analytics-Clusters: Backport kafkacat 1.6.0 from bullseye to buster-backports or buster-wikimedia - https://phabricator.wikimedia.org/T268936 (10Ottomata) a:03klausman [16:33:52] 10Operations, 10Analytics-Clusters: Backport kafkacat 1.6.0 from bullseye to buster-backports or buster-wikimedia - https://phabricator.wikimedia.org/T268936 (10Ottomata) p:05Triage→03Medium [16:35:05] (03PS1) 10Effie Mouzeli: hiera: remove shard17 from redis, reimage mc1035/mc2035 to buster [puppet] - 10https://gerrit.wikimedia.org/r/646751 (https://phabricator.wikimedia.org/T213089) [16:35:26] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [16:35:26] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [16:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:58] (03CR) 10jerkins-bot: [V: 04-1] hiera: remove shard17 from redis, reimage mc1035/mc2035 to buster [puppet] - 10https://gerrit.wikimedia.org/r/646751 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [16:38:13] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Migrate WDQS to Debian Buster - https://phabricator.wikimedia.org/T244753 (10CBogen) [16:38:32] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime [16:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:31] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:40:33] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Set up a statsv-like endpoint for Prometheus - https://phabricator.wikimedia.org/T180105 (10colewhite) StatsV metrics are in Prometheus now. I think what's left to do is to update the dashboards to use Thanos. 
[16:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:42] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [16:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:49] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10colewhite) [16:41:51] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10colewhite) [16:42:13] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 3 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10jijiki) Given that the total size of our redis cluster is now ~2G ( as discussed in T252391#6647730) we can upgrade a few more memc... [16:42:40] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:04] !log deployment-cache-text06: downgrade varnish to 5.2.1-1wm1 T264398 [16:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:11] T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 [16:43:26] (03PS2) 10Effie Mouzeli: hiera: remove shard17 from redis, reimage mc1035/mc2035 to buster [puppet] - 10https://gerrit.wikimedia.org/r/646751 (https://phabricator.wikimedia.org/T213089) [16:44:55] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [16:44:55] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[16:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:48] 10Operations, 10Release-Engineering-Team-TODO, 10Scap, 10Release-Engineering-Team (Deployment services): Scap: Standardize git version - https://phabricator.wikimedia.org/T179353 (10thcipriani) [16:48:20] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), 10git-protocol-v2: Upgrade git fleet wide to git 2.20 - https://phabricator.wikimedia.org/T262244 (10thcipriani) [16:48:43] (03PS3) 10Effie Mouzeli: hiera: remove shard17 from redis, reimage mc1035/mc2035 to buster [puppet] - 10https://gerrit.wikimedia.org/r/646751 (https://phabricator.wikimedia.org/T213089) [16:48:59] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) It took a while to package and rebuild libvmod-re2, libvmod-netmapper, varnish-modules, and varnishkafka against varnish 5.2.1. Long sto... [16:57:19] 10Operations, 10Domains, 10Traffic: URL to redirect to upcoming Wikipedia Birthday page on wikimediafoundation.org - https://phabricator.wikimedia.org/T264367 (10hdothiduc) @Dzahn Great, let's do it! Let me and/or @Varnent know if you need anything! 
[16:58:36] (03PS2) 10Alexandros Kosiaris: apertium: Add a TLS enabled release [deployment-charts] - 10https://gerrit.wikimedia.org/r/646688 (https://phabricator.wikimedia.org/T255672) [16:58:38] (03PS1) 10Alexandros Kosiaris: Remove values-nontls.yaml for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/646754 [17:00:00] (03CR) 10jerkins-bot: [V: 04-1] apertium: Add a TLS enabled release [deployment-charts] - 10https://gerrit.wikimedia.org/r/646688 (https://phabricator.wikimedia.org/T255672) (owner: 10Alexandros Kosiaris) [17:05:48] (03CR) 10Alexandros Kosiaris: [C: 03+2] calico: Remove calico/data [puppet] - 10https://gerrit.wikimedia.org/r/646740 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [17:06:42] (03CR) 10Alexandros Kosiaris: [C: 03+2] recommendation-api: Switch to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/641938 (https://phabricator.wikimedia.org/T241230) (owner: 10Alexandros Kosiaris) [17:13:41] (03CR) 10RLazarus: [C: 03+1] hiera: remove shard17 from redis, reimage mc1035/mc2035 to buster [puppet] - 10https://gerrit.wikimedia.org/r/646751 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [17:18:18] !log disable puppet on mc1035, mc2035 for 646751 [17:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:53] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.37:9632]) https://wikitech.wikimedia.org/wiki/PyBal [17:25:14] (03CR) 10Effie Mouzeli: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26996/console" [puppet] - 10https://gerrit.wikimedia.org/r/646751 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [17:26:06] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.37:9632]) https://wikitech.wikimedia.org/wiki/PyBal [17:26:41] akosiaris: is that you 
maybe? [17:26:57] cleaning up rec-api yes [17:27:03] ack [17:27:04] (03PS1) 10Razzi: kafka: configure kafka test broker [puppet] - 10https://gerrit.wikimedia.org/r/646757 (https://phabricator.wikimedia.org/T268202) [17:35:42] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:37:00] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:37:31] !log cleanup the old recommendation-api non TLS LVS service [17:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:51] 10Operations: Integrate Buster 10.7 point update - https://phabricator.wikimedia.org/T269558 (10MoritzMuehlenhoff) [17:47:31] jouncebot: now [17:47:31] No deployments scheduled for the next 0 hour(s) and 12 minute(s) [17:47:34] jouncebot: next [17:47:34] In 0 hour(s) and 12 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201207T1800) [17:49:22] 10Operations, 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T269610 (10Michael) [17:49:24] !log urbanecm@deploy1001 Synchronized private/PrivateSettings.php: Update T250887 mitigrations (duration: 01m 01s) [17:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:43] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Data for Michael Große (WMDE) - https://phabricator.wikimedia.org/T269610 (10Michael) [17:52:06] (03CR) 10Herron: [C: 03+1] "LGTM overall (in agreement with the previous comments) thanks for this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/645181 (https://phabricator.wikimedia.org/T268806) (owner: 10Cwhite) [17:54:25] !log urbanecm@deploy1001 Synchronized private/PrivateSettings.php: Update T250887 mitigrations (duration: 00m 59s) [17:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:00] (03CR) 10Muehlenhoff: "The $version_override also needs to be passed to the profile:.redis::multi_instance define, then it should work." [puppet] - 10https://gerrit.wikimedia.org/r/646638 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli) [17:56:26] (03PS4) 10MSantos: WIP: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) [17:57:22] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt10[25-30] connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10Andrew) cloudvirt1027 and 1028 are still showing firmware version 21.40.20.00. I can't easily check 1029 or 1030 since they're o... 
[17:57:35] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt10[25-30] connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10Andrew) 05Resolved→03Open [17:57:38] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Hardware): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [17:58:00] (03CR) 10jerkins-bot: [V: 04-1] WIP: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [17:58:02] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+2] hiera: remove shard17 from redis, reimage mc1035/mc2035 to buster [puppet] - 10https://gerrit.wikimedia.org/r/646751 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [17:59:03] (03PS5) 10MSantos: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) [18:00:04] ryankemper: Your horoscope predicts another unfortunate Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201207T1800). 
[18:00:51] (03CR) 10jerkins-bot: [V: 04-1] start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [18:02:50] !log urbanecm@deploy1001 Synchronized private/PrivateSettings.php: Update T250887 mitigrations (duration: 00m 58s) [18:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:22] (03CR) 10Herron: [C: 03+1] profile: add dot_expander filter script [puppet] - 10https://gerrit.wikimedia.org/r/645459 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [18:10:35] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc2035.codfw.wmnet ` The log... [18:11:57] (03PS6) 10MSantos: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) [18:12:13] !log urbanecm@deploy1001 Synchronized private/PrivateSettings.php: Update T250887 mitigrations (duration: 00m 57s) [18:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:22] (03CR) 10Herron: [C: 03+1] alertmanager: add custom email template [puppet] - 10https://gerrit.wikimedia.org/r/646680 (https://phabricator.wikimedia.org/T267018) (owner: 10Filippo Giunchedi) [18:13:08] (03CR) 10Ottomata: [C: 03+1] kafka: configure kafka test broker [puppet] - 10https://gerrit.wikimedia.org/r/646757 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [18:13:23] (03CR) 10Ottomata: [C: 03+1] "You'll need to include the mirror profile in the role too, right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/646757 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [18:13:26] (03CR) 10jerkins-bot: [V: 04-1] start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [18:14:33] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Data for Michael Große (WMDE) - https://phabricator.wikimedia.org/T269610 (10Ottomata) APPROVED. Michael will need to be in either the `wmf` or `nda` ldap groups, and should also be given a Kerberos principal. [18:16:19] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Data for Michael Große (WMDE) - https://phabricator.wikimedia.org/T269610 (10Dzahn) wmde users are normally not added to the "wmf" group but to the "wmde" and "nda" groups instead [18:16:42] (03CR) 10Razzi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/646757 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [18:17:33] (03CR) 10Herron: [C: 03+1] "LGTM, optional nit inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645200 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [18:17:38] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Data for Michael Große (WMDE) - https://phabricator.wikimedia.org/T269610 (10WMDE-leszek) I approve this request on WMDE end as well. 
[18:19:40] PROBLEM - Disk space on Hadoop worker on an-worker1101 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [18:20:08] (03CR) 10Ottomata: [C: 03+1] "Ah, k" [puppet] - 10https://gerrit.wikimedia.org/r/646757 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [18:22:00] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Migrate WDQS to Debian Buster - https://phabricator.wikimedia.org/T244753 (10CBogen) [18:22:30] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Migrate WDQS to Debian Buster - https://phabricator.wikimedia.org/T244753 (10CBogen) a:03RKemper [18:25:04] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2035.codfw.wmnet'] ` Of which those **FAILED**: ` ['mc2035.codfw.wmnet'... 
[18:25:49] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/26997/kraz.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/645443 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [18:26:01] (03PS2) 10Dzahn: mw_rc_irc: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/645443 (https://phabricator.wikimedia.org/T266479) [18:26:31] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [18:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:58] (03CR) 10Dzahn: "noop on kraz.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/645443 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [18:28:33] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:02] PROBLEM - Disk space on Hadoop worker on an-worker1101 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [18:29:21] (03CR) 10Dzahn: "somewhat expecting this discussion is why I wanted to separate the code change that makes it possible to change it from the actual change " [puppet] - 10https://gerrit.wikimedia.org/r/645431 (https://phabricator.wikimedia.org/T251005) (owner: 10Dzahn) [18:34:28] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc2035.codfw.wmnet ` The log... 
[18:34:31] (03PS2) 10Nray: Remove Growth Study Screener Quick Survey Config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645425 (https://phabricator.wikimedia.org/T269369) [18:34:53] (03PS3) 10Nray: Remove Growth Study Screener Quick Survey Config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645425 (https://phabricator.wikimedia.org/T269369) [18:36:12] PROBLEM - Aggregate IPsec Tunnel Status eqiad on alert1001 is CRITICAL: instance=mc1035 site=eqiad tunnel=mc2035_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [18:37:29] (03CR) 10Dzahn: "one of the classes is on puppetmasters, the other on everything https://puppet-compiler.wmflabs.org/compiler1002/26998/puppetmaster2001.co" [puppet] - 10https://gerrit.wikimedia.org/r/645444 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [18:37:51] (03PS2) 10Nray: Disable QuickSurvey tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644873 (https://phabricator.wikimedia.org/T269053) (owner: 10Jdlrobson) [18:37:53] (03CR) 10Bstorm: [C: 03+1] haproxy: reduce log persistence to 14 days [puppet] - 10https://gerrit.wikimedia.org/r/645096 (https://phabricator.wikimedia.org/T269252) (owner: 10Andrew Bogott) [18:40:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Hardware): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [18:41:14] PROBLEM - Check systemd state on ms-be1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:59] PROBLEM - Host mc2035 is DOWN: PING CRITICAL - Packet loss = 100% [18:48:01] !log ryankemper@mwmaint1002 conftool action : set/pooled=yes:weight=10; selector: service=wdqs-internal,name=wdqs1011.eqiad.wmnet [18:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:51] (03PS1) 10Andrew Bogott: Move some project backups to cloudvirt1025 [puppet] - 10https://gerrit.wikimedia.org/r/646797 (https://phabricator.wikimedia.org/T260692) [18:49:53] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [18:49:57] !log ryankemper@mwmaint1002 conftool action : set/pooled=yes:weight=10; selector: name=wdqs1011.eqiad.wmnet [18:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:09] (03CR) 10Nray: [C: 03+1] Disable QuickSurvey tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644873 (https://phabricator.wikimedia.org/T269053) (owner: 10Jdlrobson) [18:50:11] !log T246345 Brought new `wdqs-internal` node `wdqs1011` into service: `sudo confctl select 'name=wdqs1011.eqiad.wmnet' set/pooled=yes:weight=10` [18:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:18] T246345: Service implementation on wdqs101[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T246345 [18:51:04] (03CR) 10Andrew Bogott: [C: 03+2] Move some project backups to cloudvirt1025 [puppet] - 10https://gerrit.wikimedia.org/r/646797 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [18:51:35] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [18:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:44] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [18:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:50] !log 
ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [18:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:56] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [18:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:01] RECOVERY - Host mc2035 is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms [18:52:18] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:54] !log systemctl restart icinga on alert1001 T269560 [18:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:01] T269560: Increased icinga check latency since 05/12 - https://phabricator.wikimedia.org/T269560 [18:54:17] PROBLEM - Check systemd state on ms-be1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:55:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt10[25-30] connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10RobH) cloudvirt1027: ` Broadcom Adv. Dual 10Gb Ethernet - BC:97:E1:A7:0B:0C 21.40.20.00 Broadcom Adv. Dual 10Gb Ethernet - BC:9... 
[18:57:04] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [18:57:08] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [18:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:15] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [18:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:21] RECOVERY - Check systemd state on ms-be1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:28] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [18:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:47] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/26999/" [puppet] - 10https://gerrit.wikimedia.org/r/645444 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [18:59:56] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2035.codfw.wmnet'] ` and were **ALL** successful. [19:00:03] PROBLEM - Aggregate IPsec Tunnel Status eqiad on alert1001 is CRITICAL: instance=mc1035 site=eqiad tunnel=mc2035_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [19:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201207T1900). [19:00:04] nray and Pchelolo: A patch you scheduled for Morning backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:14] o/ here and ready [19:00:56] I can deploy today! [19:01:24] (03CR) 10Urbanecm: [C: 03+2] Disable QuickSurvey tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644873 (https://phabricator.wikimedia.org/T269053) (owner: 10Jdlrobson) [19:01:38] (03PS3) 10Urbanecm: labs: Disable QuickSurvey tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644873 (https://phabricator.wikimedia.org/T269053) (owner: 10Jdlrobson) [19:01:47] (03CR) 10Urbanecm: [C: 03+2] labs: Disable QuickSurvey tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644873 (https://phabricator.wikimedia.org/T269053) (owner: 10Jdlrobson) [19:02:16] nray: the labs-only patch will be deployed automatically within 30 minutes (feel free to ping me/create a task if it doesn't happen) [19:02:28] Urbanecm: cool sounds good [19:02:34] (03Merged) 10jenkins-bot: labs: Disable QuickSurvey tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644873 (https://phabricator.wikimedia.org/T269053) (owner: 10Jdlrobson) [19:02:39] (03CR) 10Urbanecm: [C: 03+2] Remove Growth Study Screener Quick Survey Config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645425 (https://phabricator.wikimedia.org/T269369) (owner: 10Nray) [19:03:00] nray: for the production patch, are you able to test it at mwdebug? [19:03:10] yes [19:03:17] good, I'll ping you once it's ready then :) [19:03:38] (03Merged) 10jenkins-bot: Remove Growth Study Screener Quick Survey Config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645425 (https://phabricator.wikimedia.org/T269369) (owner: 10Nray) [19:04:22] nray: your patch is available at mwdebug1001, please test, and let me know :) [19:04:59] thank you, testing now! [19:06:29] Urbanecm: looks great, you can proceed! [19:06:36] nray: thanks, syncing! 
[19:07:01] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt10[25-30] connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10RobH) I am no longer certain I did any of these right, so I'm now logging into the entire group and rechecking: cloudvirt1027... [19:08:07] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 3ba3af905251badeb546c17f996f0860a69024a1: Remove Growth Study Screener Quick Survey Config (T269369) (duration: 01m 02s) [19:08:12] nray: should be done [19:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:16] T269369: Turn off Growth quicksurvey on enwiki - https://phabricator.wikimedia.org/T269369 [19:08:31] awesome, thank you for your help Urbanecm ! [19:08:37] no problem! [19:08:57] Pchelolo: the floor is yours [19:09:44] thank you. [19:10:17] I'll need quite some time, so if anyone needs to do anything first? [19:11:25] (03PS2) 10Ppchelko: Enable OldRevisionParserCache on labs and group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646679 (https://phabricator.wikimedia.org/T268075) [19:11:35] (03CR) 10Ppchelko: [C: 03+2] Enable OldRevisionParserCache on labs and group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646679 (https://phabricator.wikimedia.org/T268075) (owner: 10Ppchelko) [19:12:49] (03Merged) 10jenkins-bot: Enable OldRevisionParserCache on labs and group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646679 (https://phabricator.wikimedia.org/T268075) (owner: 10Ppchelko) [19:13:17] RECOVERY - Disk space on Hadoop worker on an-worker1101 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [19:14:57] Pchelolo: nope :) [19:19:48] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: gerrit:646679 Enable OldRevisionParserCache on labs and group0 (duration: 01m 00s) [19:19:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:03] !log ppchelko@deploy1001 Synchronized wmf-config/CommonSettings.php: gerrit:646679 Enable OldRevisionParserCache on labs and group0, CS.php (duration: 00m 59s) [19:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:05] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/27001/deploy2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/645442 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:23:57] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc1035.eqiad.wmnet ` The log... [19:24:00] (03CR) 10Dzahn: "noop on deploy1001 - deploy1002 is buster and has unrelated scap issue" [puppet] - 10https://gerrit.wikimedia.org/r/645442 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:24:45] RECOVERY - Aggregate IPsec Tunnel Status eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [19:27:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt10[25-30] connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10RobH) cloudvirt10[25-30] firmware re-check. It seems, depending on which file you download, it updates the 1G or the 10G, but no... [19:29:44] we are replacing require_package with ensure_packages. some changes like the one to the ipmi module are applied on everything. 
Puppet compiled, though, on one host for each role to check for any dependency-related issues, and the change actually removes the requirement for packages to exist before everything else [19:32:53] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/27000/gerrit1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/645445 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:33:38] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt10[25-30] connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10RobH) a:05RobH→03Andrew @andrew reassigning this back to you so you are aware of the discrepancy on cloudvirt102[56]. since... [19:36:29] 10Operations, 10ops-codfw: Degraded RAID on logstash2022 - https://phabricator.wikimedia.org/T269552 (10wiki_willy) a:03Papaul [19:36:31] (03CR) 10Dzahn: "noop in prod" [puppet] - 10https://gerrit.wikimedia.org/r/645445 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:37:30] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [19:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:44] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Data for Michael Große (WMDE) - https://phabricator.wikimedia.org/T269610 (10ssingh) a:03ssingh [19:38:10] !log T269204 reimaging the following instances to debian buster => `eqiad public`:`wdqs1006`, `codfw public`:`wdqs2003`, `codfw internal`:`wdqs2006`, `test`:`wdqs1009` [19:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:17] T269204: Some wdqs metrics changed when switching to python3 - https://phabricator.wikimedia.org/T269204 [19:39:31] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:47] (03PS2) 10Dzahn: apt::repository: require_package -> ensure_packages
[puppet] - 10https://gerrit.wikimedia.org/r/645446 (https://phabricator.wikimedia.org/T266479) [19:40:34] 10Operations, 10ops-codfw, 10DC-Ops: codfw: Netbox Errors - https://phabricator.wikimedia.org/T269621 (10wiki_willy) [19:46:55] (03PS1) 10Ssingh: admin: upgrade migr from ldap_only_users to shell access for analytics [puppet] - 10https://gerrit.wikimedia.org/r/646815 (https://phabricator.wikimedia.org/T269610) [19:47:33] PROBLEM - ensure kvm processes are running on cloudvirt1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:47:39] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Data for Michael Große (WMDE) - https://phabricator.wikimedia.org/T269610 (10ssingh) [19:48:31] PROBLEM - Check systemd state on ms-be1032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:48:32] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc1035.eqiad.wmnet'] ` and were **ALL** successful. [19:48:33] PROBLEM - very high load average likely xfs on ms-be1032 is CRITICAL: CRITICAL - load average: 157.49, 209.54, 145.52 https://wikitech.wikimedia.org/wiki/Swift [19:49:33] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 andrew bogott canary VM is taking a long time to start! 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:51:20] (03PS6) 10CRusnov: ganeti-netbox-sync: Add post-sync PuppetDB import where necessary [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/645212 (https://phabricator.wikimedia.org/T263768) [19:52:07] (03CR) 10jerkins-bot: [V: 04-1] ganeti-netbox-sync: Add post-sync PuppetDB import where necessary [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/645212 (https://phabricator.wikimedia.org/T263768) (owner: 10CRusnov) [19:52:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,blazegraph} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:53:02] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [19:53:33] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime [19:53:35] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime [19:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:48] (03PS7) 10CRusnov: ganeti-netbox-sync: Add post-sync PuppetDB import where necessary [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/645212 (https://phabricator.wikimedia.org/T263768) [19:53:50] (03PS1) 10Razzi: kafka: allow accessing kafka-jumbo from kafka-test [puppet] - 10https://gerrit.wikimedia.org/r/646819 (https://phabricator.wikimedia.org/T268202) [19:54:00] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [19:55:33] !log ryankemper@cumin2001 START - Cookbook sre.hosts.downtime 
[19:55:34] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:39] !log ryankemper@cumin2001 START - Cookbook sre.hosts.downtime [19:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:21] PROBLEM - MD RAID on ms-be1032 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:57:22] ACKNOWLEDGEMENT - MD RAID on ms-be1032 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T269624 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:57:25] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T269624 (10ops-monitoring-bot) [19:57:36] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:03] 10Operations, 10ops-codfw, 10DC-Ops: codfw: Netbox Errors - https://phabricator.wikimedia.org/T269621 (10RobH) 05Open→03Resolved Updated the google sheet, errors cleared. [19:58:35] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:00:39] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:58] PROBLEM - Disk space on ms-be1032 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdb4 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1032&var-datasource=eqiad+prometheus/ops [20:01:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:04:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:05:40] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01064 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:06:34] PROBLEM - Host wdqs2003 is DOWN: PING CRITICAL - Packet loss = 100% [20:07:06] RECOVERY - Host wdqs2003 is UP: PING OK - Packet loss = 0%, RTA = 33.37 ms [20:07:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:07:55] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/27003/" [puppet] - https://gerrit.wikimedia.org/r/645446 (https://phabricator.wikimedia.org/T266479) (owner: Dzahn) [20:09:32] RECOVERY - Prometheus jobs reduced
availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:15:11] RECOVERY - ensure kvm processes are running on cloudvirt1027 is OK: PROCS OK: 2 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:15:18] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:15:45] taking a look which timer failed there ^ [20:16:15] !log mwmaint1002 - mediawiki_job_wikidata-updateQueryServiceLag job failed to run [20:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:19] ryankemper: I guess it makes sense that "job_wikidata-updateQueryServiceLag" could not run during current work [20:17:24] RECOVERY - very high load average likely xfs on ms-be1032 is OK: OK - load average: 9.64, 8.26, 28.93 https://wikitech.wikimedia.org/wiki/Swift [20:19:46] mutante: yeah, I'm not familiar with how the job works specifically but that would make sense [20:20:42] if it's supposed to alert at <50% availability then that might be a bit unexpected because only one node in each `dc x [internal, external]` is being re-imaged at a time [20:21:12] So for codfw for example there's one codfw wdqs-internal host that would be unable to report and one codfw external wdqs host [20:21:48] ryankemper: no, this case is not about availability, it's "one of the mediawiki maintenance 'crons' (that are now systemd timers) failed to run on the maintenance servers [20:22:24] (CR) Ottomata: [C: +1] kafka: allow accessing kafka-jumbo from kafka-test [puppet] - https://gerrit.wikimedia.org/r/646819 (https://phabricator.wikimedia.org/T268202) (owner: Razzi) [20:26:03] ryankemper: what happens is: "maintenance job
tries to update what the current lag is like.. tries to get lag data from prometheus and that fails. now since it's a systemd timer and not a cron it means it's a failed service which then turns into an Icinga alert about "systemd state is bad on a mwmaint server". and nothing clears it.. but this job runs every minute.. so it's more alert than [20:26:09] is appropriate [20:26:35] let me just clear that failed service and wait a minute [20:27:14] !log mwmaint1002 - systemctl reset-failed to clear icinga alert [20:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:26] (CR) Razzi: [C: +2] kafka: allow accessing kafka-jumbo from kafka-test [puppet] - https://gerrit.wikimedia.org/r/646819 (https://phabricator.wikimedia.org/T268202) (owner: Razzi) [20:30:50] (CR) Razzi: [C: +2] kafka: configure kafka test broker [puppet] - https://gerrit.wikimedia.org/r/646757 (https://phabricator.wikimedia.org/T268202) (owner: Razzi) [20:35:29] (PS1) Effie Mouzeli: hiera: tune memcached 1.5x hosts [puppet] - https://gerrit.wikimedia.org/r/646829 [20:37:40] (CR) Dzahn: [V: +1] "seems this is only used here https://openstack-browser.toolforge.org/puppetclass/role::dnsbox ?" [puppet] - https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn) [20:39:28] (CR) Dzahn: [V: +1] "cloud instance has pre-broken puppet?
https://puppet-compiler.wmflabs.org/compiler1002/27006/traffic-dnsbox.traffic.eqiad.wmflabs/change.t" [puppet] - https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn) [20:40:55] (PS9) Dzahn: puppetmaster: replace cron to remove old reports with systemd timer [puppet] - https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138) [20:42:50] (CR) RLazarus: [C: +1] hiera: tune memcached 1.5x hosts [puppet] - https://gerrit.wikimedia.org/r/646829 (owner: Effie Mouzeli) [20:43:51] (PS2) Effie Mouzeli: hiera: tune memcached 1.5 hosts [puppet] - https://gerrit.wikimedia.org/r/646829 [20:45:09] (CR) Effie Mouzeli: [C: +2] hiera: tune memcached 1.5 hosts [puppet] - https://gerrit.wikimedia.org/r/646829 (owner: Effie Mouzeli) [20:47:57] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Hardware): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (Andrew) [20:50:08] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Hardware): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (Andrew) [20:50:42] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): cloudvirt10[25-30] connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (Andrew) Open→Resolved looks good!
[20:58:39] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [20:58:41] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [20:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:48] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [20:58:49] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [20:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:57] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [20:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:04] chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201207T2100). [21:01:01] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:03] (CR) Dzahn: "This one also switches from "puppet" to "root" user but that is not something you or I did on purpose?"
[puppet] - https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn) [21:04:25] (PS1) Urbanecm: Enable ArticlePlaceholder at papwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/646831 (https://phabricator.wikimedia.org/T223693) [21:05:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_updater site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:13:28] (PS10) Dzahn: puppetmaster: replace cron to remove old reports with systemd timer [puppet] - https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138) [21:32:24] Operations, DNS, Traffic: dns repository left in a broken state - https://phabricator.wikimedia.org/T263518 (Dzahn) >>! In T263518#6482600, @Volans wrote: > The erros seems to be caused by the lack of the entry related to `releases` in the `discovery-states` file in gdnsd configuration. The reverte... [21:40:40] (CR) Dzahn: [C: +1] "looks good to me. migr is existing LDAP user in wmde/nda and upgrading to shell user" [puppet] - https://gerrit.wikimedia.org/r/646815 (https://phabricator.wikimedia.org/T269610) (owner: Ssingh) [21:41:07] Operations, observability: Increased icinga check latency since 05/12 - https://phabricator.wikimedia.org/T269560 (colewhite) Starting around 1400 UTC today, average check latency has been dropping steadily. # Puppet changes prior to and around that time do not correlate with symptoms (increasing check... [21:42:31] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to Analytics Data for Michael Große (WMDE) - https://phabricator.wikimedia.org/T269610 (Dzahn) >>! In T269610#6674098, @Dzahn wrote: > wmde users are normally not added to the "wmf" group but to the "wmde" and "nda" groups instead...
[21:48:34] Operations, Domains, Traffic, Patch-For-Review: Change of nameservers for Wikimedia.org.tr - https://phabricator.wikimedia.org/T259792 (CRoslof) The change to the nameservers should now be completed. I submitted the nameserver change request on November 16, but it took a while for the .tr registr... [22:00:05] Reedy and sbassett: #bothumor I � Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201207T2200). [22:06:50] Operations, DNS, Traffic: dns repository left in a broken state - https://phabricator.wikimedia.org/T263518 (BBlack) There are comments at the top of the DNS repo's `utils/mock_etc/discovery-geo-resources` and `utils/mock_etc/discovery-metafo-resources` about avoiding this scenario by updating things... [22:08:10] (CR) Ssingh: [C: +2] admin: upgrade migr from ldap_only_users to shell access for analytics [puppet] - https://gerrit.wikimedia.org/r/646815 (https://phabricator.wikimedia.org/T269610) (owner: Ssingh) [22:09:16] Operations, DNS, Traffic: dns repository left in a broken state - https://phabricator.wikimedia.org/T263518 (BBlack) (I'm guessing they should probably be updated to the correct file, and also to mention that it has to be in `state: production` before deploying the DNS mock_etc part of things, but I'... [22:11:29] Operations, SRE-Access-Requests: Requesting access to Analytics Data for Michael Große (WMDE) - https://phabricator.wikimedia.org/T269610 (ssingh) Open→Resolved Thanks @Dzahn for th review and the additional context. @Michael: The request has been merged and you should have received the Kerberos...
[22:17:09] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [22:17:12] (PS10) Dzahn: wikistats: replace all cron jobs with systemd timers [puppet] - https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138) [22:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:21:22] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [22:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:39] (CR) Dzahn: [C: +2] wikistats: replace all cron jobs with systemd timers [puppet] - https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn) [22:21:47] (CR) Dzahn: [C: +2] "cloud only" [puppet] - https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn) [22:24:28] (CR) Cwhite: [C: +2] profile: make a logstash templates directory and relocate existing templates [puppet] - https://gerrit.wikimedia.org/r/645200 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite) [22:29:50] Hey all - scapping out patch for T120883 to .20 right now...
[22:30:01] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [22:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:34] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [22:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:44] !log Deployed security patch for T120883 [22:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:50] ugh, need to revert this ^ [22:36:23] ^ was just about to say that. Contribs are borked. [22:37:02] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 559 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:37:44] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.246 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [22:38:06] Scapping out old files now, apologies. [22:39:24] !log Undeployed security patch for T120883 as it caused several errors [22:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:13] The T120883 issues should be fixed now - apologies for the explosions. [22:42:22] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 12.56 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [22:45:19] does syncing a patch cause read only mode while the sync is in progress? 
https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Read_only_mode [22:46:22] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:47:15] DannyS712: I'm not caught up on T120883 at all, but in general, no [22:48:06] !log Re-deployed security patch for T120883 (v2) [22:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:59] rzl I just got read only when trying to save on enwiki, so if its not the re-deployment ^ then what could have caused it? [22:50:42] DannyS712: can you paste the error message you got, please? [22:51:24] oops, already closed it, will try and find it though [22:51:35] (PS1) Dzahn: wikistats: double-escape $ and % in systemd timer commandline [puppet] - https://gerrit.wikimedia.org/r/646857 [22:51:41] edit rate overall looks unaffected, so I don't think we're seeing anything widespread [22:52:50] (are or were, I should say -- it didn't dip after the bad deploy either) [22:53:36] (PS2) Dzahn: wikistats: double-escape $ and % in systemd timer commandline [puppet] - https://gerrit.wikimedia.org/r/646857 [22:54:45] (CR) Dzahn: [C: +2] "https://github.com/systemd/systemd/issues/2146" [puppet] - https://gerrit.wikimedia.org/r/646857 (owner: Dzahn) [22:55:14] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f8aecc2b4e0: Failed to establish a new connection: [Errno 111] Connection [22:55:14] ://wikitech.wikimedia.org/wiki/Search%23Administration [22:56:16]
PROBLEM - logstash JSON linesTCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [22:56:46] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2005 is OK: OK - elasticsearch status production-logstash-codfw: number_of_nodes: 6, cluster_name: production-logstash-codfw, initializing_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, active_primary_shards: 456, number_of_data_nodes: 3, unassigned_shards: 0, active_shards_percent_as_number: 100.0, status: green, delayed_unassigned_shards: [22:56:46] ng_in_queue_millis: 0, timed_out: False, relocating_shards: 0, active_shards: 862 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:57:13] can confirm I could edit my talk page on en.wp [22:57:52] RECOVERY - logstash JSON linesTCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [23:04:24] PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [23:04:46] PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [23:06:00] RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [23:06:20] RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [23:07:03] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [23:07:09] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [23:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:15] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:16] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [23:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:28] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [23:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:37] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [23:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:46] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [23:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:56] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [23:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:23] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [23:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:52] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [23:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:09] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [23:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:56] PROBLEM - SSH on logstash1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:24:40] PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [23:25:22] RECOVERY - SSH on logstash1008 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:27:54] RECOVERY - logstash syslog TCP port 
on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [23:43:04] (CR) Cwhite: [C: +1] alertmanager: add custom email template [puppet] - https://gerrit.wikimedia.org/r/646680 (https://phabricator.wikimedia.org/T267018) (owner: Filippo Giunchedi) [23:44:07] (CR) Cwhite: [C: +2] profile: add dot_expander filter script [puppet] - https://gerrit.wikimedia.org/r/645459 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite) [23:47:54] (PS1) Cicalese: Configure API Portal permissions for launch [mediawiki-config] - https://gerrit.wikimedia.org/r/646862 (https://phabricator.wikimedia.org/T267953) [23:52:10] (PS3) Cwhite: profile: identify network devices logging input [puppet] - https://gerrit.wikimedia.org/r/645181 (https://phabricator.wikimedia.org/T268806) [23:53:23] (CR) Cwhite: profile: identify network devices logging input (2 comments) [puppet] - https://gerrit.wikimedia.org/r/645181 (https://phabricator.wikimedia.org/T268806) (owner: Cwhite) [23:53:59] (CR) Cwhite: [C: +2] add version into index pattern at build time [software/ecs] - https://gerrit.wikimedia.org/r/645214 (owner: Cwhite) [23:56:29] (CR) Cwhite: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/645209 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)