[00:00:03] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to releases1002/2002 for jhuneidi, brennen - https://phabricator.wikimedia.org/T270350 (RLazarus) p:Triage→Medium
[00:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201217T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:02:19] (CR) Bstorm: "Our plan is to try to use this tomorrow unless there's something wrong with it at this point 😊" [puppet] - https://gerrit.wikimedia.org/r/647815 (https://phabricator.wikimedia.org/T266199) (owner: Bstorm)
[00:03:07] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01183 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[00:03:56] (PS6) Dzahn: parsoid: All profiles use_php=true, remove parameter [puppet] - https://gerrit.wikimedia.org/r/577043 (owner: C. Scott Ananian)
[00:05:22] (PS7) Dzahn: parsoid: All profiles use_php=true, remove parameter [puppet] - https://gerrit.wikimedia.org/r/577043 (https://phabricator.wikimedia.org/T233654) (owner: C. Scott Ananian)
[00:06:13] bstorm: looks like widespread puppet failures related to wmf-pt-kill, is that your change? I haven't dug deep yet
[00:06:21] Nope
[00:06:37] (CR) DannyS712: [C: +1] Disable CentralNotice on API portal [mediawiki-config] - https://gerrit.wikimedia.org/r/649942 (https://phabricator.wikimedia.org/T270308) (owner: Ejegg)
[00:06:39] kay thanks
[00:06:40] That's T260511
[00:06:40] T260511: Parametrize wmf-pt-kill so it can connect to different sockets - https://phabricator.wikimedia.org/T260511
[00:06:51] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:01] (CR) Dzahn: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1002/27164/" [puppet] - https://gerrit.wikimedia.org/r/577043 (https://phabricator.wikimedia.org/T233654) (owner: C. Scott Ananian)
[00:07:21] rzl: basically it's a known issue for multiinstance replica servers we are working on
[00:07:26] ahh okay
[00:07:37] sorry, I saw the icinga alert and assumed something new had happened :)
[00:07:39] Puppet still runs :)
[00:07:46] just fails as well
[00:07:52] (CR) Dzahn: [C: +2] parsoid: All profiles use_php=true, remove parameter [puppet] - https://gerrit.wikimedia.org/r/577043 (https://phabricator.wikimedia.org/T233654) (owner: C. Scott Ananian)
[00:07:53] haha honestly a mood
[00:07:59] appreciate the context
[00:08:05] np
[00:08:23] haha
[00:09:14] we are living on the edge. always just under the threshold for "widespread", so if you break just one machine you are the one who causes the alert. the other existing ones that are broken are getting away with it :)
[00:10:13] maybe there should be a way to ack individual ones for the purpose of counting for this alert
[00:10:30] yeah I only just drilled down on the dashboard -- whatever's newly failing, it's "misc" in codfw
[00:10:42] or just back to the individual alerts per machine here on IRC..imho
[00:10:57] so something *is* new, but there are enough unrelated wmf-pt-kill failures that that's what I saw first
[00:11:04] that's how it was before we unified to the "widespread" one and you saw the host names right away
[00:11:21] I'd be down with individual alerts per machine, once we have a way to aggregate them via alertmanager to reduce flooding
[00:11:26] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to releases1002/2002 for jhuneidi, brennen - https://phabricator.wikimedia.org/T270350 (thcipriani) >>! In T270350#6697324, @RLazarus wrote: > @thcipriani Based on the context in T250241 I'm guessing this already has your blessing (...
[00:11:52] so I am contributing 2 broken ones to the global count
[00:12:02] deploy[12]002 with locked scap
[00:12:12] which causes a puppet failure but .. not any real problem
[00:12:39] there were a whole bunch of cloud-* though, too , right
[00:13:28] (CR) Urbanecm: "> Patch Set 1: Code-Review-1" [mediawiki-config] - https://gerrit.wikimedia.org/r/649586 (https://phabricator.wikimedia.org/T266020) (owner: Gergő Tisza)
[00:14:17] yea, so if you follow the grafana link and then the puppetboard link
[00:14:27] clouddb* is a big percentage of it
[00:14:46] but also ..gerrit2001... eh.. let me look at that one
[00:14:48] well, whatever's newly failed is in codfw though
[00:14:51] then some mw* ?
[00:14:59] won't be mw*, it's in the "misc" cluster
[00:15:04] so not appserver or api_appserver
[00:15:12] https://puppetboard.wikimedia.org/nodes?status=failed
[00:15:20] we should use this as "notes_url"
[00:15:23] (misc cluster only because you can see that line jump up on the grafana dashboard)
[00:15:25] and fix the "missing notes url" error too
[00:15:33] yeah, good call
[00:15:43] late in the day for me but I'll file a task to do it tomorrow
[00:15:51] er, file a task now, to do it tomorrow :)
[00:16:39] perfect!. i am looking what is wrong on gerrit2001 and one of those mw*
[00:17:17] surprise.. there is nothing wrong when running puppet manually
[00:18:18] Also I removed the "use_php" parameter from parsoid servers right when that happened but none of them in the list at all
[00:19:04] random mw2373 from the puppetboard list also shows no issues when I run it
[00:19:34] gerrit2001 disappeared from puppetboard list after reloading in browser now
[00:20:03] Operations, Puppet: Add notes URL for Puppet failure alerts - https://phabricator.wikimedia.org/T270354 (RLazarus)
[00:20:20] spooky
[00:20:46] that accounted for some but not all of the graph increase on the dashboard too
[00:20:54] (CR) Dzahn: "noop on wtp1025, parse2001" [puppet] - https://gerrit.wikimedia.org/r/577043 (https://phabricator.wikimedia.org/T233654) (owner: C. Scott Ananian)
[00:21:47] hm, also I was wrong about it being all under "misc", I saw a jump there and assumed it was everything, but that was only part of it
[00:22:34] !log running puppet on mw2266, mw2370, mw2354
[00:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:25] I'm going to call it a day though -- if this is still mysterious in the morning, I'll poke around some more then
[00:23:29] yea, ^ they all disappear from the list
[00:23:33] doesn't seem to be an emergency in the meantime
[00:23:35] and there are no errors or warnings
[00:23:38] huh okay
[00:23:47] no, it seems fine
[00:23:51] that's frustrating but we can dig if it comes back
[00:24:11] ack, cya tomorrow
[00:32:06] (PS1) Dzahn: icinga/prometheus: add notes_link to puppetboard to puppet failure alerts [puppet] - https://gerrit.wikimedia.org/r/649999
[00:33:36] (CR) jerkins-bot: [V: -1] icinga/prometheus: add notes_link to puppetboard to puppet failure alerts [puppet] - https://gerrit.wikimedia.org/r/649999 (owner: Dzahn)
[00:34:16] (CR) Dzahn: "I also made an attempt to check in deployment-prep on deployment-parsoid11 but puppet is broken there for completely unrelated reasons tha" [puppet] - https://gerrit.wikimedia.org/r/577043 (https://phabricator.wikimedia.org/T233654) (owner: C. Scott Ananian)
[00:34:17] PROBLEM - Check systemd state on ms-be1043 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:37:34] (CR) Dzahn: "https://phabricator.wikimedia.org/T270355" [puppet] - https://gerrit.wikimedia.org/r/577043 (https://phabricator.wikimedia.org/T233654) (owner: C. Scott Ananian)
[00:40:35] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 6019313488 and 373 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:40:39] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 7741023024 and 856 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:40:39] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 862360064 and 54 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:42:14] (PS2) Dzahn: icinga/prometheus: add notes_link to puppetboard to puppet failure alerts [puppet] - https://gerrit.wikimedia.org/r/649999
[00:42:39] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1253983336 and 65 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:42:55] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 295791120 and 17 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:43:25] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 5782710992 and 329 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:43:25] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1853992736 and 91 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:43:25] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 8247022616 and 485 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:44:15] Operations, serviceops, Patch-For-Review: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (Dzahn) The "use_php" parameter we had for the migration has been removed from puppet code now. All instances are using it by default.
[00:49:49] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 80808 and 212 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:21] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 5488 and 245 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:52:15] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3064 and 357 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:52:29] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 145464 and 372 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:23] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 426 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:29] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 216 and 492 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:33] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 160694984 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:03] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 72912 and 585 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:33] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 13648 and 617 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:57:23] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 490743016 and 27 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:00:04] twentyafterfour: How many deployers does it take to do Phabricator update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201217T0100).
[01:01:58] !log preparing to update phabricator translations
[01:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:51] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 50176 and 137 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:05:07] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 96 and 152 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:07:03] (PS1) Legoktm: passwords: Add legoktm to Cloud VPS root [labs/private] - https://gerrit.wikimedia.org/r/650008
[01:15:15] RECOVERY - Check systemd state on ms-be1043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:19:28] (CR) Ladsgroup: "> Patch Set 5:" (1 comment) [debs/pybal] - https://gerrit.wikimedia.org/r/644041 (https://phabricator.wikimedia.org/T200319) (owner: Ladsgroup)
[01:26:02] (PS1) Legoktm: aptrepo: Add thirdparty/pyall component for buster-wikimedia [puppet] - https://gerrit.wikimedia.org/r/650010 (https://phabricator.wikimedia.org/T241195)
[01:26:45] (PS3) Bstorm: toolsdb: remove temporary replication filters [puppet] - https://gerrit.wikimedia.org/r/636469 (https://phabricator.wikimedia.org/T257274)
[01:33:45] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1019.eqiad.wmnet with reason: REIMAGE
[01:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:35:43] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1019.eqiad.wmnet with reason: REIMAGE
[01:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:51:17] (PS2) Gergő Tisza: Configure GrowthExperiments on Bangla Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/649586 (https://phabricator.wikimedia.org/T266020)
[01:51:19] (PS1) Gergő Tisza: Enable GrowthExperiments on bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/650012 (https://phabricator.wikimedia.org/T266020)
[05:29:28] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:42:44] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 65, down: 10, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:50:20] (PS1) Ryan Kemper: cirrus: bump es shard size alert thresholds [puppet] - https://gerrit.wikimedia.org/r/650021 (https://phabricator.wikimedia.org/T265908)
[05:55:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106 for cloning db1154:3311 T268742 ', diff saved to https://phabricator.wikimedia.org/P13560 and previous config saved to /var/cache/conftool/dbconfig/20201217-055556-marostegui.json
[05:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:00] T268742: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742
[05:56:26] !log Stop mysql on db1106 to clone db1154
[05:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:27] (PS1) KartikMistry: Update cxserver to 2020-12-16-164911-production [deployment-charts] - https://gerrit.wikimedia.org/r/650022 (https://phabricator.wikimedia.org/T234220)
[06:05:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[06:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:48] (PS1) Marostegui: mariadb: Decommission es1019 [puppet] - https://gerrit.wikimedia.org/r/650023 (https://phabricator.wikimedia.org/T270159)
[06:08:05] * kart_ updating cxserver. Minor fixes.
[06:10:44] (CR) Wolfgang Kandek: [C: -1] "phabbanlist.conf.erb generates the old (non working) style entries." [puppet] - https://gerrit.wikimedia.org/r/649882 (https://phabricator.wikimedia.org/T270285) (owner: Jbond)
[06:10:59] (CR) Marostegui: [C: +2] mariadb: Decommission es1019 [puppet] - https://gerrit.wikimedia.org/r/650023 (https://phabricator.wikimedia.org/T270159) (owner: Marostegui)
[06:13:06] (CR) KartikMistry: [C: +2] Update cxserver to 2020-12-16-164911-production [deployment-charts] - https://gerrit.wikimedia.org/r/650022 (https://phabricator.wikimedia.org/T234220) (owner: KartikMistry)
[06:13:17] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[06:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:33] (Merged) jenkins-bot: Update cxserver to 2020-12-16-164911-production [deployment-charts] - https://gerrit.wikimedia.org/r/650022 (https://phabricator.wikimedia.org/T234220) (owner: KartikMistry)
[06:17:31] !log kartik@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[06:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:16] !log kartik@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[06:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:31] Operations, ops-eqiad, DC-Ops, decommission-hardware: decommission es1019.eqiad.wmnet - https://phabricator.wikimedia.org/T270159 (Marostegui) a:Marostegui→wiki_willy
[06:21:57] Operations, ops-eqiad, DC-Ops, decommission-hardware: decommission es1019.eqiad.wmnet - https://phabricator.wikimedia.org/T270159 (Marostegui) Host ready for #dc-ops
[06:22:13] !log kartik@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[06:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1013 for decommissioning T268436', diff saved to https://phabricator.wikimedia.org/P13562 and previous config saved to /var/cache/conftool/dbconfig/20201217-062249-marostegui.json
[06:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:53] T268436: decommission es1013.eqiad.wmnet - https://phabricator.wikimedia.org/T268436
[06:23:53] (PS1) Marostegui: es1013: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/650024 (https://phabricator.wikimedia.org/T268436)
[06:25:46] (CR) Marostegui: [C: +2] es1013: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/650024 (https://phabricator.wikimedia.org/T268436) (owner: Marostegui)
[06:27:44] PROBLEM - ores on ores2001 is CRITICAL: connect to address 10.192.0.12 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:30:06] PROBLEM - ores on ores2009 is CRITICAL: connect to address 10.192.48.90 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:32:01] (CR) Marostegui: [C: +1] "> Patch Set 1:" [homer/public] - https://gerrit.wikimedia.org/r/649706 (https://phabricator.wikimedia.org/T270196) (owner: Elukey)
[06:35:54] RECOVERY - ores on ores2009 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.081 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:42:00] RECOVERY - ores on ores2001 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:44:33] (PS1) Marostegui: redact_sanitarium: Add db1154 to the list of sanitarium hosts [puppet] - https://gerrit.wikimedia.org/r/650029 (https://phabricator.wikimedia.org/T268742)
[06:45:16] (CR) Marostegui: [C: +2] redact_sanitarium: Add db1154 to the list of sanitarium hosts [puppet] - https://gerrit.wikimedia.org/r/650029 (https://phabricator.wikimedia.org/T268742) (owner: Marostegui)
[06:46:44] (PS1) Marostegui: redact_sanitarium: Fix typo [puppet] - https://gerrit.wikimedia.org/r/650030 (https://phabricator.wikimedia.org/T268742)
[06:47:21] (CR) Marostegui: [C: +2] redact_sanitarium: Fix typo [puppet] - https://gerrit.wikimedia.org/r/650030 (https://phabricator.wikimedia.org/T268742) (owner: Marostegui)
[06:51:48] Forgot to log earlier..
[06:52:18] !log Updated cxserver to 2020-12-16-164911-production (T234220, T234220)
[06:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:22] T234220: cxserver error prevents translation of a section: Cannot read property 'replace' of undefined - https://phabricator.wikimedia.org/T234220
[06:52:41] !log Updated cxserver to 2020-12-16-164911-production (T234220, T269437)
[06:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:44] T269437: Create Wikipedia Madurese - https://phabricator.wikimedia.org/T269437
[06:53:16] !log [wdqs deploy] All tests passing on canary instance `wdqs1003` prior to deploy
[06:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:22] !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@90f9bdd]: 0.3.56
[06:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:57] !log [wdqs deploy] Tests passing on canary instance `wdqs1003` following canary deploy, proceeding to rest of fleet
[06:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:01] !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@90f9bdd]: 0.3.56 (duration: 10m 39s)
[07:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:04] !log [wdqs-deploy] Restarting `wdqs-updater` across all instances, 4 instances at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
[07:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:38] !log [wdqs deploy] Restarting `wdqs-categories` across all test instances: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
[07:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:37] !log [wdqs deploy] Restarting `wdqs-categories` across all wdqs instances, one host at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
[07:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:13] !log [wdqs] depooled `wdqs1013` while it catches up on lag
[07:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:29] (CR) Elukey: [C: +2] Allow connections to dbproxy101[3,5]:3306 in the analytics-in4 filter [homer/public] - https://gerrit.wikimedia.org/r/649706 (https://phabricator.wikimedia.org/T270196) (owner: Elukey)
[07:08:53] !log update analytics-in4 filter on cr1/cr2-eqiad for https://gerrit.wikimedia.org/r/c/operations/homer/public/+/649706
[07:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:06] Operations, SRE-Access-Requests: Access to prod mysql from stat1004 - https://phabricator.wikimedia.org/T270196 (elukey) Change deployed! ` elukey@stat1004:~$ telnet m2-master.eqiad.wmnet 3306 Trying 10.64.0.135... Connected to dbproxy1013.eqiad.wmnet. `
[07:17:24] Operations, SRE-Access-Requests: Access to prod mysql from stat1004 - https://phabricator.wikimedia.org/T270196 (elukey) Open→Resolved a:elukey
[07:18:17] !log reboot an-airflow1001 for kernel upgrades
[07:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1082 for cloning db1154:3315 T268742 ', diff saved to https://phabricator.wikimedia.org/P13563 and previous config saved to /var/cache/conftool/dbconfig/20201217-071903-marostegui.json
[07:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:06] T268742: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742
[07:19:26] !log Stop mysql on db1082 to clone db1154
[07:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:00] (CR) DCausse: [C: +1] cirrus: bump es shard size alert thresholds [puppet] - https://gerrit.wikimedia.org/r/650021 (https://phabricator.wikimedia.org/T265908) (owner: Ryan Kemper)
[07:49:09] !log [wdqs deploy] (wdqs deploy complete)
[07:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:57] (PS1) Elukey: hive: remove analytics-replicated-hive config [puppet] - https://gerrit.wikimedia.org/r/650077 (https://phabricator.wikimedia.org/T268028)
[08:06:18] (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27166/console" [puppet] - https://gerrit.wikimedia.org/r/650077 (https://phabricator.wikimedia.org/T268028) (owner: Elukey)
[08:10:39] (CR) Filippo Giunchedi: [C: +1] icinga/prometheus: add notes_link to puppetboard to puppet failure alerts [puppet] - https://gerrit.wikimedia.org/r/649999 (owner: Dzahn)
[08:14:17] (PS1) Effie Mouzeli: hiera: upgrade mc1020, mc2020 to buster [puppet] - https://gerrit.wikimedia.org/r/650078 (https://phabricator.wikimedia.org/T213089)
[08:23:14] (CR) Filippo Giunchedi: [C: +1] "LGTM overall, would be nice to get unit tests" [puppet] - https://gerrit.wikimedia.org/r/649956 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[08:30:01] !log swift codfw-prod: more weight to ms-be20[58-61] - T269337
[08:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:05] T269337: Add ms-be20[58-61] to swift - https://phabricator.wikimedia.org/T269337
[08:48:04] Operations, MediaWiki-Docker, serviceops, User-zeljkofilipin: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (kostajh) The logs don't seem to show that much useful information. I set "debug: true" in the daem...
[08:58:37] (PS4) Awight: Add a job for some visualeditor metrics aggregation [puppet] - https://gerrit.wikimedia.org/r/649660 (https://phabricator.wikimedia.org/T262209)
[09:00:31] !log Sanitize s1 and s5 on db1154 T268742
[09:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:35] T268742: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742
[09:01:40] PROBLEM - Check systemd state on ms-be2046 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:06:12] Operations, Wikimedia-Mailing-lists: Have a regular cronjob which alerts about (potentially unadministrated) mailing lists with large moderation queues - https://phabricator.wikimedia.org/T270368 (Aklapper) p:Triage→Lowest
[09:06:21] Operations, Wikimedia-Mailing-lists: Have a regular cronjob which alerts about (potentially unadministrated) mailing lists with large moderation queues - https://phabricator.wikimedia.org/T270368 (Aklapper)
[09:07:30] (CR) DCausse: "some comments about default values and the sizing of the app" (9 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: Mstyles)
[09:09:16] Operations, SRE-Access-Requests: Requesting access to Analytics Data for toan - https://phabricator.wikimedia.org/T269678 (toan) >>! In T269678#6696244, @RLazarus wrote: > @toan You should be all set! The access group changes might take up to 30 min to roll out everywhere. Let me know if you have any tro...
[09:09:34] (PS1) KartikMistry: Enable ContentTranslation as default tool for ceb, km, mg, tg and yi WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/650085 (https://phabricator.wikimedia.org/T269113)
[09:13:11] Operations, SRE-Access-Requests: Requesting access to Analytics Data for toan - https://phabricator.wikimedia.org/T269678 (elukey) @toan please also subscribe to https://lists.wikimedia.org/mailman/listinfo/analytics-announce if you want to get maintenance alerts etc.. :)
[09:15:25] (CR) Elukey: [V: +1 C: -2] "Scheduled for Monday 21st with stat100x reboots" [puppet] - https://gerrit.wikimedia.org/r/650077 (https://phabricator.wikimedia.org/T268028) (owner: Elukey)
[09:18:24] (PS4) Jbond: phabricator: update RemoteIPInternalProxy with correct IP adrdess [puppet] - https://gerrit.wikimedia.org/r/649872 (https://phabricator.wikimedia.org/T270185)
[09:20:04] (PS6) Jbond: P:phabricator: migrate banlist to abuse-networks [puppet] - https://gerrit.wikimedia.org/r/649882 (https://phabricator.wikimedia.org/T270285)
[09:22:09] (CR) Jbond: [V: +1] "> Patch Set 5:" [puppet] - https://gerrit.wikimedia.org/r/649882 (https://phabricator.wikimedia.org/T270285) (owner: Jbond)
[09:24:25] Operations, MediaWiki-Docker, serviceops, User-zeljkofilipin: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (JMeybohm) Unfortunately still unable to reproduce, even with docker & docker-engine 20.10.1 (on li...
[09:27:21] (CR) Jbond: "> Patch Set 5: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/649882 (https://phabricator.wikimedia.org/T270285) (owner: Jbond)
[09:27:33] (PS1) Elukey: uwsgi: remove postrotate step to avoid reloads after logrotation [puppet] - https://gerrit.wikimedia.org/r/650088
[09:29:04] Operations, SRE-Access-Requests: Requesting access to Analytics Data for toan - https://phabricator.wikimedia.org/T269678 (toan) >>! In T269678#6697869, @elukey wrote: > @toan please also subscribe to https://lists.wikimedia.org/mailman/listinfo/analytics-announce if you want to get maintenance alerts et...
[09:35:13] (PS1) Marostegui: check_private_data_report: Add db1154 as sanitarium host [puppet] - https://gerrit.wikimedia.org/r/650089 (https://phabricator.wikimedia.org/T268742)
[09:35:14] RECOVERY - Check systemd state on ms-be2046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:36:06] (CR) Marostegui: [C: +2] check_private_data_report: Add db1154 as sanitarium host [puppet] - https://gerrit.wikimedia.org/r/650089 (https://phabricator.wikimedia.org/T268742) (owner: Marostegui)
[09:40:27] Operations, MediaWiki-Docker, serviceops, User-zeljkofilipin: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (kostajh) I can reproduce on Debian unstable with `docker pull docker-registry.wikimedia.org/dev/st...
[09:42:59] (PS5) Jbond: phabricator: update RemoteIPInternalProxy with correct IP address [puppet] - https://gerrit.wikimedia.org/r/649872 (https://phabricator.wikimedia.org/T270185)
[09:43:10] (CR) Catrope: [C: +1] Configure GrowthExperiments on Bangla Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/649586 (https://phabricator.wikimedia.org/T266020) (owner: Gergő Tisza)
[09:49:17] (CR) Kosta Harlan: [C: +1] Configure GrowthExperiments on Bangla Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/649586 (https://phabricator.wikimedia.org/T266020) (owner: Gergő Tisza)
[09:57:05] !log repool ats-tls on cp5011
[09:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:03] (PS6) Jbond: phabricator: update RemoteIPInternalProxy with correct IP address [puppet] - https://gerrit.wikimedia.org/r/649872 (https://phabricator.wikimedia.org/T270185)
[09:58:36] (CR) Volans: [C: +1] "In theory it should be fine with copytruncate." [puppet] - https://gerrit.wikimedia.org/r/650088 (owner: Elukey)
[10:01:16] (PS7) Jbond: phabricator: update RemoteIPInternalProxy with correct IP address [puppet] - https://gerrit.wikimedia.org/r/649872 (https://phabricator.wikimedia.org/T270185)
[10:05:56] (PS1) Volans: legoktm: remove from additional groups [puppet] - https://gerrit.wikimedia.org/r/650092
[10:10:00] (PS8) Jbond: phabricator: update RemoteIPInternalProxy with correct IP address [puppet] - https://gerrit.wikimedia.org/r/649872 (https://phabricator.wikimedia.org/T270185)
[10:10:39] (PS1) Awight: Migrate TemplateWizard to full "new" events [mediawiki-config] - https://gerrit.wikimedia.org/r/650093 (https://phabricator.wikimedia.org/T238230)
[10:10:53] (CR) Jbond: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27169/console" [puppet] - https://gerrit.wikimedia.org/r/649872 (https://phabricator.wikimedia.org/T270185) (owner: Jbond)
[10:17:19] (PS9) Jbond: phabricator: update RemoteIPInternalProxy with correct IP address [puppet] - https://gerrit.wikimedia.org/r/649872 (https://phabricator.wikimedia.org/T270185)
[10:18:23] (CR) Jbond: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27170/console" [puppet] - https://gerrit.wikimedia.org/r/649872 (https://phabricator.wikimedia.org/T270185) (owner: Jbond)
[10:20:52] (CR) Hashar: [C: +1] "Lets go for it, thank you so much for tracking the root cause and working on the fix!" [puppet] - https://gerrit.wikimedia.org/r/649872 (https://phabricator.wikimedia.org/T270185) (owner: Jbond)
[10:20:58] (CR) Jbond: [V: +1 C: +2] phabricator: update RemoteIPInternalProxy with correct IP address [puppet] - https://gerrit.wikimedia.org/r/649872 (https://phabricator.wikimedia.org/T270185) (owner: Jbond)
[10:21:35] !log updating RemoteIP on phabricator https://gerrit.wikimedia.org/r/c/operations/puppet/+/649872
[10:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:48] (PS1) Elukey: Prepare the Hadoop test cluster to be reimaged with Cloudera's CDH [puppet] - https://gerrit.wikimedia.org/r/650095 (https://phabricator.wikimedia.org/T269919)
[10:31:43] (PS7) Jbond: P:phabricator: migrate banlist to abuse-networks [puppet] - https://gerrit.wikimedia.org/r/649882 (https://phabricator.wikimedia.org/T270285)
[10:32:37] (CR) Jbond: "> Patch Set 6:" [puppet] - https://gerrit.wikimedia.org/r/649882 (https://phabricator.wikimedia.org/T270285) (owner: Jbond)
[10:35:17] Operations, MediaWiki-Docker, Traffic, serviceops, User-zeljkofilipin: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (JMeybohm) I was able to reproduce as well now. Seems like we are missing the "content...
[10:45:21] !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster for Hadoop test cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001
[10:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:52] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) for Hadoop test cluster: Stop the Hadoop cluster before maintenance.
- elukey@cumin1001 [10:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:21] 10Operations, 10MediaWiki-Docker, 10Traffic, 10serviceops, 10User-zeljkofilipin: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (10JMeybohm) p:05Low→03High Also raising priority as I guess this will affect more an... [10:53:39] (03PS1) 10Jbond: idp: add config master to idp service definition [puppet] - 10https://gerrit.wikimedia.org/r/650098 [10:54:20] (03CR) 10Jbond: [C: 03+2] idp: add config master to idp service definition [puppet] - 10https://gerrit.wikimedia.org/r/650098 (owner: 10Jbond) [10:55:01] (03PS1) 10Jbond: idp: failover to idp1001 [dns] - 10https://gerrit.wikimedia.org/r/650099 [10:56:05] (03CR) 10Jbond: [C: 03+2] idp: failover to idp1001 [dns] - 10https://gerrit.wikimedia.org/r/650099 (owner: 10Jbond) [10:58:32] (03CR) 10Elukey: [C: 03+2] Prepare the Hadoop test cluster to be reimaged with Cloudera's CDH [puppet] - 10https://gerrit.wikimedia.org/r/650095 (https://phabricator.wikimedia.org/T269919) (owner: 10Elukey) [11:00:04] mvolz: It is that lovely time of the day again! You are hereby commanded to deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201217T1100).
[11:08:11] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single [11:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:33] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:29] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1002.eqiad.wmnet with reason: REIMAGE [11:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:30] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1001.eqiad.wmnet with reason: REIMAGE [11:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:34] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1002.eqiad.wmnet with reason: REIMAGE [11:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:33] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1001.eqiad.wmnet with reason: REIMAGE [11:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:43] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1002.eqiad.wmnet with reason: REIMAGE [11:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:10] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1003.eqiad.wmnet with reason: REIMAGE [11:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:13] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1001.eqiad.wmnet with reason: REIMAGE [11:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:30] (03PS1) 10Filippo Giunchedi: idp: fix duplicate id for 
config-master [puppet] - 10https://gerrit.wikimedia.org/r/650101 (https://phabricator.wikimedia.org/T270374) [11:21:34] jbond42: ^ [11:22:42] grafana-rw broke with the duplicate id, via related task [11:22:43] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-worker1002.eqiad.wmnet with reason: REIMAGE [11:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:18] or any other souls for a quick +1 ? [11:23:34] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] idp: fix duplicate id for config-master [puppet] - 10https://gerrit.wikimedia.org/r/650101 (https://phabricator.wikimedia.org/T270374) (owner: 10Filippo Giunchedi) [11:24:12] (03CR) 10Filippo Giunchedi: [C: 03+2] idp: fix duplicate id for config-master [puppet] - 10https://gerrit.wikimedia.org/r/650101 (https://phabricator.wikimedia.org/T270374) (owner: 10Filippo Giunchedi) [11:24:31] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-worker1001.eqiad.wmnet with reason: REIMAGE [11:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:27] (03PS1) 10KartikMistry: Update cxserver to 2020-12-17-111820-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/650103 (https://phabricator.wikimedia.org/T262192) [11:26:10] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on an-test-worker1003.eqiad.wmnet with reason: REIMAGE [11:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:27] !log bounce apache2 on grafana1002 [11:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:33] * kart_ doing another minor cxserver update, last one for the year! 
[11:28:30] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2020-12-17-111820-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/650103 (https://phabricator.wikimedia.org/T262192) (owner: 10KartikMistry) [11:29:17] godog: sorry looking now [11:29:36] jbond42: np! change is merged and puppet has run on idp hosts, still getting unauthorized on grafana-rw [11:29:45] (03Merged) 10jenkins-bot: Update cxserver to 2020-12-17-111820-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/650103 (https://phabricator.wikimedia.org/T262192) (owner: 10KartikMistry) [11:29:55] ack thanks will look at grafana-rw [11:30:04] task is T270374 [11:30:04] T270374: Cannot log into grafana-rw.wikimedia.org (Application Not Authorized to Use CAS) - https://phabricator.wikimedia.org/T270374 [11:30:17] ack thanks [11:32:53] !log kartik@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [11:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:39] !log kartik@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' . [11:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:34] gotta go to lunch [11:36:19] godog: no probs, I'm on the login issue [11:36:31] !log kartik@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[11:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:41] !log Updated cxserver to 2020-12-17-111820-production (T262192) [11:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:44] T262192: Improve MT support for Tsonga with OpusMT - https://phabricator.wikimedia.org/T262192 [11:39:54] (03PS1) 10Jbond: idp: make sure we purge un managed services [puppet] - 10https://gerrit.wikimedia.org/r/650105 (https://phabricator.wikimedia.org/T270374) [11:41:34] (03CR) 10Jbond: [C: 03+2] idp: make sure we purge un managed services [puppet] - 10https://gerrit.wikimedia.org/r/650105 (https://phabricator.wikimedia.org/T270374) (owner: 10Jbond) [11:47:22] (03CR) 10Volans: "Early review as requested, see inline" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/649933 (https://phabricator.wikimedia.org/T268211) (owner: 10Jbond) [11:55:19] (03PS2) 10Matthias Mullie: Remove license map from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646690 (https://phabricator.wikimedia.org/T257938) [11:55:37] (03CR) 10Matthias Mullie: [C: 03+1] "Ready for deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646690 (https://phabricator.wikimedia.org/T257938) (owner: 10Matthias Mullie) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate European mid-day backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201217T1200). [12:00:04] matthiasmullie and kart_: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:06] o/ [12:00:28] * kart_ is here [12:02:02] I'll self-service & get started, shouldn't take long [12:02:26] (03CR) 10Matthias Mullie: [C: 03+2] Remove license map from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646690 (https://phabricator.wikimedia.org/T257938) (owner: 10Matthias Mullie) [12:02:49] (03PS1) 10Jbond: config-master: created and NDA protected directory on config master [puppet] - 10https://gerrit.wikimedia.org/r/650106 [12:03:09] (03Merged) 10jenkins-bot: Remove license map from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646690 (https://phabricator.wikimedia.org/T257938) (owner: 10Matthias Mullie) [12:03:43] musikanimal: OK. Ping me when done. [12:07:31] Oh that was for matthiasmullie :) [12:07:31] kart_: syncing ATM - want me to do yours too, or would you rather do yourself? [12:07:51] matthiasmullie: if you can deploy, that would be great. I can test them quickly. [12:07:57] !log mlitn@deploy1001 Synchronized wmf-config/SearchSettingsForSDC.php: 68ac6fa61: Media Search: Remove license map from config (duration: 01m 04s) [12:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:23] kart_: sure; give me 2 more min to wrap this one up with a last quick test [12:08:38] Sure! [12:08:57] * Urbanecm waves [12:09:59] Urbanecm: I just deployed my change and was going to do kart_ 's as well, unless you prefer to take over? 
[12:10:11] feel free to do kart_'s as well :) [12:10:41] (03PS2) 10Matthias Mullie: Add Wikidocumentaries campaign for ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649854 (https://phabricator.wikimedia.org/T269875) (owner: 10KartikMistry) [12:10:47] (03CR) 10Matthias Mullie: [C: 03+2] Add Wikidocumentaries campaign for ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649854 (https://phabricator.wikimedia.org/T269875) (owner: 10KartikMistry) [12:11:51] (03Merged) 10jenkins-bot: Add Wikidocumentaries campaign for ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649854 (https://phabricator.wikimedia.org/T269875) (owner: 10KartikMistry) [12:13:12] kart_: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/649854 (wikidocumentaries campaign) is on mwdebug1001 [12:13:31] (03PS2) 10Matthias Mullie: Enable ContentTranslation as default tool for ceb, km, mg, tg and yi WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/650085 (https://phabricator.wikimedia.org/T269113) (owner: 10KartikMistry) [12:15:55] matthiasmullie: Thanks. Bit complicated to test, so it is fine to deploy. [12:15:57] PROBLEM - SSH on ms-be2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:16:08] kart_: LMK when it's ok to proceed [12:16:21] matthiasmullie: just go ahead :) [12:16:31] (03CR) 10Urbanecm: [C: 04-1] "See inline comments." 
(033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649586 (https://phabricator.wikimedia.org/T266020) (owner: 10Gergő Tisza) [12:16:46] (03PS2) 10Jbond: config-master: created and NDA protected directory on config master [puppet] - 10https://gerrit.wikimedia.org/r/650106 [12:16:57] (03CR) 10Matthias Mullie: [C: 03+2] Enable ContentTranslation as default tool for ceb, km, mg, tg and yi WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/650085 (https://phabricator.wikimedia.org/T269113) (owner: 10KartikMistry) [12:17:20] (03CR) 10Urbanecm: [C: 03+1] Enable GrowthExperiments on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/650012 (https://phabricator.wikimedia.org/T266020) (owner: 10Gergő Tisza) [12:17:23] RECOVERY - SSH on ms-be2017 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:17:38] !log mlitn@deploy1001 Synchronized wmf-config/InitialiseSettings.php: a29fec312: Add Wikidocumentaries campaign for ContentTranslation (duration: 01m 02s) [12:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:54] kart_: want to test the other one on mwdebug, or also straight to prod? [12:18:03] (03Merged) 10jenkins-bot: Enable ContentTranslation as default tool for ceb, km, mg, tg and yi WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/650085 (https://phabricator.wikimedia.org/T269113) (owner: 10KartikMistry) [12:18:23] matthiasmullie: that requires testing a bit on mwdebug. [12:18:45] okay - should be on mwdebug1001 now [12:20:55] matthiasmullie: looks good. Please deploy. [12:21:17] 10Operations, 10Technical-blog-posts, 10Traffic: 3rd part of blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T270074 (10ema) >>! In T270074#6696782, @srodlund wrote: > I looked at the doc and was able to copy edit it! If you are able to go through a... 
[12:21:23] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-coord1001.eqiad.wmnet with reason: REIMAGE [12:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:32] !log mlitn@deploy1001 Synchronized wmf-config/InitialiseSettings.php: f3a50cb06: Enable ContentTranslation as default tool for ceb, km, mg, tg and yi WPs (duration: 01m 02s) [12:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:38] kart_: synced [12:22:56] No more last-minute additions to this backport window? [12:23:08] matthiasmullie: cool. Thanks a lot! [12:23:21] matthiasmullie: Nothing for 2020 from me :) [12:23:22] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-coord1001.eqiad.wmnet with reason: REIMAGE [12:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:35] !log EU backport+config window done [12:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:12] (03PS3) 10Jbond: config-master: created and NDA protected directory on config master [puppet] - 10https://gerrit.wikimedia.org/r/650106 [12:26:14] (03PS4) 10Jbond: config-master: created and NDA protected directory on config master [puppet] - 10https://gerrit.wikimedia.org/r/650106 [12:27:59] (03PS5) 10Jbond: config-master: created and NDA protected directory on config master [puppet] - 10https://gerrit.wikimedia.org/r/650106 [12:29:04] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27175/console" [puppet] - 10https://gerrit.wikimedia.org/r/650106 (owner: 10Jbond) [12:31:32] (03PS6) 10Jbond: config-master: created and NDA protected directory on config master [puppet] - 10https://gerrit.wikimedia.org/r/650106 [12:32:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27176/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/650106 (owner: 10Jbond) [12:33:39] (03CR) 10Jbond: [V: 03+1 C: 03+2] config-master: created and NDA protected directory on config master [puppet] - 10https://gerrit.wikimedia.org/r/650106 (owner: 10Jbond) [12:34:21] (03PS1) 10Jbond: Revert "config-master: created and NDA protected directory on config master" [puppet] - 10https://gerrit.wikimedia.org/r/649921 [12:34:27] PROBLEM - SSH on mw1277.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:36:05] !log disable puppet fleet wide for config master vhost change [12:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:55] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 513092216 and 34 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:37:11] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 519292368 and 37 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:38:51] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.03429 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:39:12] (03PS1) 10Hnowlan: maps: increase retries on postgres alert [puppet] - 10https://gerrit.wikimedia.org/r/650108 [12:39:27] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 63 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:39:39] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 5872 and 76 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:40:11] 10Operations, 10LDAP-Access-Requests:
Add lilients_WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T264590 (10Tobi_WMDE_SW) 05Resolved→03Open I think this task has been closed too early and @lilients_WMDE hasn't been added to those groups yet. I'm confirming she's in my team and... [12:40:43] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 537012320 and 54 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:40:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 25%: Repooling after cloning db1154:3315 as sanitarium T268742', diff saved to https://phabricator.wikimedia.org/P13566 and previous config saved to /var/cache/conftool/dbconfig/20201217-124052-root.json [12:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:57] T268742: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742 [12:41:28] (03PS1) 10Jbond: configmaster: move pybal-config to an aliase and config-master to vhost [puppet] - 10https://gerrit.wikimedia.org/r/650109 [12:42:01] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 11000 and 75 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:46:18] (03CR) 10Jbond: [C: 03+2] configmaster: move pybal-config to an aliase and config-master to vhost [puppet] - 10https://gerrit.wikimedia.org/r/650109 (owner: 10Jbond) [12:46:45] jbond42: neat, thanks for the idp fix [12:47:04] godog: np sorry for the breakage ;) [12:47:25] RECOVERY - Long running screen/tmux on maps1006 is OK: OK: No SCREEN or tmux processes detected.
https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [12:51:59] (03PS1) 10Jbond: config-master: nda folder use html readme file [puppet] - 10https://gerrit.wikimedia.org/r/650110 [12:54:20] (03CR) 10Jbond: [C: 03+2] config-master: nda folder use html readme file [puppet] - 10https://gerrit.wikimedia.org/r/650110 (owner: 10Jbond) [12:54:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1106 after cloning db1154:3311 as sanitarium T268742', diff saved to https://phabricator.wikimedia.org/P13567 and previous config saved to /var/cache/conftool/dbconfig/20201217-125446-marostegui.json [12:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:50] T268742: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742 [12:55:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Change db1089 weights', diff saved to https://phabricator.wikimedia.org/P13568 and previous config saved to /var/cache/conftool/dbconfig/20201217-125535-marostegui.json [12:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 50%: Repooling after cloning db1154:3315 as sanitarium T268742', diff saved to https://phabricator.wikimedia.org/P13569 and previous config saved to /var/cache/conftool/dbconfig/20201217-125556-root.json [12:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1089 (re)pooling @ 25%: Repool db1089 after helping out on db1106', diff saved to https://phabricator.wikimedia.org/P13570 and previous config saved to /var/cache/conftool/dbconfig/20201217-125624-root.json [12:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:33] (03PS5) 10Tchanders: Load IPInfo extension in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644551 
(https://phabricator.wikimedia.org/T260599) [12:59:59] (03PS10) 10Hnowlan: sockpuppet-api: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) [13:01:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1087 to clone db1154:3318 add db1092 as vslow,dump service for s8 T268742 ', diff saved to https://phabricator.wikimedia.org/P13571 and previous config saved to /var/cache/conftool/dbconfig/20201217-130101-marostegui.json [13:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:08] T268742: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742 [13:01:23] !log Stop mysql on db1087 to clone db1154 [13:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:07] PROBLEM - config-master.wikimedia.org requires authentication on puppetmaster2001 is CRITICAL: CRITICAL - Cannot make SSL connection. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [13:04:03] (03PS1) 10Jbond: P:configmaster: add abuse networks to NDA folder [puppet] - 10https://gerrit.wikimedia.org/r/650111 (https://phabricator.wikimedia.org/T270285) [13:04:29] (03Abandoned) 10Jbond: Revert "config-master: created and NDA protected directory on config master" [puppet] - 10https://gerrit.wikimedia.org/r/649921 (owner: 10Jbond) [13:05:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27177/console" [puppet] - 10https://gerrit.wikimedia.org/r/650111 (https://phabricator.wikimedia.org/T270285) (owner: 10Jbond) [13:05:26] (03CR) 10Jbond: "This is not great as it removes comments but its a start" [puppet] - 10https://gerrit.wikimedia.org/r/650111 (https://phabricator.wikimedia.org/T270285) (owner: 10Jbond) [13:05:59] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01371 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:06:10] (03CR) 10Jbond: "e.g.: https://puppet-compiler.wmflabs.org/compiler1002/27177/puppetmaster2001.codfw.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/650111 (https://phabricator.wikimedia.org/T270285) (owner: 10Jbond) [13:09:44] (03CR) 10Volans: [C: 03+1] "Nice! Thanks for patch, small nit inline." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/650111 (https://phabricator.wikimedia.org/T270285) (owner: 10Jbond) [13:11:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 75%: Repooling after cloning db1154:3315 as sanitarium T268742', diff saved to https://phabricator.wikimedia.org/P13574 and previous config saved to /var/cache/conftool/dbconfig/20201217-131059-root.json [13:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:04] T268742: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742 [13:11:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1089 (re)pooling @ 50%: Repool db1089 after helping out on db1106', diff saved to https://phabricator.wikimedia.org/P13575 and previous config saved to /var/cache/conftool/dbconfig/20201217-131127-root.json [13:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:36] (03PS2) 10Jbond: P:configmaster: add abuse networks to NDA folder [puppet] - 10https://gerrit.wikimedia.org/r/650111 (https://phabricator.wikimedia.org/T270285) [13:11:54] (03CR) 10Jbond: P:configmaster: add abuse networks to NDA folder (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/650111 (https://phabricator.wikimedia.org/T270285) (owner: 10Jbond) [13:13:31] (03PS3) 10Tchanders: extension-list: Add IPInfo extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644548 (https://phabricator.wikimedia.org/T260599) [13:13:33] (03PS3) 10Tchanders: Add IPInfo config to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644549 (https://phabricator.wikimedia.org/T260599) [13:13:35] (03PS5) 10Tchanders: Add IPInfo extension config to InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644550 (https://phabricator.wikimedia.org/T260599) [13:13:37] (03PS6) 10Tchanders: Load IPInfo extension in CommonSettings.php [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/644551 (https://phabricator.wikimedia.org/T260599) [13:14:13] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:15:57] PROBLEM - config-master.wikimedia.org requires authentication on puppetmaster1001 is CRITICAL: CRITICAL - Cannot make SSL connection. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [13:21:55] (03PS1) 10Elukey: Restore puppet roles for the Hadoop test cluster after reimage [puppet] - 10https://gerrit.wikimedia.org/r/650113 (https://phabricator.wikimedia.org/T269919) [13:22:46] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/650092 (owner: 10Volans) [13:23:07] (03CR) 10Elukey: [C: 03+2] Restore puppet roles for the Hadoop test cluster after reimage [puppet] - 10https://gerrit.wikimedia.org/r/650113 (https://phabricator.wikimedia.org/T269919) (owner: 10Elukey) [13:24:04] (03CR) 10Jbond: [C: 03+2] P:configmaster: add abuse networks to NDA folder [puppet] - 10https://gerrit.wikimedia.org/r/650111 (https://phabricator.wikimedia.org/T270285) (owner: 10Jbond) [13:26:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 100%: Repooling after cloning db1154:3315 as sanitarium T268742', diff saved to https://phabricator.wikimedia.org/P13576 and previous config saved to /var/cache/conftool/dbconfig/20201217-132603-root.json [13:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:07] T268742: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742 [13:26:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1089 (re)pooling @ 75%: Repool db1089 after helping out on db1106', diff saved to https://phabricator.wikimedia.org/P13577 and previous config saved to /var/cache/conftool/dbconfig/20201217-132631-root.json 
[13:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:46] (03PS1) 10Ppchelko: Article: view from old revision cache - set correct revId. [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/649923 (https://phabricator.wikimedia.org/T270361) [13:34:55] RECOVERY - SSH on mw1277.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:38:31] (03CR) 10Jbond: P:phabricator: migrate banlist to abuse-networks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/649882 (https://phabricator.wikimedia.org/T270285) (owner: 10Jbond) [13:38:49] (03CR) 10Ssingh: [C: 03+1] "+1, LGTM: I checked this against my own commit to aptrepo and can verify the key as well. I guess we should get an additional +1 from some" [puppet] - 10https://gerrit.wikimedia.org/r/650010 (https://phabricator.wikimedia.org/T241195) (owner: 10Legoktm) [13:38:54] (03Abandoned) 10Hashar: gerrit: move java config from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/639273 (owner: 10Hashar) [13:41:01] (03CR) 10Elukey: [C: 03+2] gerrit: remove obsolete profile::gerrit::java_version [puppet] - 10https://gerrit.wikimedia.org/r/639272 (owner: 10Hashar) [13:41:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1089 (re)pooling @ 100%: Repool db1089 after helping out on db1106', diff saved to https://phabricator.wikimedia.org/P13578 and previous config saved to /var/cache/conftool/dbconfig/20201217-134134-root.json [13:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1089 into API', diff saved to https://phabricator.wikimedia.org/P13579 and previous config saved to /var/cache/conftool/dbconfig/20201217-134513-marostegui.json [13:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:25] PROBLEM - Widespread puppet agent failures on 
alert1001 is CRITICAL: 0.01119 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:52:04] (03PS1) 10Jbond: P:configmaster: Us ipaddress6? fact instead of DNS [puppet] - 10https://gerrit.wikimedia.org/r/650118 (https://phabricator.wikimedia.org/T270359) [13:52:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:53:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:55:40] (03PS2) 10Jbond: P:configmaster: Us ipaddress6? fact instead of DNS [puppet] - 10https://gerrit.wikimedia.org/r/650118 (https://phabricator.wikimedia.org/T270359) [13:57:06] (03CR) 10Hashar: gerrit: use proper hostname on replica hosts (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/643919 (owner: 10Hashar) [13:57:08] (03CR) 10jerkins-bot: [V: 04-1] P:configmaster: Us ipaddress6? fact instead of DNS [puppet] - 10https://gerrit.wikimedia.org/r/650118 (https://phabricator.wikimedia.org/T270359) (owner: 10Jbond) [13:57:16] (03PS5) 10Hashar: gerrit: use proper hostname on replica hosts [puppet] - 10https://gerrit.wikimedia.org/r/643919 [13:57:58] (03PS3) 10Jbond: P:configmaster: Us ipaddress6? 
fact instead of DNS [puppet] - 10https://gerrit.wikimedia.org/r/650118 (https://phabricator.wikimedia.org/T270359) [13:59:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27180/console" [puppet] - 10https://gerrit.wikimedia.org/r/650118 (https://phabricator.wikimedia.org/T270359) (owner: 10Jbond) [13:59:16] (03CR) 10Hashar: gerrit: use proper hostname on replica hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/643919 (owner: 10Hashar) [13:59:30] (03PS6) 10Hashar: gerrit: use proper hostname on replica hosts [puppet] - 10https://gerrit.wikimedia.org/r/643919 [14:00:22] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/643919 (owner: 10Hashar) [14:00:35] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:configmaster: Us ipaddress6? fact instead of DNS [puppet] - 10https://gerrit.wikimedia.org/r/650118 (https://phabricator.wikimedia.org/T270359) (owner: 10Jbond) [14:04:07] (03CR) 10jerkins-bot: [V: 04-1] Article: view from old revision cache - set correct revId. 
[core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/649923 (https://phabricator.wikimedia.org/T270361) (owner: 10Ppchelko) [14:05:32] (03PS1) 10Jbond: cfssl_ocsprefresh: blank CR soliciting general python post-review [puppet] - 10https://gerrit.wikimedia.org/r/650120 [14:08:57] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/changePropertyDataType.php wikidatawiki --property-id P920 --new-data-type external-id # T269205 [14:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:01] T269205: Change P920 property’s data type from string to external identifier - https://phabricator.wikimedia.org/T269205 [14:09:43] (03PS3) 10Hashar: doc: fix fallback to WMF_DOC_PATH files [puppet] - 10https://gerrit.wikimedia.org/r/648248 (https://phabricator.wikimedia.org/T149924) [14:12:55] (03CR) 10Hashar: "So yeah this one is next and will not affect the behavior on doc.wikimedia.org. It is just that my local setup had two copies of the doc " [puppet] - 10https://gerrit.wikimedia.org/r/648248 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [14:13:19] (03PS3) 10Hashar: doc: switch to scap DocumentRoot [take 2] [puppet] - 10https://gerrit.wikimedia.org/r/647763 (https://phabricator.wikimedia.org/T149924) [14:16:23] (03CR) 10Hashar: "This change originally failed last week cause we were not falling back to WMF_DOC_PATH which I have later implemented in parent change htt" [puppet] - 10https://gerrit.wikimedia.org/r/647763 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [14:36:13] (03CR) 10Volans: [C: 03+1] "Did a full pass as requested, the structure seems good, I've left some comments, mostly optional or for future expansions." (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/650120 (owner: 10Jbond) [14:39:24] (03CR) 10RLazarus: [C: 03+1] "You were right. :) My fault, thanks for the fix." 
[puppet] - 10https://gerrit.wikimedia.org/r/650092 (owner: 10Volans) [14:39:48] (03CR) 10RLazarus: [C: 03+1] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/650092 (owner: 10Volans) [14:43:29] (03PS1) 10David Caro: [wmcs][backups] Add project and vm info [puppet] - 10https://gerrit.wikimedia.org/r/650141 [14:46:37] (03CR) 10Ppchelko: "recheck" [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/649923 (https://phabricator.wikimedia.org/T270361) (owner: 10Ppchelko) [14:48:25] (03CR) 10Volans: [C: 03+2] legoktm: remove from additional groups [puppet] - 10https://gerrit.wikimedia.org/r/650092 (owner: 10Volans) [14:48:36] 10Operations, 10Traffic, 10netops, 10User-jbond: varnihs filtering: should we automaticly update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (10jbond) p:05Triage→03Medium [14:49:28] 10Operations, 10Wikimedia-Mailing-lists: Create a mailing list for North-West Russia Wiki-Historians UG - https://phabricator.wikimedia.org/T270392 (10Red) [14:55:58] 10Operations, 10Analytics-Clusters: Backport kafkacat 1.6.0 from bullseye to buster-backports or buster-wikimedia - https://phabricator.wikimedia.org/T268936 (10klausman) Package is backported and uploaded to reprepro/aptx00y and updated on all stats100x machines. [14:56:22] 10Operations, 10Traffic, 10netops, 10User-jbond: varnihs filtering: should we automaticly update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (10Volans) AWS allow to subscribe to the modification of the list fwiw, see https://docs.aws.amazon.com/general/latest/gr/aws-ip-ranges.html#subscri... 
[14:57:51] 10Operations, 10Traffic, 10netops, 10User-jbond: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (10Volans) [15:00:17] !log ppchelko@deploy1001 Started deploy [restbase/deploy@8e9253b]: Add various wikis T268459 T269428 T269433 T268413 T269441 [15:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:25] T268413: Add skrwiki to RESTBase - https://phabricator.wikimedia.org/T268413 [15:00:25] T269428: Add eowikivoyage to RESTBase - https://phabricator.wikimedia.org/T269428 [15:00:25] T269433: Add wawikisource to RESTBase - https://phabricator.wikimedia.org/T269433 [15:00:26] T268459: Add skrwiktionary to RESTBase - https://phabricator.wikimedia.org/T268459 [15:00:26] T269441: Add madwiki to RESTBase - https://phabricator.wikimedia.org/T269441 [15:01:44] (03PS1) 10Phuedx: Revert "vue: Log component errors" [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/649924 [15:14:29] (03CR) 10Ottomata: Migrate TemplateWizard to full "new" events (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/650093 (https://phabricator.wikimedia.org/T238230) (owner: 10Awight) [15:16:31] (03PS1) 10JMeybohm: docker_registry_ha: Add "Vary: Accept" to response [puppet] - 10https://gerrit.wikimedia.org/r/650153 (https://phabricator.wikimedia.org/T256762) [15:22:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1087 after cloning db1154:3318', diff saved to https://phabricator.wikimedia.org/P13582 and previous config saved to /var/cache/conftool/dbconfig/20201217-152233-marostegui.json [15:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:00] (03PS1) 10Jbond: httpd: Add abbility to remove the defauls ports configueration [puppet] - 10https://gerrit.wikimedia.org/r/650154 (https://phabricator.wikimedia.org/T263831) [15:23:02] (03PS1) 10Jbond: puppetmaster: remove default apache ports from puppetmaster [puppet] 
- 10https://gerrit.wikimedia.org/r/650155 (https://phabricator.wikimedia.org/T263831) [15:23:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully depool db1092', diff saved to https://phabricator.wikimedia.org/P13583 and previous config saved to /var/cache/conftool/dbconfig/20201217-152347-marostegui.json [15:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1092 (re)pooling @ 25%: Repooling after cloning db1092 after serving as vslow/dump', diff saved to https://phabricator.wikimedia.org/P13584 and previous config saved to /var/cache/conftool/dbconfig/20201217-152420-root.json [15:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:19] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: remove default apache ports from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/650155 (https://phabricator.wikimedia.org/T263831) (owner: 10Jbond) [15:26:34] (03PS2) 10Jbond: puppetmaster: remove default apache ports from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/650155 (https://phabricator.wikimedia.org/T263831) [15:27:37] 10Operations, 10Analytics-Clusters: Backport kafkacat 1.6.0 from bullseye to buster-backports or buster-wikimedia - https://phabricator.wikimedia.org/T268936 (10Ottomata) Yahoo! 
[15:27:40] (03PS1) 10Ema: vcl: do not stream responses to docker [puppet] - 10https://gerrit.wikimedia.org/r/650156 (https://phabricator.wikimedia.org/T270270) [15:32:13] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@8e9253b]: Add various wikis T268459 T269428 T269433 T268413 T269441 (duration: 31m 57s) [15:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:19] T268413: Add skrwiki to RESTBase - https://phabricator.wikimedia.org/T268413 [15:32:20] T269428: Add eowikivoyage to RESTBase - https://phabricator.wikimedia.org/T269428 [15:32:20] T269433: Add wawikisource to RESTBase - https://phabricator.wikimedia.org/T269433 [15:32:20] T268459: Add skrwiktionary to RESTBase - https://phabricator.wikimedia.org/T268459 [15:32:20] T269441: Add madwiki to RESTBase - https://phabricator.wikimedia.org/T269441 [15:33:05] (03CR) 10Bstorm: [C: 03+2] toolsdb: remove temporary replication filters [puppet] - 10https://gerrit.wikimedia.org/r/636469 (https://phabricator.wikimedia.org/T257274) (owner: 10Bstorm) [15:34:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27183/console" [puppet] - 10https://gerrit.wikimedia.org/r/650155 (https://phabricator.wikimedia.org/T263831) (owner: 10Jbond) [15:34:47] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetmaster: remove default apache ports from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/650155 (https://phabricator.wikimedia.org/T263831) (owner: 10Jbond) [15:34:53] (03CR) 10Jbond: [C: 03+2] httpd: Add abbility to remove the defauls ports configueration [puppet] - 10https://gerrit.wikimedia.org/r/650154 (https://phabricator.wikimedia.org/T263831) (owner: 10Jbond) [15:36:20] !log ppchelko@deploy1001 Started deploy [restbase/deploy@4c2e0b6]: Add various wikis T268459 T269428 T269433 T268413 T269441 [15:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:24] !log marostegui@cumin1001 
dbctl commit (dc=all): 'db1092 (re)pooling @ 50%: Repooling after cloning db1092 after serving as vslow/dump', diff saved to https://phabricator.wikimedia.org/P13585 and previous config saved to /var/cache/conftool/dbconfig/20201217-153923-root.json [15:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:15] (03PS1) 10Jbond: httpd: just empty file don't remove it as its included in apache.conf [puppet] - 10https://gerrit.wikimedia.org/r/650160 [15:40:31] (03CR) 10Jbond: [V: 03+2 C: 03+2] httpd: just empty file don't remove it as its included in apache.conf [puppet] - 10https://gerrit.wikimedia.org/r/650160 (owner: 10Jbond) [15:41:20] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01121 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:41:29] ^^ looking [15:43:38] 10Operations, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: decom helium and heze - https://phabricator.wikimedia.org/T260717 (10jcrespo) [15:43:42] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10jcrespo) [15:45:14] (03PS3) 10RLazarus: icinga/prometheus: add notes_link to puppetboard to puppet failure alerts [puppet] - 10https://gerrit.wikimedia.org/r/649999 (https://phabricator.wikimedia.org/T270354) (owner: 10Dzahn) [15:45:35] 10Operations, 10Puppet, 10Patch-For-Review: Add notes URL for Puppet failure alerts - https://phabricator.wikimedia.org/T270354 (10RLazarus) a:05RLazarus→03Dzahn ... Or @Dzahn could just fix it in https://gerrit.wikimedia.org/r/649999. :D Thanks! [15:46:11] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10jcrespo) I have proposed to do the final steps for decommission of old backups servers heze and helium in FY2021Q3. T260717 (CC @akosiaris @MoritzMuehlenhoff - I don't think I will require anything specific... 
[15:46:55] (03CR) 10RLazarus: [C: 03+1] "LGTM! For the record, the workaround for that broken embed is to switch from grafana to grafana-rw, I assume so that it has access to the " [puppet] - 10https://gerrit.wikimedia.org/r/649999 (https://phabricator.wikimedia.org/T270354) (owner: 10Dzahn) [15:54:11] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@4c2e0b6]: Add various wikis T268459 T269428 T269433 T268413 T269441 (duration: 17m 51s) [15:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:19] T268413: Add skrwiki to RESTBase - https://phabricator.wikimedia.org/T268413 [15:54:19] T269428: Add eowikivoyage to RESTBase - https://phabricator.wikimedia.org/T269428 [15:54:20] T269433: Add wawikisource to RESTBase - https://phabricator.wikimedia.org/T269433 [15:54:20] T268459: Add skrwiktionary to RESTBase - https://phabricator.wikimedia.org/T268459 [15:54:20] T269441: Add madwiki to RESTBase - https://phabricator.wikimedia.org/T269441 [15:54:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1092 (re)pooling @ 75%: Repooling after cloning db1092 after serving as vslow/dump', diff saved to https://phabricator.wikimedia.org/P13586 and previous config saved to /var/cache/conftool/dbconfig/20201217-155427-root.json [15:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:26] (03CR) 10Andrew Bogott: [C: 03+1] partman: build a recipe to re-image nfs servers [puppet] - 10https://gerrit.wikimedia.org/r/647815 (https://phabricator.wikimedia.org/T266199) (owner: 10Bstorm) [15:58:55] (03CR) 10Bstorm: [C: 03+2] partman: build a recipe to re-image nfs servers [puppet] - 10https://gerrit.wikimedia.org/r/647815 (https://phabricator.wikimedia.org/T266199) (owner: 10Bstorm) [16:00:15] (03PS2) 10Ema: vcl: do not stream responses to docker [puppet] - 10https://gerrit.wikimedia.org/r/650156 (https://phabricator.wikimedia.org/T270270) [16:02:23] 10Operations, 10LDAP-Access-Requests: Add lilients_WMDE to 
the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T264590 (10RLazarus) a:05lilients_WMDE→03RLazarus Looks that way! I'll take care of this today, thanks for flagging. [16:05:01] (03PS1) 10Elukey: role::analytics_test_cluster::coordinator: fix oozie shlib path [puppet] - 10https://gerrit.wikimedia.org/r/650165 [16:05:31] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::coordinator: fix oozie shlib path [puppet] - 10https://gerrit.wikimedia.org/r/650165 (owner: 10Elukey) [16:09:28] (03CR) 10Mstyles: Add new helm chart for rdf-streaming-updater (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: 10Mstyles) [16:09:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1092 (re)pooling @ 100%: Repooling after cloning db1092 after serving as vslow/dump', diff saved to https://phabricator.wikimedia.org/P13588 and previous config saved to /var/cache/conftool/dbconfig/20201217-160930-root.json [16:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:52] (03PS3) 10Ema: vcl: do not stream responses to docker [puppet] - 10https://gerrit.wikimedia.org/r/650156 (https://phabricator.wikimedia.org/T270270) [16:10:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1092 into API', diff saved to https://phabricator.wikimedia.org/P13589 and previous config saved to /var/cache/conftool/dbconfig/20201217-161052-marostegui.json [16:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:41] (03CR) 10Gehel: [C: 03+1] "good enough for now, we can iterate once the basic version is merged." [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: 10Mstyles) [16:13:08] (03CR) 10JMeybohm: [C: 03+1] "I have no clue of VCL. With that in mind: LGTM. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/650156 (https://phabricator.wikimedia.org/T270270) (owner: 10Ema) [16:13:30] (03CR) 10Ema: [C: 03+2] vcl: do not stream responses to docker [puppet] - 10https://gerrit.wikimedia.org/r/650156 (https://phabricator.wikimedia.org/T270270) (owner: 10Ema) [16:14:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1092 with its original weight into API', diff saved to https://phabricator.wikimedia.org/P13590 and previous config saved to /var/cache/conftool/dbconfig/20201217-161453-marostegui.json [16:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:52] (03CR) 10Effie Mouzeli: [C: 03+2] hiera: upgrade mc1020, mc2020 to buster [puppet] - 10https://gerrit.wikimedia.org/r/650078 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [16:18:38] (03PS1) 10Jbond: config-master: add envot tlsproxy config [puppet] - 10https://gerrit.wikimedia.org/r/650166 (https://phabricator.wikimedia.org/T270185) [16:19:45] (03PS2) 10Jbond: config-master: add envot tlsproxy config [puppet] - 10https://gerrit.wikimedia.org/r/650166 (https://phabricator.wikimedia.org/T270185) [16:22:14] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc2020.codfw.wmnet ` The log can be... [16:22:18] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc1020.eqiad.wmnet ` The log can be... 
[16:24:07] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27188/console" [puppet] - 10https://gerrit.wikimedia.org/r/650166 (https://phabricator.wikimedia.org/T270185) (owner: 10Jbond) [16:24:22] (03PS2) 10Reedy: Enable StopForumSpam on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521233 (https://phabricator.wikimedia.org/T181217) [16:25:17] (03CR) 10Jbond: [V: 03+1 C: 03+2] config-master: add envot tlsproxy config [puppet] - 10https://gerrit.wikimedia.org/r/650166 (https://phabricator.wikimedia.org/T270185) (owner: 10Jbond) [16:26:43] (03PS1) 10Cwhite: profile: deploy filter_scripts directory to logstash 7 collectors [puppet] - 10https://gerrit.wikimedia.org/r/650170 (https://phabricator.wikimedia.org/T234565) [16:28:26] RECOVERY - config-master.wikimedia.org requires authentication on puppetmaster1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:29:03] (03CR) 10SBassett: "Per I1fa922adb, we'll definitely want to update the values for SFSIPListLocation and SFSIPListLocationMD5. Also wondering if we should ru" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521233 (https://phabricator.wikimedia.org/T181217) (owner: 10Reedy) [16:31:54] 10Operations, 10MediaWiki-Docker, 10Traffic, 10serviceops, and 2 others: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (10JMeybohm) 05Open→03Resolved a:03JMeybohm For the record: We where sending "Content-Type:... [16:31:59] 10Operations, 10MediaWiki-Docker, 10Traffic, 10serviceops, and 2 others: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (10ema) This should now be fixed by using https://gerrit.wikimedia.org/r/c/operations/puppet/+/650... 
[16:32:43] (03CR) 10Klausman: [C: 03+2] "Copytruncate and not restarting the logging daemon is exactly the right thing to do here." [puppet] - 10https://gerrit.wikimedia.org/r/650088 (owner: 10Elukey) [16:34:04] (03PS1) 10Jbond: trafficserver: connect to config master backend using TLS [puppet] - 10https://gerrit.wikimedia.org/r/650171 (https://phabricator.wikimedia.org/T270185) [16:35:04] 10Operations, 10MediaWiki-Containers: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10dancy) Right now when you visit https://docker-registry.wikimedia.org/ you get a generic nginx welcome page. I propose having this URL redirect to https://dockerregistry.tool... [16:35:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27189/console" [puppet] - 10https://gerrit.wikimedia.org/r/650171 (https://phabricator.wikimedia.org/T270185) (owner: 10Jbond) [16:36:04] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1020.eqiad.wmnet with reason: REIMAGE [16:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:38] 10Operations, 10Traffic, 10Patch-For-Review: Docker registry needs cache to vary on Accept header value - https://phabricator.wikimedia.org/T242200 (10JMeybohm) a:03JMeybohm [16:38:05] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1020.eqiad.wmnet with reason: REIMAGE [16:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:40] RECOVERY - config-master.wikimedia.org requires authentication on puppetmaster2001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:40:05] 10Operations, 10MediaWiki-Docker, 10Traffic, 10serviceops, and 2 others: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length 
header` - https://phabricator.wikimedia.org/T270270 (10kostajh) 05Resolved→03Open >>! In T270270#6698851, @ema wrote: > This should now be fixed b... [16:40:16] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2020.codfw.wmnet with reason: REIMAGE [16:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:48] (03CR) 10Reedy: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521233 (https://phabricator.wikimedia.org/T181217) (owner: 10Reedy) [16:41:39] 10Operations, 10MediaWiki-Docker, 10Traffic, 10serviceops, and 2 others: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (10JMeybohm) Maybe Mac sends a different user-agent? That would be fun... [16:42:18] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2020.codfw.wmnet with reason: REIMAGE [16:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:36] (03Abandoned) 10Hashar: Phabricator: New IP banning format, remove lines in old non-working format [puppet] - 10https://gerrit.wikimedia.org/r/649753 (https://phabricator.wikimedia.org/T270185) (owner: 10Wolfgang Kandek) [16:50:16] (03CR) 10Ema: [C: 03+1] trafficserver: connect to config master backend using TLS [puppet] - 10https://gerrit.wikimedia.org/r/650171 (https://phabricator.wikimedia.org/T270185) (owner: 10Jbond) [16:50:54] (03CR) 10Jbond: [V: 03+1 C: 03+2] trafficserver: connect to config master backend using TLS [puppet] - 10https://gerrit.wikimedia.org/r/650171 (https://phabricator.wikimedia.org/T270185) (owner: 10Jbond) [16:52:01] (03CR) 10Kosta Harlan: vcl: do not stream responses to docker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/650156 (https://phabricator.wikimedia.org/T270270) (owner: 10Ema) [16:54:55] (03CR) 10Dzahn: [C: 03+2] icinga/prometheus: add notes_link to puppetboard to puppet failure 
alerts [puppet] - 10https://gerrit.wikimedia.org/r/649999 (https://phabricator.wikimedia.org/T270354) (owner: 10Dzahn) [16:58:12] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/650108 (owner: 10Hnowlan) [16:58:52] (03CR) 10Jbond: "was ment to be linked to T263831" [puppet] - 10https://gerrit.wikimedia.org/r/650171 (https://phabricator.wikimedia.org/T270185) (owner: 10Jbond) [16:59:15] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/650021 (https://phabricator.wikimedia.org/T265908) (owner: 10Ryan Kemper) [16:59:38] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission es1019.eqiad.wmnet - https://phabricator.wikimedia.org/T270159 (10wiki_willy) a:05wiki_willy→03Cmjohnson [16:59:39] (03CR) 10Hnowlan: [C: 03+2] maps: increase retries on postgres alert [puppet] - 10https://gerrit.wikimedia.org/r/650108 (owner: 10Hnowlan) [16:59:45] 10Operations, 10Traffic, 10serviceops, 10HTTPS: puppetmaster[12]001: add TLS termination - https://phabricator.wikimedia.org/T263831 (10jbond) 05Open→03Resolved a:03jbond Sorry for the delay however this has been configured now [16:59:47] 10Operations, 10Traffic, 10HTTPS, 10codfw-rollout: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580 (10jbond) [17:00:04] jbond42 and cdanis: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201217T1700). [17:00:08] (03PS1) 10David Caro: [wmcs] Move some heavy backups to cloudvirt1026 [puppet] - 10https://gerrit.wikimedia.org/r/650178 (https://phabricator.wikimedia.org/T269419) [17:02:31] 10Operations, 10Traffic, 10HTTPS, 10codfw-rollout: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580 (10Dzahn) >>! 
In T108580#6488253, @BBlack wrote: > $ grep 'replacement: http:' hieradata/common/profile/trafficserver/backend.yaml > replacement: http://puppetmaster1001.... [17:04:09] (PS1) Razzi: role::analytics_cluster::ui::dashboards: Add superset to an-tool1010 [puppet] - https://gerrit.wikimedia.org/r/650179 (https://phabricator.wikimedia.org/T268219) [17:04:11] Operations, Traffic, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn) also see T108580 [17:04:48] Operations, Traffic, serviceops, HTTPS: puppetmaster[12]001: add TLS termination - https://phabricator.wikimedia.org/T263831 (Dzahn) @jbond @ema So puppetmaster1001 can also be checked off on T210411 ? [17:07:06] (PS2) David Caro: [wmcs][backups] Add project and vm info [puppet] - https://gerrit.wikimedia.org/r/650141 (https://phabricator.wikimedia.org/T267195) [17:09:36] Operations, Platform Engineering, Wikidata, serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc1020.eqiad.wmnet'] ` and were **ALL** successful. [17:11:14] Operations, Platform Engineering, Wikidata, serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2020.codfw.wmnet'] ` and were **ALL** successful. [17:15:38] rzl: re: "widespread puppet" alert.. just noticed this.. a bug as well? https://phab.wmfusercontent.org/file/data/d3ks2hlgl3u7mnpb3zo3/PHID-FILE-2ob4ivku4t2d6x2a424w/Screenshot_at_2020-12-17_09-14-03.png [17:17:42] PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:19:51] (CR) Filippo Giunchedi: [C: +1] profile: deploy filter_scripts directory to logstash 7 collectors [puppet] - https://gerrit.wikimedia.org/r/650170 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite) [17:22:16] ah, ok. so the second one seems to be a meta-check that checks if the other check gets data [17:22:36] * mutante stops making a ticket [17:24:10] CUSTOM - Widespread puppet agent failures on alert1001 is WARNING: 0.007477 ge 0.006 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [17:24:20] RECOVERY - Check systemd state on ms-be2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:25:16] Operations, Puppet: Add notes URL for Puppet failure alerts - https://phabricator.wikimedia.org/T270354 (Dzahn) Open→Resolved Deployed and tested by sending a custom notification: ` <+icinga-wm> CUSTOM - Widespread puppet agent failures on alert1001 is WARNING: 0.007477 ge 0.006 https://puppetbo... [17:25:40] (CR) Dzahn: "<+icinga-wm> CUSTOM - Widespread puppet agent failures on alert1001 is WARNING: 0.007477 ge 0.006 https://puppetboard.wikimedia.org/nodes?" [puppet] - https://gerrit.wikimedia.org/r/649999 (https://phabricator.wikimedia.org/T270354) (owner: Dzahn) [17:27:08] (CR) Elukey: [C: +2] druid: Migrate hiera() to lookup() and set data type in monitoring [puppet] - https://gerrit.wikimedia.org/r/649721 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup) [17:27:55] Operations, MediaWiki-Docker, Traffic, serviceops, User-zeljkofilipin: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (ema) >>! In T270270#6698866, @JMeybohm wrote: > Maybe Mac sends a different user-agen...
[17:30:49] (CR) Dzahn: [C: +2] doc: fix fallback to WMF_DOC_PATH files [puppet] - https://gerrit.wikimedia.org/r/648248 (https://phabricator.wikimedia.org/T149924) (owner: Hashar) [17:31:14] (CR) Dzahn: "per "already tested on staging instance in devtools (yay!)"" [puppet] - https://gerrit.wikimedia.org/r/648248 (https://phabricator.wikimedia.org/T149924) (owner: Hashar) [17:31:24] mutante: I am around :] [17:31:37] hashar: cool, deploying [17:31:52] I feel ashamed for last week's deploy [17:31:56] Operations, Platform Engineering, Wikidata, serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (jijiki) [17:32:11] cause clearly I should have validated the whole chain from scratch instead of building up on preexisting condition. [17:32:13] ahhh.. nah, don't worry about it [17:32:22] I would have hit the issue the same way I did on monday when rebuilding on WMCS instance [17:32:23] but [17:32:23] it's cool that you did that on devtools now [17:32:27] we now have a staging area! [17:32:34] change applied on doc1001.. running httpbb [17:32:47] PASS: 9 requests sent to doc1001.eqiad.wmnet. All assertions passed. [17:33:10] yes, staging area is nice [17:33:48] hashar: ok, do you wanna try the actual switch as well? [17:34:16] if you feel like it SURE!
[17:34:39] (03PS4) 10Dzahn: doc: switch to scap DocumentRoot [take 2] [puppet] - 10https://gerrit.wikimedia.org/r/647763 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [17:35:14] that is the step that failed last week [17:35:19] but should work now :] [17:35:38] yep, but now we have tests too [17:35:53] (03CR) 10Dzahn: [C: 03+2] doc: switch to scap DocumentRoot [take 2] [puppet] - 10https://gerrit.wikimedia.org/r/647763 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [17:37:08] yeah the tests are great [17:38:35] !log restarted apache2 on doc1001 [17:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:48] hashar: tests are still passing ^ [17:38:54] oh really! [17:38:56] oh wait [17:39:03] not yet :) [17:39:18] sorry, i got a phone call of course in the worst moment [17:39:23] oh the suspense [17:39:55] lol [17:40:07] dont worry, it is not like there are million of users [17:40:11] !log restarted apache2 on doc1001 [17:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:15] NOW they are applied [17:40:19] and the tests still pass [17:41:02] well that was uneventful .. 
as we like it [17:41:17] the beauty of spamming a lot of tiny atomic changes [17:41:25] indeed [17:41:43] the next one has a conflict [17:41:48] and requires some directory to be moved around [17:42:00] but I think i have to clean it up first [17:42:16] now we are getting into the "remove old path" part, right [17:42:37] ok, ack [17:42:59] not immediately following up with the removal is not a bad idea anyways [17:46:22] (03PS6) 10Hashar: doc: relocate published documents to /srv/doc [puppet] - 10https://gerrit.wikimedia.org/r/625644 (https://phabricator.wikimedia.org/T149924) [17:46:41] mutante: I have rebased the next change in the series [17:46:58] that needs restart of apache / rsync [17:47:31] and a manual move of files /srv/docroot/org/wikimedia/doc over /srv/doc (which only contains a BACKMEUP dummy file ) [17:50:29] so essentially rm /srv/doc && mv /srv/docroot/org/wikimedia/doc /srv/doc [17:52:06] hashar: If we can accept the short downtime, ok, let's get it done with. [17:52:37] (03CR) 10Dzahn: [C: 03+2] doc: relocate published documents to /srv/doc [puppet] - 10https://gerrit.wikimedia.org/r/625644 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [17:53:06] yeah there is a tiny race condition but it is not the end of the world [17:53:14] yep, agreed [17:53:35] so I guess run puppet [17:53:49] restart rsync / apache to ensure the conf is properly applied [17:53:51] mv the files [17:53:56] err [17:54:09] mv the dir /srv/docroot/org/wikimedia/doc over /srv/doc [17:54:12] can you test doc-uploader can still upload too?
[17:54:36] yeah [17:55:00] !log doc1001 - systemctl restart rsync [17:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:28] !log doc1001 - mv /srv/docroot/org/wikimedia/doc/ /srv/doc [17:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:35] (03PS5) 10Hashar: doc: stop backup for old doc directory [puppet] - 10https://gerrit.wikimedia.org/r/625649 (https://phabricator.wikimedia.org/T149924) [17:55:38] !log doc1001 systemctl restart apache2 [17:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:52] hashar: tests fail [17:55:54] this is the moment of truth [17:56:32] hmm stuff seems to work for me [17:56:40] at least for the few manual test cases I am doing [17:57:03] ah no https://doc.wikimedia.org/cover/ fails [17:57:06] https://phabricator.wikimedia.org/P13592 [17:57:12] 8 out of 9 fail [17:57:43] yeah error 500 :/ [17:58:02] oh yeah [17:58:05] got moved to the wrong place :] [17:58:08] /srv/doc/doc ! [17:58:13] yes [17:58:41] good news some CI job managed to publish stuff under /srv/doc so that part works [17:58:47] well, and now we messed it up by moving things at the same time [17:58:57] yeah [17:58:59] I said to move over [17:59:01] but it is not a big deal [17:59:06] let me move stuff around :] [18:00:02] well I was also in the middle of fixing it when stuff disappeared. stepped back [18:00:04] chrisalbon and accraze: How many deployers does it take to do Services – Graphoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201217T1800).
[18:00:09] ah sorry [18:00:33] damn [18:00:45] so we have lost coverage reports of all extensions [18:01:17] see doc.old [18:01:22] cause the ci published doc /srv/cover-extensions dir only had WikiLambda [18:01:33] err [18:01:38] cause the ci published doc /srv/doc/cover-extensions dir only had WikiLambda [18:01:46] and got moved over to /srv/doc/doc.old/cover-extensions [18:01:52] effectively wiping all the stuff in there [18:02:22] not sure I follow, it's /src/doc.old [18:02:26] /srv/doc.old [18:03:10] it just had to move up one level [18:03:31] checking checking [18:03:40] we clearly stepped on each other's toes while moving stuff [18:03:41] yeah I found them [18:03:53] i am not running anything except ls for now [18:04:06] I was doing some mv command at the same time [18:04:09] they conflicted somehow [18:04:12] but the files are still there [18:04:14] \o/ [18:04:17] yes, that. confirmed [18:04:19] :) [18:04:25] can you move /srv/doc.old to /srv/doc please ? [18:04:45] done: [18:04:50] @doc1001:/srv# ls [18:04:50] deployment doc docroot [18:04:56] great [18:04:59] PASS: 9 requests sent to doc1001.eqiad.wmnet. All assertions passed. [18:05:18] I will restore the cover-extensions dir which ended up at /srv/doc/doc2/cover-extensions [18:05:20] now try the rsync again i guess [18:05:26] ok [18:06:55] la [18:07:23] https://doc.wikimedia.org/cover/ !! [18:07:25] works :] [18:07:28] great [18:07:45] so essentially we have killed T149924 the oldest CI bug around dating from 2016 \o/ [18:07:45] T149924: Clear /srv/.git on contint1001; move integration.wikimedia.org docroot to new location - https://phabricator.wikimedia.org/T149924 [18:07:58] which has been a tech debt since well forever [18:08:07] nice:) [18:08:20] we get the docroot code deployed with scap [18:08:22] httpbb [18:08:30] the ci published doc split to their own little standalone dir [18:08:39] which will make our life ten times easier!
[18:10:05] last thing is to drop the backup of the old dir which is no longer used / empty https://gerrit.wikimedia.org/r/c/operations/puppet/+/625649 :] [18:11:42] (03PS5) 10Hashar: doc: remove legacy doc directory [puppet] - 10https://gerrit.wikimedia.org/r/625650 (https://phabricator.wikimedia.org/T149924) [18:13:01] 10Operations, 10ops-eqsin, 10DC-Ops: cr2-eqsin: fan failure - https://phabricator.wikimedia.org/T267544 (10RobH) a:05RobH→03ayounsi Jin has the new router, but we're going to wait until Arzhel returns in January to swap this out. I'm planning for the first or second week of January, and advised Juniper... [18:13:06] mutante: can you stop the backup of the old data dir ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/625649/ ) and drop the related puppet config ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/625650/ ) ? I will do the cleanup on the server [18:13:10] and after that we are all set! [18:14:08] hashar: I can remove the backup::set code, I cannot stop jobs manually on bacula [18:14:17] unless there is a reason to get into that [18:15:14] yeah maybe I should check with Jaime later [18:15:23] I don't know what is the impact of dropping a backup::set [18:15:39] dropping backup::set has no impact [18:15:56] but deleting "bacula::director::fileset" in the same change [18:16:07] could cause an issue potentially [18:16:16] so I should split those ? [18:16:23] and get the fileset removal done by Jaime? [18:16:46] yes, split them.
[18:16:56] either that or simply we merge it later [18:18:02] (03PS6) 10Hashar: doc: unconfigure legacy backup::set [puppet] - 10https://gerrit.wikimedia.org/r/625649 (https://phabricator.wikimedia.org/T149924) [18:18:13] just so the backup server isn't looking for a non-existing fileset [18:18:35] (03CR) 10Dzahn: [C: 03+2] doc: unconfigure legacy backup::set [puppet] - 10https://gerrit.wikimedia.org/r/625649 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [18:18:41] mutante: done :] [18:18:45] (03PS6) 10Hashar: doc: remove legacy doc directory [puppet] - 10https://gerrit.wikimedia.org/r/625650 (https://phabricator.wikimedia.org/T149924) [18:20:24] (03PS1) 10Hashar: bacula: remove unused srv-docroot-org-wikimedia-doc [puppet] - 10https://gerrit.wikimedia.org/r/650214 (https://phabricator.wikimedia.org/T149924) [18:20:52] (03PS1) 10Ahmon Dancy: Redirect top level URl to https://dockerregistry.toolforge.org/ [puppet] - 10https://gerrit.wikimedia.org/r/650215 (https://phabricator.wikimedia.org/T179696) [18:21:31] hashar: Notice: /Stage[main]/Profile::Doc/File[/srv/docroot/org/wikimedia/doc]/ensure: created [18:21:40] yeah [18:22:00] just saying puppet still re-creates that [18:22:00] that is removed from puppet with https://gerrit.wikimedia.org/r/c/operations/puppet/+/625650 [18:22:05] ack [18:22:08] puppet recreated it cause we moved it to /srv/doc [18:22:10] (03PS2) 10Ahmon Dancy: Redirect top level URl to https://dockerregistry.toolforge.org/ [puppet] - 10https://gerrit.wikimedia.org/r/650215 (https://phabricator.wikimedia.org/T179696) [18:22:15] that was a large series of changes :D [18:22:28] (03PS3) 10Ahmon Dancy: Redirect top level URL to https://dockerregistry.toolforge.org/ [puppet] - 10https://gerrit.wikimedia.org/r/650215 (https://phabricator.wikimedia.org/T179696) [18:22:32] when I said the backup::set removal has no impact... it means there is literally no puppet change on doc1001 or backup1001 [18:22:49] but that's the usual puppet thing.. 
stuff does not get removed magically [18:23:10] jouncebot: next [18:23:10] In 0 hour(s) and 36 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201217T1900) [18:23:13] but it's not like you want to stop all backups.. so that's ok [18:23:17] I am counting 19 changes, not counting the sub tasks! [18:25:53] mutante: so we are left with https://gerrit.wikimedia.org/r/c/operations/puppet/+/625650 and we can close the task finally :] [18:28:17] yes, I am reading it, please no rush. making a backup [18:28:54] oh good point [18:29:07] cause surely configuring a backup but not running it initially is certainly useless [18:29:13] I completely missed that one [18:32:00] well, that's another thing. I mean saving the old stuff before removing it. But is there a new backup::set? [18:32:41] (03CR) 10Niharika29: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644548 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [18:33:59] mutante: yes there is [18:34:21] mutante: we had set it up a few months ago, that is why the otherwise empty /srv/doc dir had a BACKMEUP dummy file [18:34:45] (03PS7) 10Tchanders: Load IPInfo extension in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644551 (https://phabricator.wikimedia.org/T260599) [18:34:46] (03PS6) 10Tchanders: Add IPInfo extension config to InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644550 (https://phabricator.wikimedia.org/T260599) [18:37:12] hashar: alright, merged and now manually deleting /srv/docroot [18:37:20] (there is backup in /root) [18:37:35] awesome [18:37:52] !log doc1001 rm -rf /srv/docroot [18:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:02] to remove the backup::fileset I should loop it with Jaime right ?
https://gerrit.wikimedia.org/r/c/operations/puppet/+/650214 [18:38:13] The defined FileSet resources are: [18:38:13] 1: srv-doc [18:38:13] 2: srv-docroot-org-wikimedia-doc [18:38:19] (03CR) 10Tchanders: Add IPInfo extension config to InitialiseSettings-labs.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644550 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [18:38:24] hashar: ^ this is from bacula console when restoring stuff [18:38:30] as you can see both file sets are there [18:38:43] from the past backup runs [18:39:03] and you ran a backup for srv-doc fileset right? [18:39:12] now if I switch to the first one and move in the virtual file system [18:39:18] I can see that "BACKMEUP" file [18:39:37] (03CR) 10DannyS712: Add IPInfo extension config to InitialiseSettings-labs.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644550 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [18:39:50] no, that will happen when Bacula scheduled it [18:40:06] so we can leave that as is [18:40:11] and do the cleanup next week/year [18:40:28] (03PS2) 10Awight: Migrate TemplateWizard to full "new" events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/650093 (https://phabricator.wikimedia.org/T238230) [18:40:41] (03CR) 10Awight: "Thanks for the explanations!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/650093 (https://phabricator.wikimedia.org/T238230) (owner: 10Awight) [18:41:09] (03PS3) 10Gergő Tisza: Configure GrowthExperiments on Bangla Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649586 (https://phabricator.wikimedia.org/T266020) [18:41:18] (03CR) 10Gergő Tisza: Configure GrowthExperiments on Bangla Wikipedia (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649586 (https://phabricator.wikimedia.org/T266020) (owner: 10Gergő Tisza) [18:41:28] (03PS4) 10Gergő Tisza: Configure GrowthExperiments on Bangla Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649586 (https://phabricator.wikimedia.org/T266020) [18:41:48] hashar: I am telling Bacula to run it at .. [18:41:58] When: 2020-12-17 18:41:25 [18:41:59] (03PS7) 10Tchanders: Add IPInfo extension config to InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644550 (https://phabricator.wikimedia.org/T260599) [18:42:06] Job queued. JobId=290810 [18:42:22] so now we just wait a little bit now and check again [18:43:07] yes, removing the fileset from bacula itself can happen later and is pretty minor cleanup [18:43:13] great [18:43:37] (03CR) 10Tchanders: Add IPInfo extension config to InitialiseSettings-labs.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644550 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [18:43:58] (03CR) 10Hashar: "Backup of the new file set srv-doc is ongoing once completed this can be applied to drop the old fileset :)" [puppet] - 10https://gerrit.wikimedia.org/r/650214 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [18:44:09] (03PS2) 10Razzi: role::analytics_cluster::ui::dashboards: Add superset to an-tool1010 [puppet] - 10https://gerrit.wikimedia.org/r/650179 (https://phabricator.wikimedia.org/T268219) [18:47:42] mutante: I have closed the task :] Well done! 
[18:49:16] Re: "removing the fileset from bacula itself can happen later and is pretty minor cleanup", in general we don't remove thing from bacula, but maybe you mean configuration? [18:50:01] old datasets will be available for a while, will be recycled with time [18:50:07] (03CR) 10Niharika29: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644550 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [18:51:03] (03CR) 10Niharika29: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644551 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [18:52:05] jynus: we mean removing the definition of a fileset from backup server once it's not used anymore [18:52:13] ah, sure [18:52:35] what I did not want to do was merge _in the same patch_ removing a backup::set from a client AND the fileset it uses from the server [18:53:00] just in case then it fails looking for a missing set [18:53:03] yeah, the fileset is low prio [18:53:08] 10Puppet, 10Beta-Cluster-Infrastructure, 10Developer Productivity, 10Patch-For-Review: puppetdb on deployment-puppetdb03 keeps getting OOMKilled - https://phabricator.wikimedia.org/T248041 (10hashar) [18:53:12] it only is important if it is used [18:53:16] ack [18:53:33] hashar: glad it's done :) [18:53:58] double check on the bacula dashboard that you still have backup after all changes :-) [18:54:16] 10Puppet, 10Beta-Cluster-Infrastructure, 10Developer Productivity, 10Patch-For-Review: puppetdb on deployment-puppetdb03 keeps getting OOMKilled - https://phabricator.wikimedia.org/T248041 (10hashar) > The deployment-puppetdb03 instance has just 2 GB of memory, I guess we can get it resized to a slightly... [18:54:37] jynus: I went to bconsole and used "run" to tell bacula to schedule a new job.. 
and then i just wait and check again later if the files are there now [18:54:46] cool then [18:55:39] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install mwlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T267271 (10wiki_willy) a:03Cmjohnson Arrived on Dec 12 [18:55:48] mutante: I was suggesting this as an alternative: https://grafana.wikimedia.org/d/413r2vbWk/bacula [18:56:12] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash-be103[345] - https://phabricator.wikimedia.org/T267666 (10wiki_willy) a:03Cmjohnson Arrived on Dec 12 [18:56:13] !log nskaggs@cumin1001 START - Cookbook wmcs.wikireplicas.add_wiki [18:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:40] hashar: ^ what jynus said [18:56:58] specially for non-roots indeed [18:57:00] https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-site=eqiad&var-job=doc1001.eqiad.wmnet-Monthly-1st-Tue-production-srv-doc [18:57:04] that's the one [18:57:06] for example, I can see doc1001.eqiad.wmnet-Monthly-1st-Tue-production-srv-docroot-org-wikime is no more [18:57:21] so I guess the one you link replaced it [18:57:31] (03CR) 10Mstyles: [C: 03+2] Add new helm chart for rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: 10Mstyles) [18:57:33] it existed before.. 
but now there are a lot more files in it [18:58:09] jynus: yes, that matches https://gerrit.wikimedia.org/r/c/operations/puppet/+/650214 so we can as well just merge it now [18:59:03] (03Merged) 10jenkins-bot: Add new helm chart for rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: 10Mstyles) [18:59:29] (03CR) 10Dzahn: [C: 03+2] bacula: remove unused srv-docroot-org-wikimedia-doc [puppet] - 10https://gerrit.wikimedia.org/r/650214 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [19:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201217T1900). [19:00:04] tgr, Tchanders, and Pchelolo: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:28] jynus: thank you :] [19:00:30] Notice: /Stage[main]/Bacula::Director/File[/etc/bacula/conf.d/fileset-srv-docroot-org-wikimedia-doc.conf]/ensure: removed [19:00:35] I can deploy today (unless someone else wants to) [19:00:36] ran puppet on backup1001 [19:00:43] no issue but set is gone [19:00:58] Tchanders: hi, around? :-) [19:01:03] o/ [19:01:12] mutante: another trick is to use check_bacula.py it has a nice output for quick checks: https://phabricator.wikimedia.org/P13594 [19:01:12] Hello! 
[19:01:15] hashar: so you can check the grafana link later yourself [19:01:25] (03CR) 10Urbanecm: [C: 03+2] Configure GrowthExperiments on Bangla Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649586 (https://phabricator.wikimedia.org/T266020) (owner: 10Gergő Tisza) [19:01:28] hi all [19:01:43] mutante: last backup 6002709680 bytes [19:01:52] jynus: thanks, the non-root option is the best one for this case [19:01:54] (same info will be on grafana) [19:01:58] indeed [19:02:17] tgr_: as the patch is not prod-testable, I'll straight sync [19:02:32] mutante: jynus: perfect thank you for helping slash that 4-year-old tech debt we had. I guess we can leave room for the backport window now. I will have dinner :] [19:02:50] (03Merged) 10jenkins-bot: Configure GrowthExperiments on Bangla Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649586 (https://phabricator.wikimedia.org/T266020) (owner: 10Gergő Tisza) [19:02:52] hashar: mutante <3 [19:03:12] PROBLEM - Check systemd state on ms-be2032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:03:18] termination status: 84, which is ord('T'), from "terminated" - it is all clear in bacula's mind :-) [19:03:28] cool. :) alright guys, also glad we got the old thing cleaned up. let's all have food [19:04:02] (03CR) 10Urbanecm: [C: 03+2] Article: view from old revision cache - set correct revId.
[core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/649923 (https://phabricator.wikimedia.org/T270361) (owner: 10Ppchelko) [19:04:19] backup "last taken: 4 minutes ago" :) [19:04:33] (03CR) 10Urbanecm: [C: 03+2] extension-list: Add IPInfo extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644548 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [19:05:45] (03Merged) 10jenkins-bot: extension-list: Add IPInfo extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644548 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [19:05:56] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: b24b6f5e99d63e0ed28132fb73c290a87a16e9a2: Configure GrowthExperiments on Bangla Wikipedia [noop for prod] (T266020) (duration: 01m 03s) [19:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:59] T266020: Deploy Growth experiments at Bangla Wikipedia - https://phabricator.wikimedia.org/T266020 [19:06:10] tgr_: done [19:06:26] thanks Urbanecm! 
[19:06:31] no problem [19:07:04] !log nskaggs@cumin1001 Added views for new wiki: eowikivoyage T269427 [19:07:04] !log nskaggs@cumin1001 END (PASS) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=0) [19:07:05] (03PS4) 10Urbanecm: Add IPInfo config to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644549 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [19:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:08] T269427: Prepare and check storage layer for eowikivoyage - https://phabricator.wikimedia.org/T269427 [19:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:14] (03CR) 10Urbanecm: [C: 03+2] Add IPInfo config to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644549 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [19:08:02] !log urbanecm@deploy1001 Synchronized wmf-config/extension-list: 5180f12ee74feae1024ae4ec45e19cd775c805b2: extension-list: Add IPInfo extension (T260599) (duration: 01m 03s) [19:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:06] T260599: Deploy IP Info extension to beta cluster - https://phabricator.wikimedia.org/T260599 [19:08:08] (03Merged) 10jenkins-bot: Add IPInfo config to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644549 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [19:09:00] (03PS8) 10Urbanecm: Load IPInfo extension in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644551 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [19:09:15] (03CR) 10Urbanecm: [C: 03+2] Load IPInfo extension in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644551 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [19:09:18] RECOVERY - Check systemd state on ms-be2032 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:09:21] (03PS8) 10Urbanecm: Add IPInfo extension config to InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644550 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [19:09:35] (03CR) 10Urbanecm: [C: 03+2] Add IPInfo extension config to InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644550 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [19:09:50] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 54046876d2217200618236e22f77d2138345bb92: Add IPInfo config to InitialiseSettings.php [noop for prod] (T260599) (duration: 01m 03s) [19:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:19] (03Merged) 10jenkins-bot: Load IPInfo extension in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644551 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [19:11:18] (03Merged) 10jenkins-bot: Add IPInfo extension config to InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644550 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [19:11:59] Tchanders: your extension should be live at beta within ~30 minutes (there's no way how to easily affect that). Please do ping me if it doesn't happen for any reason :) [19:12:08] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: 2c1b339b47454e5df4f78b87ca577afb1d2f32af: Load IPInfo extension in CommonSettings.php [noop for prod] (T260599) (duration: 01m 04s) [19:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:20] Ok - thanks Urbanecm! [19:12:32] no problem! [19:13:07] Urbanecm: please ping me when on mwdebug! I can actually test this one. [19:14:03] Pchelolo: sure. I'm actually done with all the other patches already, do you want to self-service, or should I deploy this one too? 
[19:14:24] could you please? [19:14:41] Pchelolo: sure [19:15:01] waiting on CI then [19:16:56] (03PS2) 10Urbanecm: add wikitech to mediawiki import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649874 (https://phabricator.wikimedia.org/T270284) (owner: 10ArielGlenn) [19:17:02] (03CR) 10Nray: [C: 03+1] Revert "vue: Log component errors" [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/649924 (owner: 10Phuedx) [19:17:16] (03PS3) 10Urbanecm: add wikitech to mediawiki import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649874 (https://phabricator.wikimedia.org/T270284) (owner: 10ArielGlenn) [19:17:38] (03CR) 10Urbanecm: [C: 03+2] "developer-managed wiki, sounds like a sensible thing to do => let's do it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649874 (https://phabricator.wikimedia.org/T270284) (owner: 10ArielGlenn) [19:19:31] (03Merged) 10jenkins-bot: add wikitech to mediawiki import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649874 (https://phabricator.wikimedia.org/T270284) (owner: 10ArielGlenn) [19:21:20] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 87cce8411f87a9668b47c0f5ff692fdc96e6255a: add wikitech to mediawiki import sources (T270284) (duration: 01m 04s) [19:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:24] T270284: Add wikitech to import sources for mediawiki - https://phabricator.wikimedia.org/T270284 [19:34:27] thanks for setting up the beta bn config pages, Urbanecm! [19:34:47] no problem, just wanted to try it out while waiting for CI :D [19:36:31] (03Merged) 10jenkins-bot: Article: view from old revision cache - set correct revId. [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/649923 (https://phabricator.wikimedia.org/T270361) (owner: 10Ppchelko) [19:36:35] finally [19:36:39] Pchelolo: still around? 
:) [19:36:44] yup [19:37:11] Pchelolo: pulled to mwdebug1001 [19:37:16] one sec [19:37:36] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal, 10Patch-For-Review: Prepare a proof of concept of the minimum setup capable of backup and recover testwiki media files - https://phabricator.wikimedia.org/T264189 (10jcrespo) It works! @LSobanski :-) {F33948363} I have now done the... [19:37:55] all good Urbanecm [19:38:01] syncing then [19:39:41] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.22/includes/page/Article.php: 6c97eede7b02b6999b66b150a7d7303515b713ae: Article: view from old revision cache - set correct revId (T270361) (duration: 01m 04s) [19:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:45] and...here you go Pchelolo :) [19:39:45] T270361: "Edit" link on old revisions of a page links to the latest revision instead of the revision being viewed - https://phabricator.wikimedia.org/T270361 [19:39:47] anything else? [19:39:54] thank you Urbanecm! [19:39:59] any time! [19:41:15] !log Morning B&C window done [19:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:52] Urbanecm: The messages for the next extension that Tchanders just got deployed still appear as the message keys. Do you know how often does the localization build happen for beta? [19:43:52] Urbanecm it is? Do you have time for a security patch? [19:44:12] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Migrate WDQS to Debian Buster - https://phabricator.wikimedia.org/T244753 (10RKemper) This is all done. No issues currently. 
[19:53:55] 10Operations, 10LDAP-Access-Requests: Add lilients_WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T264590 (10Legoktm) [19:57:11] (03PS12) 10Jcrespo: [WIP] We continue with swift listing and download tests for media backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/643980 [19:58:12] (03CR) 10jerkins-bot: [V: 04-1] [WIP] We continue with swift listing and download tests for media backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/643980 (owner: 10Jcrespo) [20:00:05] marxarelli and longma: Your horoscope predicts another unfortunate Mediawiki train - American Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201217T2000). [20:04:56] (03PS1) 10Gergő Tisza: Get rid of GrowthExperiments morelike mode on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/650241 (https://phabricator.wikimedia.org/T266020) [20:05:50] (03PS13) 10Jcrespo: [WIP] We continue with swift listing and download tests for media backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/643980 [20:06:52] (03PS2) 10Gergő Tisza: [beta] Get rid of GrowthExperiments morelike mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/650241 (https://phabricator.wikimedia.org/T266020) [20:07:06] 10Operations, 10vm-requests: eqiad: 1 VM request for doc (doc1002) - https://phabricator.wikimedia.org/T269977 (10Dzahn) I looked at Grafana and CPU / RAM usage of doc1001 over the last 6 months and it seems WAY overprovisioned with 4 CPUs and 4GB RAM. https://grafana.wikimedia.org/d/000000377/host-overview?o... [20:08:00] 10Operations, 10vm-requests: eqiad: 1 VM request for doc (doc1002) - https://phabricator.wikimedia.org/T269977 (10Dzahn) regarding disk space: using 38% of the 150GB. 
Let's reduce it to 2 CPUs / 2GB RAM and 120 GB [20:08:53] 10Operations, 10vm-requests: eqiad: 1 VM request for doc (doc1002) - https://phabricator.wikimedia.org/T269977 (10Dzahn) [20:13:58] (03PS2) 10Razzi: kafka: add remaining nodes to kafka test cluster [puppet] - 10https://gerrit.wikimedia.org/r/649894 (https://phabricator.wikimedia.org/T268202) [20:14:40] (03PS1) 10Dduvall: all wikis to 1.36.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/650243 [20:14:42] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.36.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/650243 (owner: 10Dduvall) [20:15:10] thanks for the merge and deploy, Urban ecm! [20:15:29] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/650243 (owner: 10Dduvall) [20:16:44] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.22 [20:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:31] !log all wikis to 1.36.0-wmf.22 complete. no new errors or concerning rates (refs T267415) [20:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:35] T267415: 1.36.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T267415 [20:31:30] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-deploy100[1-4] - https://phabricator.wikimedia.org/T267955 (10RobH) 05Open→03Invalid this is accidental dupe of T267050, so invalidating [20:31:53] 10Operations, 10MediaWiki-Docker, 10Traffic, 10serviceops, 10User-zeljkofilipin: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (10ema) The current theory is that the problem boils down to the following HEAD request... 
[20:31:58] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [20:32:04] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [20:34:55] (03PS1) 10Ema: vcl: do not gzip responses to HEAD requests [puppet] - 10https://gerrit.wikimedia.org/r/650248 (https://phabricator.wikimedia.org/T270270) [20:35:36] (03CR) 10Razzi: [C: 03+2] kafka: add remaining nodes to kafka test cluster [puppet] - 10https://gerrit.wikimedia.org/r/649894 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [20:40:59] 10Operations, 10Wikimedia-Mailing-lists: New mailing list requested for Wikimedia MA User Group - https://phabricator.wikimedia.org/T270434 (10Anass_Sedrati) [20:41:16] (03CR) 10Ema: [C: 03+2] vcl: do not gzip responses to HEAD requests [puppet] - 10https://gerrit.wikimedia.org/r/650248 (https://phabricator.wikimedia.org/T270270) (owner: 10Ema) [20:43:48] 10Operations, 10Wikimedia-Mailing-lists: New mailing list requested for Wikimedia MA User Group - https://phabricator.wikimedia.org/T270434 (10Majavah) Hi, have you asked Dzahn before assigning this task to them? [20:46:35] 10Operations, 10Wikimedia-Mailing-lists: New mailing list requested for Wikimedia MA User Group - https://phabricator.wikimedia.org/T270434 (10Anass_Sedrati) Hello @Majavah, I am unfortunately not familiar with Phabricator, so I did not know how this post will get attention if it is not assigned to anyone. I j... 
[20:50:16] 10Operations, 10Wikimedia-Mailing-lists: New mailing list requested for Wikimedia MA User Group - https://phabricator.wikimedia.org/T270434 (10Anass_Sedrati) a:05Dzahn→03None [20:55:31] PROBLEM - Check systemd state on kafka-test1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:55:34] (03PS1) 10Ema: vcl: do not gzip docker-registry responses [puppet] - 10https://gerrit.wikimedia.org/r/650256 (https://phabricator.wikimedia.org/T270270) [20:56:16] 10Operations, 10Wikimedia-Mailing-lists: New mailing list requested for Wikimedia MA User Group - https://phabricator.wikimedia.org/T270434 (10Majavah) Please see https://mediawiki.org/wiki/Bug_management/Phabricator_etiquette - it's up to an individual to decide what they plan to work on. In this case, this ti... [20:58:27] 10Operations, 10SRE-Access-Requests, 10Wikimedia-Mailing-lists: New mailing list requested for Wikimedia MA User Group - https://phabricator.wikimedia.org/T270434 (10ssingh) a:03RLazarus [20:59:03] (03CR) 10Ema: [C: 03+2] vcl: do not gzip docker-registry responses [puppet] - 10https://gerrit.wikimedia.org/r/650256 (https://phabricator.wikimedia.org/T270270) (owner: 10Ema) [21:00:33] 10Operations, 10SRE-Access-Requests, 10Wikimedia-Mailing-lists: New mailing list requested for Wikimedia MA User Group - https://phabricator.wikimedia.org/T270434 (10Anass_Sedrati) Thank you @Majavah for your clarifications. I apologize, I was not aware about this. My intention was simply to tag someone in... 
[21:01:50] !log cp3052: ban 'req.http.host == "docker-registry.wikimedia.org"' T270270 [21:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:53] T270270: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 [21:06:03] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [21:06:03] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [21:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:40] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [21:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:06] 10Operations, 10MediaWiki-Docker, 10Traffic, 10serviceops, and 2 others: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (10ema) >>! In T270270#6699691, @ema wrote: > That being said, I suspect that our VCL trying to do... [21:10:32] 10Operations, 10MediaWiki-Docker, 10Traffic, 10serviceops, and 2 others: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (10kostajh) >>! In T270270#6699794, @ema wrote: >>>! In T270270#6699761, @gerritbot wrote: >> Chan... 
[21:12:08] (03CR) 10Ema: vcl: do not stream responses to docker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/650156 (https://phabricator.wikimedia.org/T270270) (owner: 10Ema) [21:14:25] (03PS1) 10Ema: Revert "vcl: do not stream responses to docker" [puppet] - 10https://gerrit.wikimedia.org/r/650191 [21:14:38] (03CR) 10Cwhite: [C: 03+2] profile: add normalize_level filter script [puppet] - 10https://gerrit.wikimedia.org/r/649956 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [21:15:19] (03PS2) 10Ema: Revert "vcl: do not stream responses to docker" [puppet] - 10https://gerrit.wikimedia.org/r/650191 (https://phabricator.wikimedia.org/T270270) [21:15:36] 10Operations, 10SRE-Access-Requests, 10Wikimedia-Mailing-lists: New mailing list requested for Wikimedia MA User Group - https://phabricator.wikimedia.org/T270434 (10RLazarus) Hi @Anass_Sedrati, happy to work on this. :) Rather than create a new mailing list, we'd rather get the existing `wikimedia-ma` back... [21:16:30] 10Operations, 10MediaWiki-Docker, 10Traffic, 10serviceops, and 2 others: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (10ema) >>! In T270270#6699806, @gerritbot wrote: > Change 650191 had a related patch set uploaded... 
[21:18:52] (03CR) 10Cwhite: [C: 03+2] Generate Logstash ECS cleanup filter as part of regular build process [software/ecs] - 10https://gerrit.wikimedia.org/r/644638 (owner: 10Cwhite) [21:19:30] 10Operations, 10vm-requests: eqiad: 1 VM request for doc (doc1002) - https://phabricator.wikimedia.org/T269977 (10Dzahn) p:05Triage→03Medium [21:19:35] (03Merged) 10jenkins-bot: Generate Logstash ECS cleanup filter as part of regular build process [software/ecs] - 10https://gerrit.wikimedia.org/r/644638 (owner: 10Cwhite) [21:19:59] 10Operations, 10vm-requests: codfw: 1 VM request for doc.wikimedia.org (doc2001) - https://phabricator.wikimedia.org/T269978 (10Dzahn) p:05Triage→03Medium [21:20:06] 10Operations, 10vm-requests: codfw: 1 VM request for doc.wikimedia.org (doc2001) - https://phabricator.wikimedia.org/T269978 (10Dzahn) a:03Dzahn [21:21:30] PROBLEM - Check systemd state on kafka-test1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:27:24] PROBLEM - Check systemd state on kafka-test1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:41:27] (03PS4) 10Cwhite: profile: identify network devices logging input [puppet] - 10https://gerrit.wikimedia.org/r/645181 (https://phabricator.wikimedia.org/T268806) [21:41:32] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [21:41:40] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [21:42:03] 10Operations, 10SRE-Access-Requests, 10Wikimedia-Mailing-lists: New mailing list requested for Wikimedia MA User Group - https://phabricator.wikimedia.org/T270434 (10Anass_Sedrati) Dear @RLazarus, This is definitely a good solution. We have sent a number of emails earlier regarding that mail list, but were... [21:48:24] 10Operations, 10Technical-blog-posts, 10Traffic: 3rd part of blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T270074 (10srodlund) @ema I published this. Will you look it over and let me know if you see anything that needs changing or fixing before I... [21:49:57] 10Operations, 10SRE-Access-Requests, 10Wikimedia-Mailing-lists: New mailing list requested for Wikimedia MA User Group - https://phabricator.wikimedia.org/T270434 (10RLazarus) Perfect! That's done, then -- both list owners should have just received an automated email containing the new admin password. Please... [21:50:30] 10Operations, 10SRE-Access-Requests, 10Wikimedia-Mailing-lists: Update admin password for Wikimedia-MA mailing list - https://phabricator.wikimedia.org/T270434 (10RLazarus) [21:52:58] 10Operations, 10SRE-Access-Requests, 10Wikimedia-Mailing-lists: Update admin password for Wikimedia-MA mailing list - https://phabricator.wikimedia.org/T270434 (10Anass_Sedrati) Dear @RLazarus , Thank you very much for the quick support. I confirm the reception of the email, and even that I could re-access... [21:54:40] 10Operations, 10Technical-blog-posts, 10Traffic: 3rd part of blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T270074 (10Aklapper) The bottom says that `This post is part 2 of a 3 part series.` (Plus I wonder if `million` and `billion` should really... 
[21:55:32] (03PS5) 10Cwhite: profile: identify network devices logging input [puppet] - 10https://gerrit.wikimedia.org/r/645181 (https://phabricator.wikimedia.org/T268806) [21:56:23] 10Operations, 10SRE-Access-Requests, 10Wikimedia-Mailing-lists: Update admin password for Wikimedia-MA mailing list - https://phabricator.wikimedia.org/T270434 (10RLazarus) 05Open→03Resolved Great! Feel free to reopen, or file a new ticket under #wikimedia-mailing-lists, if you need anything else. [21:57:29] 10Operations, 10Graphoid, 10serviceops, 10Platform Engineering (Icebox): Undeploy graphoid for phase 4 wiki's - https://phabricator.wikimedia.org/T270443 (10Jseddon) [22:00:16] (03PS1) 10Seddon: Undeploy graphoid for arwiki. Phase 4. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/650269 (https://phabricator.wikimedia.org/T270443) [22:00:41] (03CR) 10Cwhite: [C: 03+2] profile: identify network devices logging input [puppet] - 10https://gerrit.wikimedia.org/r/645181 (https://phabricator.wikimedia.org/T268806) (owner: 10Cwhite) [22:01:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=gerrit site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:03:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:04:07] (03PS1) 10Cwhite: profile: remove errant commas from template_syslog_json [puppet] - 10https://gerrit.wikimedia.org/r/650270 [22:04:24] (03PS2) 10Cwhite: profile: remove errant commas from template_syslog_json [puppet] - 10https://gerrit.wikimedia.org/r/650270 [22:04:38] (03CR) 10Cwhite: [V: 03+2 C: 03+2] profile: remove errant commas from template_syslog_json [puppet] - 10https://gerrit.wikimedia.org/r/650270 (owner: 10Cwhite) [22:05:18] (03PS1) 10RLazarus: admin: Add lilients to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/650271 (https://phabricator.wikimedia.org/T264590) [22:06:20] 10Operations, 10Technical-blog-posts, 10Traffic: 3rd part of blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T270074 (10srodlund) Ah! Thank you for catching that. I fixed both of these. [22:06:26] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Mike Pham - https://phabricator.wikimedia.org/T270438 (10RLazarus) p:05Triage→03Medium a:03RLazarus [22:07:59] (03CR) 10Ssingh: [C: 03+1] "+1, uid matches, user has signed NDA on record." 
[puppet] - 10https://gerrit.wikimedia.org/r/650271 (https://phabricator.wikimedia.org/T264590) (owner: 10RLazarus) [22:09:37] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [22:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:00] (03PS1) 10RLazarus: admin: Add mttp to admin_only_users [puppet] - 10https://gerrit.wikimedia.org/r/650272 (https://phabricator.wikimedia.org/T270438) [22:11:14] (03CR) 10RLazarus: [C: 03+2] admin: Add lilients to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/650271 (https://phabricator.wikimedia.org/T264590) (owner: 10RLazarus) [22:13:36] (03CR) 10Dzahn: [C: 03+1] "looks good, UID matches LDAP and corp-LDAP says employeeType: Full Time. just needs manager approval" [puppet] - 10https://gerrit.wikimedia.org/r/650272 (https://phabricator.wikimedia.org/T270438) (owner: 10RLazarus) [22:15:02] (03CR) 10Ssingh: "It's possible I am missing something but did you mean ldap_only_users and not admin_only_users in the commit message?" [puppet] - 10https://gerrit.wikimedia.org/r/650272 (https://phabricator.wikimedia.org/T270438) (owner: 10RLazarus) [22:15:39] (03PS2) 10RLazarus: admin: Add mttp to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/650272 (https://phabricator.wikimedia.org/T270438) [22:15:52] (03CR) 10Dzahn: [C: 03+1] "..or doesn't need it.. 
but it looks ok either way" [puppet] - 10https://gerrit.wikimedia.org/r/650272 (https://phabricator.wikimedia.org/T270438) (owner: 10RLazarus) [22:15:54] (03CR) 10jerkins-bot: [V: 04-1] admin: Add mttp to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/650272 (https://phabricator.wikimedia.org/T270438) (owner: 10RLazarus) [22:20:15] (03PS3) 10RLazarus: admin: Add mttp to admin_only_users [puppet] - 10https://gerrit.wikimedia.org/r/650272 (https://phabricator.wikimedia.org/T270438) [22:21:44] PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [22:21:46] PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [22:21:49] (03PS4) 10RLazarus: admin: Add mttp to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/650272 (https://phabricator.wikimedia.org/T270438) [22:21:58] logstash ^^ is me [22:23:13] (03PS1) 10Razzi: Add fake kerberos keytabs for an-tool1010 [labs/private] - 10https://gerrit.wikimedia.org/r/650275 [22:23:20] RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [22:23:22] RECOVERY - logstash syslog TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [22:24:34] (03CR) 10RLazarus: [C: 03+2] "Love to merge conflict with myself... thanks both!" 
[puppet] - 10https://gerrit.wikimedia.org/r/650272 (https://phabricator.wikimedia.org/T270438) (owner: 10RLazarus) [22:28:41] (03CR) 10Razzi: [V: 03+2 C: 03+2] Add fake kerberos keytabs for an-tool1010 [labs/private] - 10https://gerrit.wikimedia.org/r/650275 (owner: 10Razzi) [22:29:42] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add lilients_WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T264590 (10RLazarus) 05Open→03Resolved This is done: ` rzl@mwmaint1002:~$ ldapsearch -x cn=wmde | grep lilients member: uid=lilients,ou=people,dc=wikimedia,... [22:29:47] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add lilients_WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T264590 (10RLazarus) [22:30:20] PROBLEM - Check systemd state on logstash2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:30:36] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27194/console" [puppet] - 10https://gerrit.wikimedia.org/r/650179 (https://phabricator.wikimedia.org/T268219) (owner: 10Razzi) [22:30:44] PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [22:30:54] (03PS3) 10Jeena Huneidi: Add mw-cli to the releases server [puppet] - 10https://gerrit.wikimedia.org/r/649958 (https://phabricator.wikimedia.org/T250241) [22:31:56] RECOVERY - Check systemd state on logstash2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:32:20] RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [22:34:01] (03CR) 10Jeena Huneidi: [C: 03+2] 
blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/649931 (owner: 10PipelineBot) [22:35:05] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Mike Pham - https://phabricator.wikimedia.org/T270438 (10RLazarus) 05Open→03Resolved Hi @MPhamWMF, welcome to the Foundation! Your wikitech username was indeed the right one (thanks!) and delightfully I used it to... [22:35:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:35:43] (03Merged) 10jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/649931 (owner: 10PipelineBot) [22:37:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:40:59] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Mike Pham - https://phabricator.wikimedia.org/T270438 (10Ottomata) Hello! @MPhamWMF will also need a posix account in the analytics-privatedata-users group to access some Superset dashboards (those based on Presto ins... [22:47:27] !log jhuneidi@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . 
[22:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:01] (03CR) 10Cwhite: [C: 03+2] profile: deploy filter_scripts directory to logstash 7 collectors [puppet] - 10https://gerrit.wikimedia.org/r/650170 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [22:49:07] (03PS2) 10Cwhite: profile: deploy filter_scripts directory to logstash 7 collectors [puppet] - 10https://gerrit.wikimedia.org/r/650170 (https://phabricator.wikimedia.org/T234565) [22:51:54] !log jhuneidi@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [22:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:15] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27195/console" [puppet] - 10https://gerrit.wikimedia.org/r/650179 (https://phabricator.wikimedia.org/T268219) (owner: 10Razzi) [22:54:23] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Mike Pham - https://phabricator.wikimedia.org/T270438 (10RLazarus) 05Resolved→03Open Why, I'd swear that wasn't there yesterday... I mean, can do! In that case, in addition to Analytics approval (thanks @Ottomata... [22:59:29] (03PS1) 10Bstorm: kubeadm and paws: tuning options for stacked control plane [puppet] - 10https://gerrit.wikimedia.org/r/650280 (https://phabricator.wikimedia.org/T267966) [23:04:41] !log jhuneidi@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' . 
[23:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:36] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 121 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:13:08] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 9 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:22:02] (03CR) 10Bstorm: "It may be worthwhile to include a compaction variable in here as well. The default is pretty frequent for when we are having io issues." [puppet] - 10https://gerrit.wikimedia.org/r/650280 (https://phabricator.wikimedia.org/T267966) (owner: 10Bstorm) [23:22:43] (03PS1) 10RLazarus: admin: Add mttp to analytics-privatedata-users, but with no SSH. [puppet] - 10https://gerrit.wikimedia.org/r/650298 (https://phabricator.wikimedia.org/T270438) [23:23:11] (03CR) 10RLazarus: [C: 04-2] "This is still pending manager approval on the ticket, so I won't merge it yet, but I wanted to check about the implementation, since this " [puppet] - 10https://gerrit.wikimedia.org/r/650298 (https://phabricator.wikimedia.org/T270438) (owner: 10RLazarus) [23:23:54] (03Abandoned) 10Bstorm: toolsdb: Fail over toolsdb to its replica [puppet] - 10https://gerrit.wikimedia.org/r/636468 (https://phabricator.wikimedia.org/T263679) (owner: 10Bstorm) [23:26:27] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Mike Pham - https://phabricator.wikimedia.org/T270438 (10JKatzWMF) @RLazarus approved, thanks! 
[23:30:44] (03CR) 10RLazarus: "> From [1] I take it Mike *should* be a member of analytics-privatedata-users, but *doesn't* need SSH access, so I've promoted him to full" [puppet] - 10https://gerrit.wikimedia.org/r/650298 (https://phabricator.wikimedia.org/T270438) (owner: 10RLazarus) [23:33:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=gerrit site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:34:36] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Mike Pham - https://phabricator.wikimedia.org/T270438 (10MPhamWMF) Awesome. Just tried it and it worked. Thanks everyone! [23:35:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:36:43] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Mike Pham - https://phabricator.wikimedia.org/T270438 (10RLazarus) Great! Pardon a brief delay in getting you set up with the additional access @Ottomata mentioned -- you're the guinea pig for a new arrangement, so I'd... [23:37:43] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10Dzahn) a:05Dzahn→03Cmjohnson Assigning to you to find out if we need to keep this open or not. If it's done, feel free to close as resolved... 
[23:38:40] (03PS7) 10Bstorm: wikireplicas: Work toward a proxy setup on multi-instance replicas [puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) [23:39:03] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [23:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:34] (03CR) 10Jforrester: Add IPInfo extension config to InitialiseSettings-labs.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644550 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [23:56:53] (03CR) 10RLazarus: [C: 03+2] Add mw-cli to the releases server [puppet] - 10https://gerrit.wikimedia.org/r/649958 (https://phabricator.wikimedia.org/T250241) (owner: 10Jeena Huneidi)