[00:20:39] (03PS1) 10Tim Starling: [1.36.0-wmf.14] CategoryChangesAsRdfTest: Skip testCategorization [core] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/639932 (https://phabricator.wikimedia.org/T266850) [00:21:03] (03CR) 10Tim Starling: [C: 03+2] [1.36.0-wmf.14] CategoryChangesAsRdfTest: Skip testCategorization [core] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/639932 (https://phabricator.wikimedia.org/T266850) (owner: 10Tim Starling) [00:23:03] 10Operations, 10observability: Monitor internal CA expirations - https://phabricator.wikimedia.org/T171157 (10Aklapper) >>! In T171157#3538015, @faidon wrote: > Setting to stalled until we decide what to actually do with the internal CA, as we're considering dropping it entirely in favour of other options. @a... [00:43:57] (03Merged) 10jenkins-bot: [1.36.0-wmf.14] CategoryChangesAsRdfTest: Skip testCategorization [core] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/639932 (https://phabricator.wikimedia.org/T266850) (owner: 10Tim Starling) [01:02:10] (03PS2) 10Tim Starling: Pass along ignorewarnings param to all individual chunks being uploaded [core] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/639955 (https://phabricator.wikimedia.org/T264333) [01:15:35] (03CR) 10Tim Starling: [C: 03+2] "Not sure if CI is running..." 
[core] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/639955 (https://phabricator.wikimedia.org/T264333) (owner: 10Tim Starling) [01:29:16] !log tstarling@deploy1001 sync-file aborted: fixing UBN T266903 (duration: 00m 01s) [01:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:29:24] T266903: Recently more broken files (premature end of file at 5MB size) that were cross-wiki uploaded to Commons - https://phabricator.wikimedia.org/T266903 [01:30:32] !log tstarling@deploy1001 Synchronized php-1.36.0-wmf.14/resources/src/mediawiki.Upload.js: fixing UBN T266903 (duration: 01m 07s) [01:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:32:54] !log tstarling@deploy1001 Synchronized php-1.36.0-wmf.14/resources/src/mediawiki.api/upload.js: fixing UBN T266903 (duration: 01m 06s) [01:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:35:19] !log tstarling@deploy1001 Synchronized php-1.36.0-wmf.14/tests/phpunit/maintenance/categoryChangesAsRdfTest.php: this was cherry-picked to make CI pass, pushing it out just for a clean staging dir (duration: 01m 06s) [01:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:41:14] PROBLEM - Disk space on maps1004 is CRITICAL: DISK CRITICAL - free space: /srv 53325 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps1004&var-datasource=eqiad+prometheus/ops [01:48:10] 10Operations, 10Commons, 10SRE-swift-storage, 10MW-1.36-notes (1.36.0-wmf.16; 2020-11-03), 10Patch-For-Review: Recently more broken files (premature end of file at 5MB size) that were cross-wiki uploaded to Commons - https://phabricator.wikimedia.org/T266903 (10tstarling) 05Open→03Resolved a:03tstar... 
[02:18:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:30:34] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:36:00] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:36:28] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:03:20] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10elukey) As FYI I'd need to advertise the host maintenance a couple of days in advance to users, so happy to work with Chris/John in any time... 
[07:17:47] !log restart gerrit on gerrit2001 (OOM registered for two days ago, uptime from systemctl since a month ago, probably in a weird state) [07:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:10] (03CR) 10Filippo Giunchedi: "Deployment consists of running puppet on ms-fe* hosts, then depool + restart swift-proxy on said hosts" [puppet] - 10https://gerrit.wikimedia.org/r/638109 (https://phabricator.wikimedia.org/T266155) (owner: 10Effie Mouzeli) [07:54:14] (03CR) 10Filippo Giunchedi: [C: 03+1] swift: pass the 'X-Client-IP' header to thumbor [puppet] - 10https://gerrit.wikimedia.org/r/638109 (https://phabricator.wikimedia.org/T266155) (owner: 10Effie Mouzeli) [07:57:21] (03CR) 10Filippo Giunchedi: [C: 03+1] Revert "certspotter: temporarily disable cron job" [puppet] - 10https://gerrit.wikimedia.org/r/475453 (https://phabricator.wikimedia.org/T204993) (owner: 10Alex Monk) [07:57:28] (03PS4) 10Muehlenhoff: admin: remove niedzielski [puppet] - 10https://gerrit.wikimedia.org/r/636671 (owner: 10Niedzielski) [07:58:19] !log Restarted CI Jenkins on contint2001 for Java upgrade [07:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:01] (03CR) 10Muehlenhoff: [C: 03+2] admin: remove niedzielski [puppet] - 10https://gerrit.wikimedia.org/r/636671 (owner: 10Niedzielski) [08:05:36] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Enable CAS authentication for Grafana - https://phabricator.wikimedia.org/T262512 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff [08:07:15] I am going to restart Gerrit in a few for a java upgrade [08:08:33] 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (10elukey) >>! In T267065#6608905, @Jclark-ctr wrote: > @elukey Hey when you get a chance can you let me know best day i can schedule with you some movies next week? I had a... 
[08:10:33] (03PS1) 10Filippo Giunchedi: hieradata: re-enable swiftrepl in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/640062 [08:11:07] !log Restarting Gerrit on gerrit1001 and gerrit2001 [08:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:35] hashar: lol I had to restart gerrit on 2001 earlier since it was stuck, bad timing! [08:20:58] haha [08:21:36] (03PS1) 10Hashar: gerrit: remove Velocity log configuration [puppet] - 10https://gerrit.wikimedia.org/r/640066 [08:22:29] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1003/26364/" [puppet] - 10https://gerrit.wikimedia.org/r/640062 (owner: 10Filippo Giunchedi) [08:24:12] !log configure traceoptions on pfw3-eqiad - T263833 [08:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:39] (03PS1) 10Giuseppe Lavagetto: Move git-review to Recommends [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/640067 [08:29:09] (03PS5) 10David Caro: [apt::conf] Allow passing integers as value [puppet] - 10https://gerrit.wikimedia.org/r/639778 [08:31:22] (03PS3) 10Muehlenhoff: Add a Hiera option to enable ICU63 component [puppet] - 10https://gerrit.wikimedia.org/r/639751 (https://phabricator.wikimedia.org/T264991) [08:31:49] (03PS1) 10Filippo Giunchedi: swift: ensure /var/log/swiftrepl is a directory [puppet] - 10https://gerrit.wikimedia.org/r/640068 [08:33:45] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26365" [puppet] - 10https://gerrit.wikimedia.org/r/640068 (owner: 10Filippo Giunchedi) [08:34:43] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] swift: ensure /var/log/swiftrepl is a directory [puppet] - 10https://gerrit.wikimedia.org/r/640068 (owner: 10Filippo Giunchedi) [08:34:54] 10Operations, 10Traffic, 10netops: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (10ayounsi) Juniper 
opened Enhancement Request ER-081995 to address it. Not sure that will be useful to us by the time it's implemented, but at least it could help people in the future. [09:00:45] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add thanos query-frontend jobs [puppet] - 10https://gerrit.wikimedia.org/r/638120 (https://phabricator.wikimedia.org/T261281) (owner: 10Filippo Giunchedi) [09:00:52] (03PS5) 10Filippo Giunchedi: prometheus: add thanos query-frontend jobs [puppet] - 10https://gerrit.wikimedia.org/r/638120 (https://phabricator.wikimedia.org/T261281) [09:01:47] (03CR) 10Filippo Giunchedi: [C: 03+2] role: add query_frontend to thanos frontend [puppet] - 10https://gerrit.wikimedia.org/r/638121 (https://phabricator.wikimedia.org/T261281) (owner: 10Filippo Giunchedi) [09:02:35] (03PS6) 10Filippo Giunchedi: role: add query_frontend to thanos frontend [puppet] - 10https://gerrit.wikimedia.org/r/638121 (https://phabricator.wikimedia.org/T261281) [09:05:24] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26366" [puppet] - 10https://gerrit.wikimedia.org/r/639811 (https://phabricator.wikimedia.org/T267017) (owner: 10Cwhite) [09:06:38] !log enable thanos query-frontend on thanos-fe hosts - T261281 [09:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:46] T261281: Improve performance of Thanos (+ Prometheus) - https://phabricator.wikimedia.org/T261281 [09:09:59] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 04-1] "This not only opens the firewall but also adds grafana hosts to the AM cluster: https://puppet-compiler.wmflabs.org/compiler1001/26366/ale" [puppet] - 10https://gerrit.wikimedia.org/r/639811 (https://phabricator.wikimedia.org/T267017) (owner: 10Cwhite) [09:11:48] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Proton, 10Traffic, and 2 others: PDF download generates invalid PDF files - https://phabricator.wikimedia.org/T266559 (10Jgiannelos) 
@LGoto After some debugging on T266373 it looks like the issue is more complicated and affects more users than we asse... [09:19:43] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Proton, 10Traffic, and 2 others: PDF download generates invalid PDF files - https://phabricator.wikimedia.org/T266559 (10Jgiannelos) @sdkim Should i bump this to high priority? According to https://phabricator.wikimedia.org/T266373#6608804 there is a b... [09:20:42] (03CR) 10Muehlenhoff: [C: 03+2] Add a Hiera option to enable ICU63 component [puppet] - 10https://gerrit.wikimedia.org/r/639751 (https://phabricator.wikimedia.org/T264991) (owner: 10Muehlenhoff) [09:21:13] (03CR) 10JMeybohm: mediawiki: migrate to debian::codename and ensure_packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/639786 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [09:26:45] (03PS1) 10Hashar: gerrit: exit the JVM after OutOfMemoryError [puppet] - 10https://gerrit.wikimedia.org/r/640074 (https://phabricator.wikimedia.org/T267517) [09:27:22] (03CR) 10Hashar: "Guillaume has done the same for the ElasticSearch JVM :]" [puppet] - 10https://gerrit.wikimedia.org/r/640074 (https://phabricator.wikimedia.org/T267517) (owner: 10Hashar) [09:35:34] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: add ack daemon [puppet] - 10https://gerrit.wikimedia.org/r/638998 (https://phabricator.wikimedia.org/T266535) (owner: 10Filippo Giunchedi) [09:35:45] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: enable acks and silences on alerts.w.o [puppet] - 10https://gerrit.wikimedia.org/r/638999 (https://phabricator.wikimedia.org/T266535) (owner: 10Filippo Giunchedi) [09:35:53] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: add prometheus jobs for am acks [puppet] - 10https://gerrit.wikimedia.org/r/639000 (https://phabricator.wikimedia.org/T266535) (owner: 10Filippo Giunchedi) [09:37:00] (03PS2) 10Filippo Giunchedi: profile: add prometheus jobs for am acks [puppet] - 
10https://gerrit.wikimedia.org/r/639000 (https://phabricator.wikimedia.org/T266535) [09:37:49] !log installing libexif security updates [09:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:42] Hi everyone! I'm evaluating using spicerack (https://wikitech.wikimedia.org/wiki/Spicerack) to automate stuff around the wikimedia cloud environment, but I seem to be unable to get hold of the repository for wmflib (required by it), I see the docs here (https://doc.wikimedia.org/wmflib/master/index.html) but the repos are not there. Does anyone know where it lives? [09:42:13] found it xd, https://github.com/wikimedia/operations-software-pywmflib -- https://gerrit.wikimedia.org/r/admin/repos/operations/software/pywmflib it seems it has been renamed and the docs still point to the wrong url [09:42:29] dcaro: https://wikitech.wikimedia.org/wiki/Python/Wmflib [09:42:54] dcaro: hi, it has been authored by volans and the wmflib got extracted by kormat iirc [09:43:27] feel free to ping me for more context/information and if you want to discuss what are the options to best approach what you need to do [09:43:47] dcaro: Gerrit has a generic feature to search for repos: "BROWSE" -> "Repositories" and then you can add a filter [09:43:49] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/26367/" [puppet] - 10https://gerrit.wikimedia.org/r/640074 (https://phabricator.wikimedia.org/T267517) (owner: 10Hashar) [09:45:08] volans: sure, pm? [09:45:55] sure [09:46:10] moritzm: yep, but interestingly enough it does not find it with "wmflib" only: https://gerrit.wikimedia.org/r/admin/repos/q/filter:wmflib, only pywmflib :S [09:47:12] 10Operations, 10CAS-SSO: Update CAS to 6.2 - https://phabricator.wikimedia.org/T265857 (10fgiunchedi) I was looking at prometheus jobs down alert today and idp shows up there, I'm assuming because the prometheus endpoint has been removed in Ia4b089af. Please remove the IDP prometheus job as well, thanks! 
[09:47:35] yes, IIRC it searches for things starting with the filter text, and it's definitely not ideal [09:48:00] (03CR) 10Urbanecm: [C: 03+2] "let's do this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639650 (https://phabricator.wikimedia.org/T262689) (owner: 10Urbanecm) [09:48:15] and try adding a * into the filter.... [09:48:54] (03Merged) 10jenkins-bot: Revert "Change votewiki language temporarily to fa for fawiki elections" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639650 (https://phabricator.wikimedia.org/T262689) (owner: 10Urbanecm) [09:49:51] moritzm: Error 400 (Bad Request): Cannot create full-text query with value: * [09:49:57] Endpoint: /projects/?* [09:49:59] LOL [09:50:33] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 7b0a81f4294dcedfd5736884900cb561de9a080e: Revert "Change votewiki language temporarily to fa for fawiki elections" (T262689) (duration: 01m 08s) [09:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:41] T262689: Carry out the 2020 fawiki elections on votewiki - https://phabricator.wikimedia.org/T262689 [09:52:33] !log Restarting Gerrit on gerrit1001 and gerrit2001 in order to have the JVM to exit after OutOfMemory # T267517 [09:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:39] T267517: gerrit-replica was down for a while and no one noticed - https://phabricator.wikimedia.org/T267517 [09:53:44] (03PS1) 10Muehlenhoff: Disable apereo_cas_jobs in Prometheus scraping [puppet] - 10https://gerrit.wikimedia.org/r/640086 (https://phabricator.wikimedia.org/T265857) [09:54:07] volans: I don't want to know what else is hidden in there :-) [09:54:33] !log Start of `mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log` in a tmux session at mwmaint1002 (wiki=svwiki; T246539) [09:54:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:40] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [09:56:56] !log Purge https://vote.wikimedia.org/wiki/Main_Page (T262689) [09:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:06] T262689: Carry out the 2020 fawiki elections on votewiki - https://phabricator.wikimedia.org/T262689 [09:57:25] moritzm: me neither! ;) [09:57:42] moritzm: I see... hahaha [09:59:25] (03CR) 10Filippo Giunchedi: [C: 03+1] Disable apereo_cas_jobs in Prometheus scraping [puppet] - 10https://gerrit.wikimedia.org/r/640086 (https://phabricator.wikimedia.org/T265857) (owner: 10Muehlenhoff) [10:00:20] (03PS2) 10Muehlenhoff: Disable apereo_cas_jobs in Prometheus scraping [puppet] - 10https://gerrit.wikimedia.org/r/640086 (https://phabricator.wikimedia.org/T265857) [10:01:15] 10Operations, 10netops, 10cloud-services-team (Kanban): Enable L3 routing on cloudsw nodes - https://phabricator.wikimedia.org/T265288 (10aborrero) Email sent to the community in (very) short notice https://lists.wikimedia.org/pipermail/cloud-announce/2020-November/000339.html [10:06:12] (03PS1) 10Jberkel: Enable "Cite" button in toolbar for enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640087 (https://phabricator.wikimedia.org/T267504) [10:08:16] (03CR) 10Muehlenhoff: [C: 03+2] Disable apereo_cas_jobs in Prometheus scraping [puppet] - 10https://gerrit.wikimedia.org/r/640086 (https://phabricator.wikimedia.org/T265857) (owner: 10Muehlenhoff) [10:09:45] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools as a beta feature on fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640088 (https://phabricator.wikimedia.org/T265446) [10:11:30] !log renumber cloud-xlink1-eqiad [10:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:54] (03CR) 10Klausman: [C: 03+2] res: migrate to debian::codename and ensure_packages 
[puppet] - 10https://gerrit.wikimedia.org/r/639794 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [10:23:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:23:35] 10Operations, 10observability: Monitor internal CA expirations - https://phabricator.wikimedia.org/T171157 (10akosiaris) 05Stalled→03Declined >>! In T171157#6611644, @Aklapper wrote: >>>! In T171157#3538015, @faidon wrote: >> Setting to stalled until we decide what to actually do with the internal CA, as w... [10:25:37] (03CR) 10Alexandros Kosiaris: [C: 03+2] "This was per my recommendation at https://github.com/wikimedia/ores/pull/352#issuecomment-723127020, so merging" [puppet] - 10https://gerrit.wikimedia.org/r/639898 (https://phabricator.wikimedia.org/T263910) (owner: 10Ladsgroup) [10:28:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=idp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:30:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:30:50] (03CR) 10Jbond: [C: 03+2] P:cumin::master: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639804 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [10:30:54] (03PS1) 10HitomiAkane: Adding 'eliminator' group to arz.wikipedia and grant permissions to add/remove an 'eliminator' right to the local sysops. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640089 [10:31:39] jbond42: did you see my comment on the above patch? 
[10:31:57] I'm afraid it might break in WMCS instances [10:32:08] *break puppet [10:32:09] (03CR) 10Jbond: "Can someone in cloud confirm if we can drop stretch support for cumin?" [puppet] - 10https://gerrit.wikimedia.org/r/639804 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [10:32:13] 10Operations, 10netops, 10cloud-services-team (Kanban): Enable L3 routing on cloudsw nodes - https://phabricator.wikimedia.org/T265288 (10ayounsi) >>! In T265288#6608514, @aborrero wrote: > Also, I don't see the CIDR object created in netbox for `185.15.56.240/29 `. Could you please create it? Or I can creat... [10:32:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Can't hurt. And with https://gerrit.wikimedia.org/r/639898 already merged, we might have some results soon" [puppet] - 10https://gerrit.wikimedia.org/r/638467 (https://phabricator.wikimedia.org/T263910) (owner: 10Ladsgroup) [10:32:21] volans: yes just canceled my +2 [10:32:29] ack, thx :) [10:32:37] no thanks [10:35:02] !log merging 638109 and roll restart ms-fe* hosts to pick up the change [10:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:13] (03CR) 10Effie Mouzeli: [C: 03+2] swift: pass the 'X-Client-IP' header to thumbor [puppet] - 10https://gerrit.wikimedia.org/r/638109 (https://phabricator.wikimedia.org/T266155) (owner: 10Effie Mouzeli) [10:36:14] (03CR) 10Muehlenhoff: [C: 03+1] P:cumin::master: migrate to debian::codename and ensure_packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/639804 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [10:36:36] !log add 185.15.56.240/29 IPs to relevant cloudsw interfaces - T265288 [10:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:42] T265288: Enable L3 routing on cloudsw nodes - https://phabricator.wikimedia.org/T265288 [10:36:57] (03PS2) 10Jbond: zuul: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639868 
(https://phabricator.wikimedia.org/T266479) [10:37:12] (03PS2) 10HitomiAkane: Adding 'eliminator' group to arz.wikipedia and grant permissions to add/remove an 'eliminator' right to the local sysops. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640089 (https://phabricator.wikimedia.org/T267286) [10:37:18] 10Operations, 10MW-on-K8s, 10TechCom-RFC, 10serviceops, and 2 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10tstarling) [10:37:30] 10Operations, 10MW-on-K8s, 10TechCom-RFC, 10serviceops, and 2 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10tstarling) [10:40:14] (03CR) 10Muehlenhoff: [C: 03+2] ntp: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/605071 (owner: 10Muehlenhoff) [10:40:28] (03PS4) 10Jbond: P:cumin::master: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639804 (https://phabricator.wikimedia.org/T266479) [10:40:46] (03CR) 10Jbond: P:cumin::master: migrate to debian::codename and ensure_packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/639804 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [10:41:33] (03CR) 10Jbond: [C: 03+2] zuul: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639868 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [10:43:09] (03PS2) 10Jbond: varnish: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639865 (https://phabricator.wikimedia.org/T266479) [10:43:54] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26369" [puppet] - 10https://gerrit.wikimedia.org/r/639865 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [10:45:35] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26370" [puppet] - 10https://gerrit.wikimedia.org/r/639865 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [10:46:40] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [10:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26371" [puppet] - 10https://gerrit.wikimedia.org/r/639865 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [10:47:26] (03CR) 10Jbond: [V: 03+1 C: 03+2] varnish: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639865 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [10:49:26] (03CR) 10Jbond: [C: 03+2] uwsgi: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639864 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [10:49:50] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:27] (03PS1) 10Muehlenhoff: profile::envoy: Remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/640092 (https://phabricator.wikimedia.org/T267396) [10:51:35] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26372" [puppet] - 10https://gerrit.wikimedia.org/r/639864 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [10:53:48] (03PS2) 10Jbond: ulog: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639863 (https://phabricator.wikimedia.org/T266479) [10:54:12] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/640092 (https://phabricator.wikimedia.org/T267396) (owner: 10Muehlenhoff) [10:54:51] (03CR) 10Jbond: [C: 03+2] ulog: 
migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639863 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [10:56:21] (03CR) 10Jbond: [C: 03+2] thumbor: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639862 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [11:00:22] (03PS1) 10Ayounsi: Remove 208.80.155.88/29 from cloud4 prefix list [homer/public] - 10https://gerrit.wikimedia.org/r/640093 (https://phabricator.wikimedia.org/T265288) [11:02:22] (03PS1) 10JMeybohm: Initial release of upstream binary packages [debs/calico] - 10https://gerrit.wikimedia.org/r/640094 (https://phabricator.wikimedia.org/T266893) [11:02:46] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/639804 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [11:02:49] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Enable L3 routing on cloudsw nodes - https://phabricator.wikimedia.org/T265288 (10ayounsi) To be pushed on both cloudsw at the same time as the neutron commands (or at least at the same time as the one deleting any `208.80.155.` IP... 
[11:03:43] (03PS1) 10JMeybohm: Build a calico-images package [debs/calico] - 10https://gerrit.wikimedia.org/r/640095 (https://phabricator.wikimedia.org/T266893) [11:09:40] (03CR) 10Jbond: [C: 03+2] testreduce: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639861 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [11:09:50] jouncebot: next [11:09:50] In 0 hour(s) and 20 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201109T1130) [11:10:56] (03CR) 10Urbanecm: [C: 04-2] "Eliminator permissions are sysop-like, as they allow users to exercise all three traditional sysop powers: delete pages, block users and p" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640089 (https://phabricator.wikimedia.org/T267286) (owner: 10HitomiAkane) [11:11:42] (03CR) 10Jbond: [C: 03+2] P:cumin::master: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639804 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [11:12:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26374" [puppet] - 10https://gerrit.wikimedia.org/r/639860 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [11:12:46] (03CR) 10Jbond: [V: 03+1 C: 03+2] tendril: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639860 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [11:14:04] (03PS1) 10Arturo Borrero Gonzalez: cloud: refresh cloudinstances2b-gw router address [dns] - 10https://gerrit.wikimedia.org/r/640096 (https://phabricator.wikimedia.org/T265288) [11:19:15] (03PS1) 10Elukey: kerberos: add dns_canonicalize_hostname = false to clients [puppet] - 10https://gerrit.wikimedia.org/r/640100 (https://phabricator.wikimedia.org/T257412) [11:21:37] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Archive/Remove deprecated 
calico gerrit repositories - https://phabricator.wikimedia.org/T267539 (10JMeybohm) [11:21:46] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Archive/Remove deprecated calico gerrit repositories - https://phabricator.wikimedia.org/T267539 (10JMeybohm) p:05Triage→03Low [11:21:49] (03PS2) 10Elukey: kerberos: add dns_canonicalize_hostname = false to clients [puppet] - 10https://gerrit.wikimedia.org/r/640100 (https://phabricator.wikimedia.org/T257412) [11:25:25] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Archive/Remove deprecated calico gerrit repositories - https://phabricator.wikimedia.org/T267539 (10Aklapper) FYI https://phabricator.wikimedia.org/project/profile/2829/ links to https://phabricator.wikimedia.org/maniphest/task/edit/form/33/ cove... [11:29:23] 10Operations, 10ops-eqsin, 10DC-Ops: cr2-eqsin: fan failure - https://phabricator.wikimedia.org/T267544 (10ayounsi) p:05Triage→03High [11:30:04] jan_drewniak: (Dis)respected human, time to deploy Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201109T1130). Please do the needful. [11:33:21] (03CR) 10Arturo Borrero Gonzalez: "the patch LGTM. We need some real-life testing before merging though." 
[puppet] - 10https://gerrit.wikimedia.org/r/638146 (https://phabricator.wikimedia.org/T267433) (owner: 10Ahmon Dancy) [11:36:45] (03CR) 10Jbond: mediawiki: migrate to debian::codename and ensure_packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/639786 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [11:38:41] (03PS2) 10Jbond: systemd: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639859 (https://phabricator.wikimedia.org/T266479) [11:45:50] (03CR) 10Jbond: [C: 03+2] systemd: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639859 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [11:46:55] (03CR) 10Jbond: [C: 03+2] swift: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639858 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [11:47:49] (03CR) 10Jbond: [C: 03+2] sudo: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639857 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [11:49:23] (03CR) 10Jbond: [C: 03+2] smart: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639856 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [11:50:04] (03CR) 10Jbond: [C: 03+2] service: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639854 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [11:50:16] (03CR) 10Jbond: [C: 03+2] rsyslog: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639853 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [11:53:05] (03PS2) 10Jbond: wmcs: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639852 (https://phabricator.wikimedia.org/T266479) [11:53:17] (03CR) 10Jbond: [C: 03+2] striker: migrate to debian::codename and 
ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639851 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [11:54:00] (03PS2) 10Jbond: simplelap: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639850 (https://phabricator.wikimedia.org/T266479) [11:54:28] (03CR) 10Jbond: [C: 03+2] simplelamp2: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639849 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [11:54:59] (03CR) 10Jbond: [C: 03+2] puppetmaster::standalone: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639848 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [11:55:14] (03CR) 10Jbond: [C: 03+2] lists: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639845 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [11:55:25] (03CR) 10Jbond: [C: 03+2] wmcs: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639852 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [11:55:32] (03CR) 10Jbond: [C: 03+2] simplelap: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639850 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [11:57:40] !log restart dbstore1004 mariadb instances [11:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:59] (03CR) 10Jbond: [C: 03+2] O:alerting_host: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639844 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [11:58:23] !log installing remaining openldap updates on stretch [11:58:26] (03CR) 10Jbond: [C: 03+2] racktables: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639843 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [11:58:27] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:52] (03CR) 10Jbond: [C: 03+2] query_service: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639841 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [11:59:19] (03CR) 10Jbond: [C: 03+2] P:puppet_compiler::packages: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639837 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: May I have your attention please! European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201109T1200) [12:00:05] MatmaRex: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:18] (03PS2) 10Jbond: P:prometheus: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639836 (https://phabricator.wikimedia.org/T266479) [12:00:36] I can deploy today! [12:00:37] hello [12:00:39] Hi MatmaRex :) [12:01:03] (03CR) 10Jbond: [C: 03+2] P:prometheus: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639836 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [12:01:28] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools as a beta feature on fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640088 (https://phabricator.wikimedia.org/T265446) (owner: 10Bartosz Dziewoński) [12:02:35] (03Merged) 10jenkins-bot: Enable DiscussionTools as a beta feature on fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640088 (https://phabricator.wikimedia.org/T265446) (owner: 10Bartosz Dziewoński) [12:03:35] MatmaRex: pulled to mwdebug1001, can you have a look, please? 
[12:03:48] yeah [12:04:30] Urbanecm: seems good [12:04:33] syncing [12:05:58] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 87b3eede24fb407ddd226ad65817ab8adf44aeb8: Enable DiscussionTools as a beta feature on fiwiki (T265446) (duration: 01m 06s) [12:06:03] MatmaRex: done. Anything else? [12:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:08] T265446: Make config change to enable Reply Tool as Beta Feature on fi.wiki - https://phabricator.wikimedia.org/T265446 [12:06:14] nope [12:06:15] thanks! [12:06:18] Happy to help! [12:07:00] PROBLEM - Check systemd state on db1113 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:07:02] PROBLEM - Check systemd state on db1096 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:07:15] (03PS3) 10Urbanecm: Add wgNamespaceAliases for zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637870 (https://phabricator.wikimedia.org/T266925) (owner: 10Hamish) [12:07:20] (03CR) 10Urbanecm: [C: 03+2] Add wgNamespaceAliases for zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637870 (https://phabricator.wikimedia.org/T266925) (owner: 10Hamish) [12:07:23] (03CR) 10Volans: [C: 03+1] "LGTM for what is related to the Netbox integration, I'll leave to the other reviewers with more context on the actual content." [dns] - 10https://gerrit.wikimedia.org/r/640096 (https://phabricator.wikimedia.org/T265288) (owner: 10Arturo Borrero Gonzalez) [12:07:38] PROBLEM - Check systemd state on db2089 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:07:54] PROBLEM - Check systemd state on db1108 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:08:04] PROBLEM - Check systemd state on db1105 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:08:10] (03Merged) 10jenkins-bot: Add wgNamespaceAliases for zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637870 (https://phabricator.wikimedia.org/T266925) (owner: 10Hamish) [12:09:12] jbond42: the db failures might be related, prometheus-mysqld-exporter [12:10:07] volans: my sync just finished, is it safe to continue deploying? [12:10:13] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 11b8f6236d159962bdebccd6dcacb72e600ec6b5: Add wgNamespaceAliases for zhwikinews (T266925) (duration: 01m 06s) [12:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:25] T266925: Append wgNamespaceAliases for Chinese Wikinews - https://phabricator.wikimedia.org/T266925 [12:10:39] unexpected collect.global_status for the mysqld exporter restarts [12:10:42] PROBLEM - Check systemd state on db2138 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:11:04] PROBLEM - Check systemd state on db2087 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:11:44] Urbanecm: AFAICT those errors are just related to the prometheus exporter so shouldn't be related to mediawiki deployments in any way [12:11:49] nor affect the DBs themselves [12:11:52] okay, thanks.
[12:11:53] (03PS2) 10Jbond: O:alerting_host: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639844 (https://phabricator.wikimedia.org/T266479) [12:12:04] did someone deploy a change to prometheus breaking the config? [12:12:10] volans: thanks ill take a look [12:12:22] PROBLEM - Check systemd state on db1101 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:12:27] seems like fallout of https://gerrit.wikimedia.org/r/639836 [12:12:38] jbond42: why merge with no review? [12:13:11] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26376" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/639786 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [12:13:25] !log urbanecm@mwmaint1002:~$ mwscript namespaceDupes.php --wiki=zhwikinews --fix --add-prefix=BROKEN # T266925 [12:13:29] (03PS1) 10Jcrespo: Revert "P:prometheus: migrate to debian::codename and ensure_packages" [puppet] - 10https://gerrit.wikimedia.org/r/640126 [12:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:36] (03CR) 10Jcrespo: [C: 03+2] Revert "P:prometheus: migrate to debian::codename and ensure_packages" [puppet] - 10https://gerrit.wikimedia.org/r/640126 (owner: 10Jcrespo) [12:14:50] PROBLEM - Check systemd state on db1150 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:15:30] PROBLEM - Check systemd state on db1146 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:15:36] PROBLEM - Check systemd state on db2137 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:15:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=mysql-misc site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:15:46] (03PS1) 10Jbond: Revert "P:prometheus: migrate to debian::codename and ensure_packages" [puppet] - 10https://gerrit.wikimedia.org/r/640127 [12:16:08] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "P:prometheus: migrate to debian::codename and ensure_packages" [puppet] - 10https://gerrit.wikimedia.org/r/640127 (owner: 10Jbond) [12:16:12] RECOVERY - Check systemd state on db2089 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:16:54] PROBLEM - Check systemd state on db2078 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:15] (03CR) 10JMeybohm: [V: 03+1 C: 03+1] "The PCC diff is fine for what I understand from https://phabricator.wikimedia.org/T266479" [puppet] - 10https://gerrit.wikimedia.org/r/639786 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [12:18:18] RECOVERY - Check systemd state on db1105 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:28] PROBLEM - Check systemd state on matomo1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:19:48] PROBLEM - Check systemd state on db1098 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:20:18] RECOVERY - Check systemd state on db2078 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:20:38] RECOVERY - Check systemd state on db1096 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:20:42] RECOVERY - Check systemd state on db2137 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:20:52] RECOVERY - Check systemd state on db1101 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:20:56] RECOVERY - Check systemd state on db2138 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:18] RECOVERY - Check systemd state on db2087 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:30] RECOVERY - Check systemd state on db1098 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:32] RECOVERY - Check systemd state on db1108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:38] RECOVERY - Check systemd state on db1150 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:22:18] RECOVERY - Check systemd state on db1146 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:22:20] RECOVERY - Check systemd state on db1113 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:22:26] RECOVERY - Prometheus jobs 
reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:28:27] (03PS1) 10Jbond: P:prometheus: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/640128 (https://phabricator.wikimedia.org/T266479) [12:28:38] (03CR) 10Effie Mouzeli: "After merging 638109, swift was sending a cp* ip in the X-Client-IP header. I am looking into the varnish part and see what can be done." [puppet] - 10https://gerrit.wikimedia.org/r/636024 (https://phabricator.wikimedia.org/T266155) (owner: 10Gilles) [12:29:10] (03CR) 10Effie Mouzeli: [C: 04-1] "Until we make it work as we want, I will -1 it" [puppet] - 10https://gerrit.wikimedia.org/r/636024 (https://phabricator.wikimedia.org/T266155) (owner: 10Gilles) [12:29:18] (03CR) 10Jbond: [C: 04-2] P:prometheus: migrate to debian::codename and ensure_packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/640128 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [15:15:49] (03PS3) 10Muehlenhoff: profile::analytics::database::meta: Remove obsolete OS check [puppet] - 10https://gerrit.wikimedia.org/r/640150 (https://phabricator.wikimedia.org/T267396) [15:17:04] RECOVERY - Check systemd state on centrallog1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:43] 10Operations, 10Mail, 10Patch-For-Review: Disavow emails from wikipedia.com - https://phabricator.wikimedia.org/T184230 (10Aklapper) Three years later, are there any plans to review or merge herron's three open patches in Gerrit? 
[15:21:43] (03PS3) 10Cwhite: hiera: add parameter to enable access from hosts and add grafana [puppet] - 10https://gerrit.wikimedia.org/r/639811 (https://phabricator.wikimedia.org/T267017) [15:23:07] (03CR) 10Filippo Giunchedi: [C: 03+1] hiera: add parameter to enable access from hosts and add grafana [puppet] - 10https://gerrit.wikimedia.org/r/639811 (https://phabricator.wikimedia.org/T267017) (owner: 10Cwhite) [15:24:01] (03PS1) 10Hashar: Add metrics-reporter-prometheus plugin [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/640174 (https://phabricator.wikimedia.org/T184086) [15:25:10] (03PS2) 10Filippo Giunchedi: profile: redirect to grafana-rw with referer [puppet] - 10https://gerrit.wikimedia.org/r/640158 (https://phabricator.wikimedia.org/T262512) [15:28:27] (03CR) 10Muehlenhoff: [C: 03+2] profile::analytics::database::meta: Remove obsolete OS check [puppet] - 10https://gerrit.wikimedia.org/r/640150 (https://phabricator.wikimedia.org/T267396) (owner: 10Muehlenhoff) [15:29:38] (03CR) 10Muehlenhoff: profile::envoy: Remove jessie support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/640092 (https://phabricator.wikimedia.org/T267396) (owner: 10Muehlenhoff) [15:29:41] (03PS2) 10Muehlenhoff: profile::envoy: Remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/640092 (https://phabricator.wikimedia.org/T267396) [15:35:53] (03PS5) 10Effie Mouzeli: hieradata: enable ICU 63 in mwdebug2002 [puppet] - 10https://gerrit.wikimedia.org/r/639780 (https://phabricator.wikimedia.org/T264991) [15:39:52] (03PS3) 10Muehlenhoff: profile::envoy: Remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/640092 (https://phabricator.wikimedia.org/T267396) [15:49:49] phabricator down it seems [15:50:35] D: [15:50:41] Nikerabbit: up here [15:50:42] working here [15:51:07] Might be a tad slower than unusual but it's up for sure [15:51:09] we just had a bunch of cannot connect to the database errors during triage meeting [15:51:21] 
And that might be me noticing slow because someone said down [15:51:21] and definitely been slower today [15:52:41] (03CR) 10Muehlenhoff: [C: 03+2] profile::envoy: Remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/640092 (https://phabricator.wikimedia.org/T267396) (owner: 10Muehlenhoff) [15:53:28] Nikerabbit: https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=8&orgId=1&refresh=5m&var-server=phab1001&var-datasource=thanos&var-cluster=misc&from=now-30m&to=now [15:53:34] That's a traffic drop [15:55:50] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [15:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:04] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install deploy2002 - https://phabricator.wikimedia.org/T266363 (10Papaul) ` [edit interfaces interface-range vlan-private1-c-codfw] member ge-3/0/13 { ... } + member ge-3/0/14; [edit interfaces interface-range disabled] - member ge-3/0... [16:03:28] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install deploy2002 - https://phabricator.wikimedia.org/T266363 (10Papaul) [16:04:05] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:14] (03CR) 10Jbond: [V: 03+1 C: 03+2] cdh: replace os_version with debian::codename [puppet] - 10https://gerrit.wikimedia.org/r/639758 (https://phabricator.wikimedia.org/T267396) (owner: 10Jbond) [16:05:18] (03CR) 10Jbond: [V: 03+1 C: 03+2] envoproxy: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639764 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [16:05:35] (03CR) 10Jbond: [C: 03+2] base:standard_packages: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639747 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [16:05:54] (03CR) 10Alexandros Kosiaris: [C: 03+2] 
Add new java images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/635743 (https://phabricator.wikimedia.org/T265504) (owner: 10Alexandros Kosiaris) [16:06:00] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Add new java images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/635743 (https://phabricator.wikimedia.org/T265504) (owner: 10Alexandros Kosiaris) [16:11:25] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install deploy2002 - https://phabricator.wikimedia.org/T266363 (10Papaul) [16:17:49] 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Jgreen) [16:19:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:20:08] !log Netbox prod: mass import from PuppetDB (cables, etc) - T262899 [16:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:17] T262899: Import servers<->switches cables in eqiad & codfw - https://phabricator.wikimedia.org/T262899 [16:20:46] (03PS1) 10Hashar: Add metrics-reporter-jmx plugin [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/640177 (https://phabricator.wikimedia.org/T184086) [16:21:00] (03PS1) 10Jgreen: add frdb1004 to nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/640178 [16:22:43] (03PS1) 10Papaul: DNS: Add production DNS for deploy2002 [dns] - 10https://gerrit.wikimedia.org/r/640180 [16:24:17] cscott: about? just trying to get a handle on status of T267370 [16:24:19] T267370: Use of FormatMetadata::formatNum with non-numeric value was deprecated in MediaWiki 1.36. 
[Called from FormatMetadata::makeFormattedData] - https://phabricator.wikimedia.org/T267370 [16:26:54] (03Abandoned) 10Jgreen: add frdb1004 to nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/640178 (owner: 10Jgreen) [16:27:36] (03CR) 10Jbond: [C: 03+2] "PCC shows noop https://puppet-compiler.wmflabs.org/compiler1001/26383/" [puppet] - 10https://gerrit.wikimedia.org/r/639819 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [16:28:23] (03Abandoned) 10Jgreen: flip payments-listener back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/638155 (owner: 10Jgreen) [16:28:37] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install mwlog2002.codfw.wmnet - https://phabricator.wikimedia.org/T267272 (10Papaul) a:03Papaul [16:28:50] (03CR) 10Jgreen: [C: 03+2] Add frdb1004 to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/639830 (https://phabricator.wikimedia.org/T265086) (owner: 10Dwisehaupt) [16:29:59] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install mwlog2002.codfw.wmnet - https://phabricator.wikimedia.org/T267272 (10Papaul) [16:30:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:12] 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Jgreen) 05Open→03Resolved [16:34:04] brennen: seems that patch was already deployed as https://gerrit.wikimedia.org/r/c/mediawiki/core/+/639505, and at https://phabricator.wikimedia.org/T267370#6607683, cscott says it only happens with images from a certain camera. I'd say it just needs someone verifying it at commons? 
[16:34:19] !log End of `mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log` in a tmux session at mwmaint1002 (wiki=kowiki; T246539) [16:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:29] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [16:34:45] !log Start of `mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log` in a tmux session at mwmaint1002 (wiki=trwiki; T246539) [16:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:35:34] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/639824 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [16:37:29] 10Operations, 10ops-eqiad, 10Analytics-Radar: analytics1046/analytics1057 stuck in booting - https://phabricator.wikimedia.org/T267392 (10fdans) [16:37:45] ottomata: hey, are you joining o11y office hours ? 
[16:38:10] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/639785 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [16:40:35] !log imported 2.0.2+0.5.7-1~wmf3+php72+buster1 to component/php72 for buster-wikimedia [16:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:31] (03PS1) 10Jbond: P:prometheus: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/640181 (https://phabricator.wikimedia.org/T266479) [16:43:33] (03CR) 10Hashar: [C: 04-1] "The metrics do show on the JavaMelody Mbeans tree , but they are not exposed in /monitoring?format=prometheus T184086#6613887" [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/640177 (https://phabricator.wikimedia.org/T184086) (owner: 10Hashar) [16:43:33] oh godog thanks for reminder [16:43:37] in analytics grooming but coming now! [16:44:10] godog: can you give me link? [16:44:23] your combined team meeting and office hours cal event is hard to find [16:45:05] ok, from the look of things i think it's safe to roll train forward to group1. [16:45:14] going ahead with that shortly. [16:45:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:27] thanks brennen !
[16:45:37] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [16:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26384" [puppet] - 10https://gerrit.wikimedia.org/r/640181 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [16:47:42] (03PS4) 10Reedy: peek: make git::clone ensure => latest [puppet] - 10https://gerrit.wikimedia.org/r/637790 (https://phabricator.wikimedia.org/T265912) [16:48:26] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC is noop" [puppet] - 10https://gerrit.wikimedia.org/r/640181 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [16:48:46] (03CR) 10Urbanecm: [C: 04-1] "minor note inline, otherwise LGTM" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640087 (https://phabricator.wikimedia.org/T267504) (owner: 10Jberkel) [16:48:49] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:56] PROBLEM - Unmerged changes on repository puppet on puppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:48:59] (03PS1) 10Brennen Bearnes: group1 wikis to 1.36.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640183 [16:49:01] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.36.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640183 (owner: 10Brennen Bearnes) [16:49:03] (03CR) 10Alexandros Kosiaris: [C: 03+2] peek: make git::clone ensure => latest [puppet] - 10https://gerrit.wikimedia.org/r/637790 (https://phabricator.wikimedia.org/T265912) (owner: 10Reedy) [16:49:12] PROBLEM - Unmerged changes on repository puppet on puppetmaster2003 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:49:16] Jeff_Green: you ok for me to merge your change [16:49:48] jbond42: which one? I thought I had abandoned the most recent one [16:49:54] PROBLEM - Unmerged changes on repository puppet on puppetmaster1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:49:57] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640183 (owner: 10Brennen Bearnes) [16:50:06] PROBLEM - Unmerged changes on repository puppet on puppetmaster2002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:50:09] jeff there was a change to add host_name frdb1004 [16:50:34] https://gerrit.wikimedia.org/r/c/operations/puppet/+/639830 [16:50:40] RECOVERY - Unmerged changes on repository puppet on puppetmaster2001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:50:44] jbond42: ok, sorry, that's a miscommunication, we'll take care of it [16:50:56] RECOVERY - Unmerged changes on repository puppet on puppetmaster2003 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:51:07] i have merged it now to clear the alert as it seemed harmless [16:51:12] do you want me to revert? [16:51:36] RECOVERY - Unmerged changes on repository puppet on puppetmaster1002 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:51:48] RECOVERY - Unmerged changes on repository puppet on puppetmaster2002 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:52:19] jbond42: nope that's good, I did the gerrit merge for Dallas and didn't realize he still can't do the puppet-merge [16:52:37] ahh ok np probs it's deployed now [16:52:42] thanks! [16:52:53] thanks jbond42! [16:53:01] np :) [16:56:21] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.16 [16:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:07] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [16:57:09] (03PS2) 10Jberkel: Enable "Cite" button in toolbar for enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640087 (https://phabricator.wikimedia.org/T267504) [16:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:27] !log brennen@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.16 (duration: 01m 05s) [16:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:16] (03PS3) 10Jberkel: Enable "Cite" button in toolbar for enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640087 (https://phabricator.wikimedia.org/T267504) [17:04:38] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:45] 10Operations, 10homer, 10netops: Homer: Netbox driven switch interfaces - https://phabricator.wikimedia.org/T250429 (10ayounsi) [17:08:36] (03CR) 10Papaul: [C: 03+2] DNS: Add production DNS for deploy2002 [dns] - 10https://gerrit.wikimedia.org/r/640180 (owner: 10Papaul) [17:08:54] 10Operations, 10netbox: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) [17:14:01] (03CR) 10Jberkel: Enable "Cite" button in toolbar for enwiktionary (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640087 (https://phabricator.wikimedia.org/T267504) (owner: 10Jberkel) [17:14:34] 10Operations, 10Performance-Team, 10Platform Engineering, 10serviceops: Phasing out MediaWiki Redis - https://phabricator.wikimedia.org/T267581 (10jijiki) [17:16:55] 10Operations, 10Performance-Team, 10Platform Engineering, 10serviceops: Phasing out MediaWiki Redis - https://phabricator.wikimedia.org/T267581 (10jijiki) [17:18:32] 10Operations, 10Performance-Team, 10Platform Engineering, 10serviceops: Phasing out MediaWiki Redis - https://phabricator.wikimedia.org/T267581 (10jijiki) [17:18:34] 10Operations, 10Platform Engineering, 10serviceops, 10User-jijiki: Upgrade MediaWiki's Redis cluster to Debian Buster - https://phabricator.wikimedia.org/T265643 (10jijiki) [17:18:39] 10Operations, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10jijiki) [17:18:54] (03PS1) 10Jbond: prometheus::mysqld_exporter::instance: Refactor [puppet] - 10https://gerrit.wikimedia.org/r/640210 [17:19:28] (03CR) 10jerkins-bot: [V: 04-1] prometheus::mysqld_exporter::instance: Refactor [puppet] - 10https://gerrit.wikimedia.org/r/640210 (owner: 10Jbond) [17:21:43] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10observability, and 2 others: Add 
prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (10hashar) I then tried **`metrics-reporter-prometheus`**. The metrics are exposed under the plugin namespace: `/plugins/metrics-reporter-pr... [17:21:47] (03PS2) 10Jbond: prometheus::mysqld_exporter::instance: Refactor [puppet] - 10https://gerrit.wikimedia.org/r/640210 [17:22:20] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: enable ICU 63 in mwdebug2002 [puppet] - 10https://gerrit.wikimedia.org/r/639780 (https://phabricator.wikimedia.org/T264991) (owner: 10Effie Mouzeli) [17:22:57] 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (10Cmjohnson) @Jclark-ctr if you can give me the network ports you intend to use I will have them pre-configured as well. [17:26:42] (03PS3) 10Jbond: prometheus::mysqld_exporter::instance: Refactor [puppet] - 10https://gerrit.wikimedia.org/r/640210 [17:30:28] (03CR) 10Hashar: "The Gerrit internal metrics are then exposed on https://gerrit.wikimedia.org/r/monitoring?part=mbeans which is helpful for administrators " [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/640177 (https://phabricator.wikimedia.org/T184086) (owner: 10Hashar) [17:30:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26387" [puppet] - 10https://gerrit.wikimedia.org/r/640210 (owner: 10Jbond) [17:31:51] (03PS4) 10Jbond: prometheus::mysqld_exporter::instance: Refactor [puppet] - 10https://gerrit.wikimedia.org/r/640210 [17:32:05] 10Operations, 10Wikimedia-Mailing-lists: Bot unable to send messages to wikipedia-fr-wikimag - https://phabricator.wikimedia.org/T265844 (10Orlodrim) HTML messages are still consistently blocked. Is there at least a way to get the log of the spam filter? I went through past messages and tried to fix another de... 
[17:32:50] (03CR) 10Hashar: "That will exposes Gerrit internal metrics at https://gerrit.wikimedia.org/r/plugins/metrics-reporter-prometheus/metrics . Once deployed, w" [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/640174 (https://phabricator.wikimedia.org/T184086) (owner: 10Hashar) [17:32:58] (03PS1) 10BBlack: Add Digicert 2020 unified certs [puppet] - 10https://gerrit.wikimedia.org/r/640212 (https://phabricator.wikimedia.org/T261419) [17:33:53] !log updating mwdebug2002 to ICU 63 - T264991 [17:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:02] T264991: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 [17:34:26] (03PS2) 10BBlack: Add Digicert 2020 unified certs [puppet] - 10https://gerrit.wikimedia.org/r/640212 (https://phabricator.wikimedia.org/T261419) [17:35:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26388" [puppet] - 10https://gerrit.wikimedia.org/r/640210 (owner: 10Jbond) [17:37:52] (03PS5) 10Jbond: prometheus::mysqld_exporter::instance: Refactor [puppet] - 10https://gerrit.wikimedia.org/r/640210 [17:38:26] (03CR) 10jerkins-bot: [V: 04-1] prometheus::mysqld_exporter::instance: Refactor [puppet] - 10https://gerrit.wikimedia.org/r/640210 (owner: 10Jbond) [17:41:25] (03PS1) 10BBlack: configure digicert-2020 certificates [puppet] - 10https://gerrit.wikimedia.org/r/640213 (https://phabricator.wikimedia.org/T261419) [17:41:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26389" [puppet] - 10https://gerrit.wikimedia.org/r/640210 (owner: 10Jbond) [17:42:21] (03PS6) 10Jbond: prometheus::mysqld_exporter::instance: Refactor [puppet] - 10https://gerrit.wikimedia.org/r/640210 [17:43:12] (03CR) 10BBlack: [C: 03+2] Add Digicert 2020 unified certs [puppet] - 10https://gerrit.wikimedia.org/r/640212 
(https://phabricator.wikimedia.org/T261419) (owner: 10BBlack) [17:43:45] (03CR) 10Jbond: [V: 03+1] "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/640210 (owner: 10Jbond) [17:44:16] (03PS1) 10Andrew Bogott: Nova check_flavor_properties test: increase timeout again [puppet] - 10https://gerrit.wikimedia.org/r/640214 [17:44:27] (03Abandoned) 10Jbond: P:prometheus: migrate to debian::codename and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/640128 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [17:44:42] (03CR) 10Andrew Bogott: [C: 03+2] Nova check_flavor_properties test: increase timeout again [puppet] - 10https://gerrit.wikimedia.org/r/640214 (owner: 10Andrew Bogott) [17:45:54] (03PS1) 10Hashar: prometheus: collect Gerrit internal metrics [puppet] - 10https://gerrit.wikimedia.org/r/640215 (https://phabricator.wikimedia.org/T184086) [17:46:05] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10observability, and 2 others: Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (10hashar) a:03hashar [17:46:38] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10observability, and 2 others: Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (10hashar) [17:47:32] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/640215 (https://phabricator.wikimedia.org/T184086) (owner: 10Hashar) [17:47:55] (03CR) 10Hashar: [C: 04-1] "That requires a plugin on Gerrit side ( https://gerrit.wikimedia.org/r/c/operations/software/gerrit/+/640174 )" [puppet] - 10https://gerrit.wikimedia.org/r/640215 (https://phabricator.wikimedia.org/T184086) (owner: 10Hashar) [17:49:32] 10Operations, 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Traffic: Reduce cache TTL of schema.wikimedia.org - https://phabricator.wikimedia.org/T267557 (10Ottomata) p:05High→03Triage [17:49:59] 10Operations, 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Traffic: Reduce 
cache TTL of schema.wikimedia.org - https://phabricator.wikimedia.org/T267557 (10Ottomata) p:05Triage→03High [17:50:27] 10Operations, 10Analytics, 10Analytics-Kanban, 10Event-Platform: Reduce cache TTL of schema.wikimedia.org - https://phabricator.wikimedia.org/T267557 (10Ottomata) [17:51:37] !log standardize asw-d-codfw interfaces descriptions [17:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:13] (03PS1) 10Jbond: debian::codename::requre: allow passing a custom message to require [puppet] - 10https://gerrit.wikimedia.org/r/640217 [17:53:30] !log re-order asw-d-codfw interfaces-ranges [17:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:22] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10herron) >>! In T234854#6601662, @Krinkle wrote: >I think to the extent possible we should remain on Kibana 6 until and unless these are addressed by upstream Thanks for the f... [18:00:05] ryankemper: That opportune time is upon us again. Time for a Wikidata Query Service weekly deploy deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201109T1800). 
[18:02:27] (03PS1) 10BBlack: authdns: raise tcp_clients_per_thread to 4K [puppet] - 10https://gerrit.wikimedia.org/r/640219 (https://phabricator.wikimedia.org/T266746) [18:03:58] (03PS2) 10Jbond: debian::codename::requre: allow passing a custom message to require [puppet] - 10https://gerrit.wikimedia.org/r/640217 [18:04:00] (03PS1) 10Jbond: debian::codename::require: update code to use new debian:: functions [puppet] - 10https://gerrit.wikimedia.org/r/640221 (https://phabricator.wikimedia.org/T266479) [18:05:28] (03CR) 10jerkins-bot: [V: 04-1] debian::codename::require: update code to use new debian:: functions [puppet] - 10https://gerrit.wikimedia.org/r/640221 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [18:05:52] (03CR) 10BBlack: [C: 03+2] authdns: raise tcp_clients_per_thread to 4K [puppet] - 10https://gerrit.wikimedia.org/r/640219 (https://phabricator.wikimedia.org/T266746) (owner: 10BBlack) [18:07:36] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for jgiannelos - https://phabricator.wikimedia.org/T267585 (10Jgiannelos) [18:15:05] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for jgiannelos - https://phabricator.wikimedia.org/T267585 (10dcipoletti) In case of a manager approval: Once required steps are executed upon, I approve of @Jgiannelos request for access to the `deployment` group. [18:22:06] 10Operations, 10DBA, 10serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (10jijiki) We run the following on mwdebug2002: ` # 'export DEBIAN_FRONTEND=noninteractive; apt-get -y libicu63 libxml2' # 'export DEBIAN_FRONTEND=noninteractive; apt-get install php7.2-bcm... 
[18:25:06] (03PS2) 10Jbond: debian::codename::require: update code to use new debian:: functions [puppet] - 10https://gerrit.wikimedia.org/r/640221 (https://phabricator.wikimedia.org/T266479) [18:27:02] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/640221 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [18:27:33] 10Operations, 10DBA, 10serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (10MoritzMuehlenhoff) That's not enough, these are just the binary packages from the php72 source package, but you also need to upgrade php-apcu, php-cli, php-common, php-excimer, php-geoip,... [18:27:35] (03CR) 10Jbond: debian::codename::require: update code to use new debian:: functions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/640221 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [18:30:54] 10Operations, 10DBA, 10serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (10jijiki) done [18:40:22] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install rdb20[09|10] - https://phabricator.wikimedia.org/T266721 (10jijiki) @Papaul That is fine, thank you! [18:40:33] (03PS1) 10Dave Pifke: [WIP] coal: Use Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/640226 (https://phabricator.wikimedia.org/T267269) [18:55:09] (03CR) 10Jcrespo: "Looking solid, but Stevie should give it a deeper look, as I believe is the person mostly working on puppet right now. Feel free to block/" [puppet] - 10https://gerrit.wikimedia.org/r/640210 (owner: 10Jbond) [19:00:05] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201109T1900). [19:00:05] No GERRIT patches in the queue for this window AFAICS. [19:01:04] "My dear minions, it's time we take the moon!" - despicable me reference? 
[19:04:18] (03PS1) 10MSantos: wikifeeds: bump to 2020-11-09-185007-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/640230 [19:08:56] (03CR) 10MSantos: [C: 03+2] wikifeeds: bump to 2020-11-09-185007-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/640230 (owner: 10MSantos) [19:11:28] (03Merged) 10jenkins-bot: wikifeeds: bump to 2020-11-09-185007-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/640230 (owner: 10MSantos) [19:21:10] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install rdb20[09|10] - https://phabricator.wikimedia.org/T266721 (10Papaul) [19:23:54] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephmon200[12] - https://phabricator.wikimedia.org/T267378 (10Papaul) [19:38:10] (03PS2) 10Hashar: prometheus: collect Gerrit internal metrics [puppet] - 10https://gerrit.wikimedia.org/r/640215 (https://phabricator.wikimedia.org/T184086) [19:39:05] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/640215 (https://phabricator.wikimedia.org/T184086) (owner: 10Hashar) [19:41:11] 10Operations, 10Platform Engineering, 10serviceops, 10Performance-Team (Radar): Phasing out "redis_sessions" MediaWiki cluster - https://phabricator.wikimedia.org/T267581 (10Gilles) [19:41:19] 10Operations, 10Platform Engineering, 10serviceops, 10Performance-Team (Radar): Phasing out "redis_sessions" MediaWiki cluster - https://phabricator.wikimedia.org/T267581 (10Krinkle) [19:42:45] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/640215 (https://phabricator.wikimedia.org/T184086) (owner: 10Hashar) [19:44:09] 10Operations, 10Platform Engineering, 10serviceops, 10Performance-Team (Radar): Phasing out "redis_sessions" MediaWiki cluster - https://phabricator.wikimedia.org/T267581 (10Krinkle) Clarified title and summary based on chat with Effie on IRC. The two remaining uses for a simpler non-replicated are Chrono... 
[19:58:07] !log mbsantos@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [19:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:04] !log mbsantos@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [20:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:12] !log mbsantos@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [20:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:33] noting here that, as per mail and T263182, i'm rolling the train to group2 now. [20:04:33] T263182: 1.36.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T263182 [20:08:12] (03PS1) 10Brennen Bearnes: all wikis to 1.36.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640235 [20:08:14] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.36.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640235 (owner: 10Brennen Bearnes) [20:09:03] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640235 (owner: 10Brennen Bearnes) [20:09:36] (03CR) 10Hashar: "Long overdue, that is to add an extra Prometheus scraper for Gerrit, though we need a plugin to be deployed on Gerrit first." 
[puppet] - 10https://gerrit.wikimedia.org/r/640215 (https://phabricator.wikimedia.org/T184086) (owner: 10Hashar) [20:11:26] (03CR) 10Eileen: [C: 03+1] "I was unable to use docker-pkg on my new laptop without this patch" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/637082 (https://phabricator.wikimedia.org/T266730) (owner: 10Ejegg) [20:12:15] (03CR) 10Eileen: [C: 03+1] "> Patch Set 1: Code-Review+1" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/637082 (https://phabricator.wikimedia.org/T266730) (owner: 10Ejegg) [20:12:24] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.16 [20:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:57] !log mbsantos@deploy1001 Started deploy [kartotherian/deploy@0a38bc5]: Add new target for beta environment and clean-up old envs (T223041 T222377 T255932) [20:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:12] T223041: Install a robots.txt file for maps.wikimedia.org - https://phabricator.wikimedia.org/T223041 [20:13:13] T222377: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 [20:13:13] T255932: Automatic zoom and positioning doesn't work for geomasks in thumbnails - https://phabricator.wikimedia.org/T255932 [20:20:35] PROBLEM - Maps HTTPS on maps1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:21:07] PROBLEM - Maps HTTPS on maps2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.147 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:21:14] ^ maps deployment failed and rollback is failing as well I might need assistance [20:21:47] PROBLEM - Maps HTTPS on maps1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.004 second response time 
https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:21:49] PROBLEM - Maps HTTPS on maps1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:22:39] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian_6533: Servers maps1001.eqiad.wmnet, maps1004.eqiad.wmnet are marked down but pooled: kartotherian-ssl_443: Servers maps1003.eqiad.wmnet, maps1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:23:13] PROBLEM - LVS kartotherian eqiad port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.13 and port 6533: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:23:13] PROBLEM - Maps edge eqiad on upload-lb.eqiad.wikimedia.org is CRITICAL: /private-info/info.json (private tile service info for osm-intl) is CRITICAL: Test private tile service info for osm-intl returned the unexpected status 502 (expecting: 400) https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:23:25] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian_6533: Servers maps1001.eqiad.wmnet, maps1004.eqiad.wmnet are marked down but pooled: kartotherian-ssl_443: Servers maps1003.eqiad.wmnet, maps1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:23:47] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([maps1003.eqiad.wmnet, maps1001.eqiad.wmnet, maps1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [20:23:50] mateusbs17: rollback is failing? how can we help? 
[20:24:21] PROBLEM - LVS kartotherian-ssl eqiad port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 on kartotherian.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:24:21] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([maps1003.eqiad.wmnet, maps1001.eqiad.wmnet, maps1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [20:24:29] cdanis I'm trying to understand why it's failing [20:24:33] !log mbsantos@deploy1001 Finished deploy [kartotherian/deploy@0a38bc5]: Add new target for beta environment and clean-up old envs (T223041 T222377 T255932) (duration: 11m 36s) [20:24:39] PROBLEM - Maps HTTPS on maps1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:47] T223041: Install a robots.txt file for maps.wikimedia.org - https://phabricator.wikimedia.org/T223041 [20:24:48] T222377: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 [20:24:48] T255932: Automatic zoom and positioning doesn't work for geomasks in thumbnails - https://phabricator.wikimedia.org/T255932 [20:24:57] PROBLEM - Maps edge esams on upload-lb.esams.wikimedia.org is CRITICAL: /private-info/info.json (private tile service info for osm-intl) is CRITICAL: Test private tile service info for osm-intl returned the unexpected status 502 (expecting: 400) https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:25:01] First deployment failed for only one instance [20:25:13] !log mbsantos@deploy1001 Started deploy [kartotherian/deploy@0a38bc5]: Add new target for beta environment and clean-up old envs (T223041 T222377 T255932) [20:25:13] PROBLEM - Prometheus jobs 
reduced availability on alert1001 is CRITICAL: job=swagger_check_kartotherian_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:01] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian_6533: Servers maps2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:26:03] RECOVERY - LVS kartotherian-ssl eqiad port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.089 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:26:17] RECOVERY - Maps HTTPS on maps2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.156 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:26:22] !log mbsantos@deploy1001 Finished deploy [kartotherian/deploy@0a38bc5]: Add new target for beta environment and clean-up old envs (T223041 T222377 T255932) (duration: 01m 09s) [20:26:23] RECOVERY - Maps HTTPS on maps1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.365 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:33] mateusbs17: which one failed? 
[20:26:36] mateusbs17: looks like one of the problems is that maps2002 has a full disk [20:26:39] RECOVERY - LVS kartotherian eqiad port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:26:41] RECOVERY - Maps edge eqiad on upload-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:26:43] RECOVERY - Maps edge esams on upload-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:26:46] ah! [20:26:52] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:26:57] RECOVERY - Maps HTTPS on maps1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:27:01] RECOVERY - Maps HTTPS on maps1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 1.909 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:27:13] bblack: cdanis yep, that's the only one failing [20:27:27] I know hnowlan was working on it but don't remember what the status was there [20:27:29] RECOVERY - Maps HTTPS on maps1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:27:42] Maybe we should depool it while I investigate? [20:27:45] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:27:49] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:27:52] it's /var/lib/cassandra, looking [20:27:56] but yeah, maybe depool it [20:28:16] The last Puppet run was at Thu Nov 5 16:57:04 UTC 2020 (5971 minutes ago). 
Puppet is disabled. hnowlan - rebuilding cassandra [20:28:25] bunch of these in there: [20:28:27] -rw------- 1 cassandra cassandra 9853679587 Nov 3 07:15 java_pid22166.hprof [20:28:30] -rw------- 1 cassandra cassandra 10004925809 Nov 4 08:13 java_pid32371.hprof [20:28:36] (multi-gigabyte debug files) [20:28:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:28:57] the /srv directory is also completely full [20:29:10] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephmon200[12] - https://phabricator.wikimedia.org/T267378 (10Papaul) [20:29:31] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:30:03] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:31:05] FYI, hnowlan is on vacation this week but did say he'd be available by phone/text if anything went wrong with the cassandra work he was doing there -- seems like this qualifies, if you need any information from him [20:31:47] some of the other maps servers have ~75/80% fullness on /srv too. 
maps1004 has a 60% full root disk as an outlier as well [20:32:25] that 75-80% fullness on /srv may be the current "normal" [20:32:32] yes [20:32:46] I don't like calling someone while they're on vacation, so if depooling and leaving it is an option, I'd lean toward that, but a week may be a long time -- no strong feelings from me [20:33:20] I vote for also deleting at least a few of those hprof files so the system can regain basic health [20:34:24] bblack: rzl cdanis, the PG DB could also shrink up to 100GB with reindex [20:35:53] once the machine is depooled I can act on it if you want to [20:36:17] !log depool maps2002 [20:36:19] yeah but those java files are clearly some kind of recent profile/debug outputs, and they're huge and probably what's really sending that machine over the edge [20:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:57] any objection to me deleting one of the ~14 or so java hprof files from /var/lib/cassandra? it will free up ~9G and let puppet runs resume, etc [20:38:00] I'm removing the largest [20:38:09] cdanis: ok go ahead :) [20:38:20] Puppet is still explicitly disabled though [20:38:32] hmmm [20:38:50] like, hnowlan turned it off, and I'm not sure what will happen wrt: Cassandra if we turn it back on [20:39:21] yeah [20:40:05] (03Abandoned) 10HitomiAkane: Adding 'eliminator' group to arz.wikipedia and grant permissions to add/remove an 'eliminator' right to the local sysops. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/640089 (https://phabricator.wikimedia.org/T267286) (owner: 10HitomiAkane) [20:41:39] (03PS1) 10Thcipriani: Phabricator: block agressive crawler [puppet] - 10https://gerrit.wikimedia.org/r/640244 (https://phabricator.wikimedia.org/T267603) [20:43:45] I guess we should at least create a reference ticket and downtime the host or something [20:43:56] so someone else doesn't stumble on this tomorrow or whatever [20:44:31] (03CR) 10Jcrespo: [C: 03+2] Phabricator: block agressive crawler [puppet] - 10https://gerrit.wikimedia.org/r/640244 (https://phabricator.wikimedia.org/T267603) (owner: 10Thcipriani) [20:44:38] oh it's already in downtime [20:45:12] (03PS2) 10Volans: Refactoring: rename internal modules [software/spicerack] - 10https://gerrit.wikimedia.org/r/634056 (https://phabricator.wikimedia.org/T221212) [20:45:14] (03PS11) 10Volans: cookbook API: add class API [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) [20:45:45] maybe all that was really missing was the depool, then [20:45:48] yeah, it *had* been pooled [20:45:55] and in fact SAL shows it being depooled [20:45:57] but not being repooled [20:46:00] so, I don't know what happened there [20:46:10] a host without puppet running and with notifications downtimed shouldn't be repooled [20:46:16] perhaps the deployment system repooled it, because it contains some unchecked "depool; deploy; repool" logic? 
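[editor's note] The "depool; deploy; repool" hypothesis raised just above can be illustrated with a small sketch. This is not scap's or conftool's code: the function names and the plain dict standing in for the pool state are hypothetical, and only the state values ("yes", "no", "inactive") come from the log. It contrasts an unchecked wrapper, which always repools, with one that restores whatever state the host had before the deploy.

```python
# Hypothetical sketch: pool_state is a plain dict standing in for a real
# pooling backend; neither function reflects scap's or conftool's actual code.

def deploy_unchecked(pool_state, host, deploy):
    """Depool, deploy, then always repool, ignoring the prior state."""
    pool_state[host] = "no"
    deploy(host)
    pool_state[host] = "yes"  # repools even a host that was deliberately depooled


def deploy_respecting_state(pool_state, host, deploy):
    """Depool, deploy, then restore whatever state the host had before."""
    previous = pool_state[host]
    pool_state[host] = "no"
    try:
        deploy(host)
    finally:
        # A host that was "no" or "inactive" beforehand stays that way.
        pool_state[host] = previous


if __name__ == "__main__":
    state = {"maps2001": "yes", "maps2002": "no"}
    deploy_unchecked(state, "maps2002", lambda host: None)
    print(state["maps2002"])  # the unchecked wrapper has repooled the host
```

With the second variant, a host that an operator depooled by hand before the deploy keeps that state afterwards, which is the behaviour the channel expects from a deployment tool.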
[20:46:16] bblack: I tried to log in the PG DB, but it seems that it's not starting up because of the disk space [20:46:56] (03CR) 10Hashar: [C: 03+1] "Got broken by https://salsa.debian.org/python-debian-team/python-debian/-/commit/da639a2b3dfe0086929d1934d330f34581b3064a which has been " [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/637082 (https://phabricator.wikimedia.org/T266730) (owner: 10Ejegg) [20:48:45] (03CR) 10Volans: "Replies inline" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [20:49:53] 696633M cassandra [20:49:55] 920453M postgresql [20:49:57] not... sure what's in there [20:50:37] either way the downtime claims the host is due to be decommed [20:50:53] so we don't necessarily have to fix the host, we just have to understand how it got repooled, I think [20:51:09] theoretically PG should have up to 800GB after reindex [20:51:34] I'm not sure though if this should be applied to a replica, so probably it's better to leave it [20:52:57] so I guess the next question is if a depool is sufficient to get the host out of scap's (dsh?) list of deployment targets [20:53:01] oh [20:53:20] !log cdanis@cumin1001 conftool action : set/pooled=inactive; selector: name=maps2002.* [20:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:25] I believe that will do that. [20:54:12] deploy did hit the host earlier [20:54:19] Nov 9 20:15:25 maps2002 systemd[1]: Created slice User Slice of deploy-service. [20:54:37] but I guess that would be normal for pooled=no, and still probably shouldn't cause a pooled=yes [20:54:40] ? 
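[editor's note] The java_pid*.hprof heap dumps and the directory sizes quoted above were found by inspecting the disk by hand. A minimal sketch of the same check in Python follows, assuming only the standard library; the function name, the default pattern and the example path are illustrative and not part of any Wikimedia tooling.

```python
from pathlib import Path


def largest_files(root, pattern="*.hprof", limit=5):
    """Return up to `limit` (size_in_bytes, path) pairs under root, largest first."""
    found = [
        (path.stat().st_size, path)
        for path in Path(root).rglob(pattern)
        if path.is_file()
    ]
    # Sort on size only, so equal sizes never fall back to comparing paths.
    found.sort(key=lambda item: item[0], reverse=True)
    return found[:limit]


if __name__ == "__main__":
    # Example path taken from the discussion; adjust as needed.
    for size, path in largest_files("/var/lib/cassandra"):
        print(f"{size / 1e9:8.2f} GB  {path}")
```

Listing candidates before deleting anything keeps the decision with the operator, which matches how the cleanup was handled in the channel.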
[20:54:48] yeah it should copy code to a pooled=no
[20:55:31] as for the thought that it might have been scap that did the repooling, I don't think so -- the only mechanism it has for that is Mediawiki/PHP-specific, and said mechanism is careful to respect preexisting depooledness
[20:56:06] ✔️ cdanis@deploy1001.eqiad.wmnet ~ 🕓🍵 cat /etc/dsh/group/maps
[20:56:09] no longer shows maps2002
[20:56:12] 👍
[20:56:19] Operations, serviceops: upgrade mwmaint1002 to buster - https://phabricator.wikimedia.org/T267607 (jijiki)
[20:56:46] Operations, serviceops: upgrade mwmaint1002 to buster - https://phabricator.wikimedia.org/T267607 (jijiki)
[20:59:46] SAL history:
[20:59:48] 17:14 hnowlan: rebuilding cassandra on maps2002
[20:59:56] 17:15 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: dc=codfw,cluster=maps,service=kartotherian,name=maps2002.codfw.wmnet
[21:00:03] 20:39 hnowlan: finished removenode of maps2002 cassandra
[21:00:04] chrisalbon and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201109T2100).
[21:00:09] ^ this was all back on Nov 5th
[21:00:59] but something changed =no to =yes I think, probably during the deploy, causing the healthcheck spam we saw (the actual deployed changes probably weren't the problem).
[21:01:39] maybe because of the rollback?
[21:01:43] maybe I'm wrong though. there were health failure reports on other nodes, but I don't know all the secret turnings of how the nodes interact with each other
[21:02:17] the failure reports happened after the rollback
[21:02:28] hmmm
[21:02:39] during deployment, only maps2002 was failing
[21:03:05] I had to go on with the deployment to get the machines back to normal
[21:03:42] (PS1) Jcrespo: Revert "Phabricator: block agressive crawler" [puppet] - https://gerrit.wikimedia.org/r/640192
[21:04:54] jouncebot: next
[21:04:54] In 0 hour(s) and 55 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201109T2200)
[21:07:14] (CR) Jcrespo: [C: +2] Revert "Phabricator: block agressive crawler" [puppet] - https://gerrit.wikimedia.org/r/640192 (owner: Jcrespo)
[21:07:52] bblack: what's next now for maps?
[21:08:01] mateusbs17: I think you should be good to continue
[21:08:12] maps2002 is depooled, and, scap won't try to deploy anything there or touch the host
[21:08:23] I think we can sort out what's going on with that next week
[21:08:50] cdanis: cool, thanks
[21:11:45] !log mbsantos@deploy1001 Started deploy [tilerator/deploy@97575e4]: Add new target for beta environment and clean-up old envs (T222377)
[21:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:55] T222377: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377
[21:14:08] !log mbsantos@deploy1001 Finished deploy [tilerator/deploy@97575e4]: Add new target for beta environment and clean-up old envs (T222377) (duration: 02m 23s)
[21:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:16:02] Operations, Platform Engineering, serviceops, Performance-Team (Radar): Phasing out "redis_sessions" MediaWiki cluster - https://phabricator.wikimedia.org/T267581 (jijiki) >>! In T267581#6614385, @Krinkle wrote: > Clarified title and summary based on chat with Effie on IRC. > > The two remaining...
[21:52:37] (PS1) C. Scott Ananian: Turn on formatnum logging [mediawiki-config] - https://gerrit.wikimedia.org/r/640254 (https://phabricator.wikimedia.org/T267587)
[21:53:25] (PS1) CDanis: add httpbb.main to console-scripts entry_points [software/httpbb] - https://gerrit.wikimedia.org/r/640256
[21:54:25] (PS1) CDanis: Add httpbb tests for apple-app-site-association magic URL [puppet] - https://gerrit.wikimedia.org/r/640257 (https://phabricator.wikimedia.org/T259312)
[21:54:35] (PS2) C. Scott Ananian: Turn on formatnum logging [mediawiki-config] - https://gerrit.wikimedia.org/r/640254 (https://phabricator.wikimedia.org/T267587)
[21:54:38] (PS1) Papaul: DHCP: Add MAC address for deploy2002 [puppet] - https://gerrit.wikimedia.org/r/640258 (https://phabricator.wikimedia.org/T266363)
[21:55:50] (CR) jerkins-bot: [V: -1] add httpbb.main to console-scripts entry_points [software/httpbb] - https://gerrit.wikimedia.org/r/640256 (owner: CDanis)
[21:55:57] (CR) Papaul: [C: +2] DHCP: Add MAC address for deploy2002 [puppet] - https://gerrit.wikimedia.org/r/640258 (https://phabricator.wikimedia.org/T266363) (owner: Papaul)
[21:56:22] (PS2) CDanis: add httpbb.main to console-scripts entry_points [software/httpbb] - https://gerrit.wikimedia.org/r/640256
[21:58:03] (PS1) Razzi: oozie: Use admin groups for permissions [puppet] - https://gerrit.wikimedia.org/r/640260 (https://phabricator.wikimedia.org/T262660)
[21:58:57] ejegg: so it turns out my ability to do useful work today was somewhat compromised by too many hours of meetings... is it okay if the patches for T259312 don't go out until next week?
[21:58:58] T259312: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312
[22:00:04] cdanis sure, that's fine. I need to put up a revision to the operations puppet patch anyway
[22:00:04] Reedy and sbassett: Time to snap out of that daydream and deploy Weekly Security deployment window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201109T2200).
[22:00:22] cool, thanks :)
[22:01:19] (CR) Bstorm: [C: +2] [apt::conf] Allow passing integers as value [puppet] - https://gerrit.wikimedia.org/r/639778 (owner: David Caro)
[22:02:30] (PS1) Papaul: Add deploy2002 to site.pp [puppet] - https://gerrit.wikimedia.org/r/640261 (https://phabricator.wikimedia.org/T266363)
[22:03:19] (CR) Papaul: [C: +2] Add deploy2002 to site.pp [puppet] - https://gerrit.wikimedia.org/r/640261 (https://phabricator.wikimedia.org/T266363) (owner: Papaul)
[22:06:56] (PS2) CDanis: Add httpbb tests for apple-app-site-association magic URL [puppet] - https://gerrit.wikimedia.org/r/640257 (https://phabricator.wikimedia.org/T259312)
[22:08:52] (PS1) Thcipriani: Phabricator: block agressive crawler via X-Client-IP [puppet] - https://gerrit.wikimedia.org/r/640262 (https://phabricator.wikimedia.org/T267603)
[22:09:04] (CR) Subramanya Sastry: Turn on formatnum logging (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/640254 (https://phabricator.wikimedia.org/T267587) (owner: C. Scott Ananian)
[22:09:36] Operations, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install deploy2002 - https://phabricator.wikimedia.org/T266363 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` deploy2002.codfw.wmnet ` The log can be found in `/va...
[22:11:48] (CR) RLazarus: [C: +2] Phabricator: block agressive crawler via X-Client-IP [puppet] - https://gerrit.wikimedia.org/r/640262 (https://phabricator.wikimedia.org/T267603) (owner: Thcipriani)
[22:13:52] (CR) Effie Mouzeli: [V: +1 C: +2] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26392" [puppet] - https://gerrit.wikimedia.org/r/638109 (https://phabricator.wikimedia.org/T266155) (owner: Effie Mouzeli)
[22:14:34] (CR) RLazarus: [C: +1] Add httpbb tests for apple-app-site-association magic URL [puppet] - https://gerrit.wikimedia.org/r/640257 (https://phabricator.wikimedia.org/T259312) (owner: CDanis)
[22:22:35] (CR) Razzi: "https://puppet-compiler.wmflabs.org/compiler1002/26391/" [puppet] - https://gerrit.wikimedia.org/r/640260 (https://phabricator.wikimedia.org/T262660) (owner: Razzi)
[22:23:29] (PS3) Cwhite: First attempt at a JSONSchema template generator utility. [software/ecs] - https://gerrit.wikimedia.org/r/637719
[22:24:14] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[22:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:06] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[22:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:29:26] (CR) C. Scott Ananian: Turn on formatnum logging (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/640254 (https://phabricator.wikimedia.org/T267587) (owner: C. Scott Ananian)
[22:30:29] (PS3) C. Scott Ananian: Turn on formatnum logging [mediawiki-config] - https://gerrit.wikimedia.org/r/640254 (https://phabricator.wikimedia.org/T267587)
[22:32:21] (CR) Subramanya Sastry: Turn on formatnum logging (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/640254 (https://phabricator.wikimedia.org/T267587) (owner: C. Scott Ananian)
[22:32:41] (CR) Subramanya Sastry: [C: +1] Turn on formatnum logging [mediawiki-config] - https://gerrit.wikimedia.org/r/640254 (https://phabricator.wikimedia.org/T267587) (owner: C. Scott Ananian)
[22:33:27] (CR) Cwhite: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/640215 (https://phabricator.wikimedia.org/T184086) (owner: Hashar)
[22:34:19] Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install deploy2002 - https://phabricator.wikimedia.org/T266363 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['deploy2002.codfw.wmnet'] ` and were **ALL** successful.
[22:34:39] Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install deploy2002 - https://phabricator.wikimedia.org/T266363 (Papaul)
[22:35:04] Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install deploy2002 - https://phabricator.wikimedia.org/T266363 (Papaul) Open→Resolved complete
[22:39:51] Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install cloudcephmon200[12] - https://phabricator.wikimedia.org/T267378 (Papaul)
[22:39:53] (PS1) Effie Mouzeli: pcc: add '-N' flag to avoid posting PCC result to jenkins [puppet] - https://gerrit.wikimedia.org/r/640265
[22:40:16] (CR) jerkins-bot: [V: -1] pcc: add '-N' flag to avoid posting PCC result to jenkins [puppet] - https://gerrit.wikimedia.org/r/640265 (owner: Effie Mouzeli)
[22:40:18] Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install cloudcephmon200[12] - https://phabricator.wikimedia.org/T267378 (Papaul)
[22:41:37] (PS2) Effie Mouzeli: pcc: add '-N' flag to avoid posting PCC result to jenkins [puppet] - https://gerrit.wikimedia.org/r/640265
[22:53:48] (CR) Ppchelko: Turn on formatnum logging (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/640254 (https://phabricator.wikimedia.org/T267587) (owner: C. Scott Ananian)
[23:08:29] (CR) DannyS712: Turn on formatnum logging (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/640254 (https://phabricator.wikimedia.org/T267587) (owner: C. Scott Ananian)
[23:11:43] Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install cloudcephmon200[12] - https://phabricator.wikimedia.org/T267378 (Papaul) @ayounsi I am working on setting those to servers . 1 is in row C and the other one is in row D. We have the ` cloud-hosts1-b-codfw { vlan-id 2118; } ` in b...
[23:14:45] (CR) Alexandros Kosiaris: [C: +2] Stop using deleted fn Changelog.get_version [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/637082 (https://phabricator.wikimedia.org/T266730) (owner: Ejegg)
[23:14:53] (CR) Alexandros Kosiaris: [C: +2] "Many thanks! Merging" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/637082 (https://phabricator.wikimedia.org/T266730) (owner: Ejegg)
[23:16:13] (Merged) jenkins-bot: Stop using deleted fn Changelog.get_version [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/637082 (https://phabricator.wikimedia.org/T266730) (owner: Ejegg)
[23:22:07] PROBLEM - Long running screen/tmux on mw2221 is CRITICAL: CRIT: Long running tmux process. (user: cdanis PID: 12208, 1738419s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[23:23:57] (PS1) Alexandros Kosiaris: Release 2.1.0 version [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/640268
[23:24:55] (CR) Alexandros Kosiaris: "/me just releasing a new version after a year gathering the 3 commits that were contributed since then" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/640268 (owner: Alexandros Kosiaris)
[23:37:49] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 68 probes of 568 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:43:31] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 38 probes of 568 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:48:26] Operations, ops-codfw, DBA, DC-Ops: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (wiki_willy) a:Papaul
[23:49:33] Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install logstash203[345] - https://phabricator.wikimedia.org/T267420 (wiki_willy) a:Papaul
[23:58:29] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down