[03:51:46] (PS2) KartikMistry: Update cxserver to 2020-05-22-083137-production [deployment-charts] - https://gerrit.wikimedia.org/r/597999 (https://phabricator.wikimedia.org/T246317)
[03:55:23] (PS3) KartikMistry: Update cxserver to 2020-05-22-083137-production [deployment-charts] - https://gerrit.wikimedia.org/r/597999 (https://phabricator.wikimedia.org/T246317)
[03:55:51] (PS1) Aaron Schulz: arclamp: add svgs for some key entrypoint/singleton methods calls [puppet] - https://gerrit.wikimedia.org/r/598292
[04:00:39] * kart_ updating cxserver..
[04:00:50] (CR) KartikMistry: [C: +2] Update cxserver to 2020-05-22-083137-production [deployment-charts] - https://gerrit.wikimedia.org/r/597999 (https://phabricator.wikimedia.org/T246317) (owner: KartikMistry)
[04:01:11] (Merged) jenkins-bot: Update cxserver to 2020-05-22-083137-production [deployment-charts] - https://gerrit.wikimedia.org/r/597999 (https://phabricator.wikimedia.org/T246317) (owner: KartikMistry)
[04:02:34] !log kartik@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[04:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:04:17] !log kartik@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[04:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:07:58] !log kartik@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[04:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:11:21] !log Updated cxserver to 2020-05-22-083137-production (T246317, T252871)
[04:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:11:26] T252871: Add awawiki to cxserver - https://phabricator.wikimedia.org/T252871
[04:11:26] T246317: Generate template parameter alignments for the selected small wikis II - https://phabricator.wikimedia.org/T246317
[04:40:12] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:41:16] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:45:40] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:46:46] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:53:07] (PS1) Marostegui: dbproxy1018: Depool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/598294 (https://phabricator.wikimedia.org/T249188)
[04:53:35] (CR) Marostegui: [C: +2] dbproxy1018: Depool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/598294 (https://phabricator.wikimedia.org/T249188) (owner: Marostegui)
[04:54:00] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:54:44] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:54:54] !log Depool labsdb1011 - T249188
[04:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:54:58] T249188: Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188
[04:55:50] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:56:34] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:10:15] Operations, serviceops, Patch-For-Review, Performance-Team (Radar): Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (Joe) >>! In T244340#6135546, @Krinkle wrote: >>>! In T244340#5853430, @jijiki wrote: >> The idea i...
[05:11:39] !log Deploy schema change on s6, directly on the master - T253342
[05:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:42] T253342: Apply Babel schema change expanding babel_lang in Wikimedia production - https://phabricator.wikimedia.org/T253342
[05:14:00] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:14:44] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:34:08] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:34:52] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:45:04] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:45:36] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:55:10] (PS1) Privacybatm: Write documentation using Sphinx [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/598295
[05:57:13] the OSPF status seems to be related to the Zayo transport link
[05:58:08] ah nice there is an emergency maintenance
[05:58:26] that is not scheduled in the ops-calendar
[05:59:28] going to ack it
[06:01:18] ACKNOWLEDGEMENT - WDQS high update lag on wdqs1007 is CRITICAL: 7.39e+04 ge 4.32e+04 Elukey Host depooled, blazegraph restarted, waiting for the lag to catch up https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[06:01:21] cc: gehel: --^
[06:05:30] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:06:52] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:25:06] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:28:40] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:28:44] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:29:03] this is again zayo's maintenance --^
[06:29:30] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:33:06] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:37:50] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:42:34] <_joe_> yeah it's been ongoing for quite a bit elukey
[06:45:02] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:45:48] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:49:11] elukey: thanks!
[06:50:32] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:51:18] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:55:03] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/598038 (owner: Elukey)
[06:56:00] (PS1) Marostegui: check_private_data_report: Add Stephen to the list of mails [puppet] - https://gerrit.wikimedia.org/r/598410
[06:58:14] (CR) Marostegui: [C: +2] check_private_data_report: Add Stephen to the list of mails [puppet] - https://gerrit.wikimedia.org/r/598410 (owner: Marostegui)
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200525T0700)
[07:02:19] !log Stop event scheduler on tendril T252331
[07:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:23] T252331: tendril_purge_global_status_log_5m and global_status_log needs more frequent purging - https://phabricator.wikimedia.org/T252331
[07:05:19] (CR) Elukey: [C: +2] alternatives: add a class for the java use case [puppet] - https://gerrit.wikimedia.org/r/598038 (owner: Elukey)
[07:05:52] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:07:00] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:10:49] (CR) Elukey: profile::java: one profile to rule them all (openjdk-x versions) (3 comments) [puppet] - https://gerrit.wikimedia.org/r/597219 (owner: Elukey)
[07:11:35] (PS1) Muehlenhoff: Extend access for PHPBB people by one month [puppet] - https://gerrit.wikimedia.org/r/598411
[07:14:00] (PS2) Muehlenhoff: Extend access for thephp.cc people by one month [puppet] - https://gerrit.wikimedia.org/r/598411
[07:14:18] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:15:00] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:15:39] (CR) Muehlenhoff: [C: +2] Extend access for thephp.cc people by one month [puppet] - https://gerrit.wikimedia.org/r/598411 (owner: Muehlenhoff)
[07:17:56] (PS12) Elukey: profile::java: one profile to rule them all (openjdk-x versions) [puppet] - https://gerrit.wikimedia.org/r/597219
[07:21:40] Operations, Commons, MediaWiki-File-management, MediaWiki-Uploading, and 3 others: Some (recent?) uploads to Commons are not available on other wikis - https://phabricator.wikimedia.org/T253405 (Joe) p:High→Medium I've been monitoring the status of new images in the following way: `lang=...
[07:25:10] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:25:54] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:28:19] (PS3) JMeybohm: Readd wmf.chartid (.metadata.labels.chart) to all resources [deployment-charts] - https://gerrit.wikimedia.org/r/598076
[07:36:25] !log installed linux-imageamd64 on labstore (current meta package for kernels following the Stretch update) T224582
[07:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:29] T224582: Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582
[07:36:39] !log installed linux-image-amd64 on labstore1005 (current meta package for kernels following the Stretch update) T224582
[07:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:18] Operations, Cloud-VPS (Debian Jessie Deprecation), cloud-services-team (Kanban): Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 (MoritzMuehlenhoff) One note for labstore1004: The meta package changed between jessie and stretch: Jessie by default has 3....
[07:44:20] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[07:56:13] (PS4) JMeybohm: restrouter: Remove k8s tokens [puppet] - https://gerrit.wikimedia.org/r/573257 (https://phabricator.wikimedia.org/T242461) (owner: Alexandros Kosiaris)
[07:58:32] (Abandoned) JMeybohm: restrouter: Remove k8s tokens [puppet] - https://gerrit.wikimedia.org/r/573257 (https://phabricator.wikimedia.org/T242461) (owner: Alexandros Kosiaris)
[08:01:28] (PS2) JMeybohm: termbox: enable TLS with chart defaults [deployment-charts] - https://gerrit.wikimedia.org/r/597035 (https://phabricator.wikimedia.org/T235411)
[08:02:42] (CR) JMeybohm: [C: +2] termbox: enable TLS with chart defaults [deployment-charts] - https://gerrit.wikimedia.org/r/597035 (https://phabricator.wikimedia.org/T235411) (owner: JMeybohm)
[08:03:03] (Merged) jenkins-bot: termbox: enable TLS with chart defaults [deployment-charts] - https://gerrit.wikimedia.org/r/597035 (https://phabricator.wikimedia.org/T235411) (owner: JMeybohm)
[08:03:20] (PS1) Filippo Giunchedi: wmnet: add thanos-query svc address [dns] - https://gerrit.wikimedia.org/r/598412 (https://phabricator.wikimedia.org/T252186)
[08:04:16] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 59.18 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:05:41] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'termbox' for release 'staging' .
[08:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:45] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/597219 (owner: Elukey)
[08:05:59] (CR) Filippo Giunchedi: [C: +2] wmnet: add thanos-query svc address [dns] - https://gerrit.wikimedia.org/r/598412 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[08:11:30] (PS1) Filippo Giunchedi: conftool-data: add thanos-swift [puppet] - https://gerrit.wikimedia.org/r/598413 (https://phabricator.wikimedia.org/T252186)
[08:11:32] (PS1) Filippo Giunchedi: hieradata: setup service::catalog entry for thanos-swift [puppet] - https://gerrit.wikimedia.org/r/598414 (https://phabricator.wikimedia.org/T252186)
[08:15:02] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:15:06] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:15:34] (CR) Filippo Giunchedi: [C: +2] conftool-data: add thanos-swift [puppet] - https://gerrit.wikimedia.org/r/598413 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[08:17:14] (PS4) Giuseppe Lavagetto: profile::etcd::tlsproxy: refresh code for modern puppet [puppet] - https://gerrit.wikimedia.org/r/597805 (https://phabricator.wikimedia.org/T97972)
[08:17:16] (PS14) Giuseppe Lavagetto: profile::etcd::tlsproxy: add additional users for pools [puppet] - https://gerrit.wikimedia.org/r/597806 (https://phabricator.wikimedia.org/T97972)
[08:17:18] (PS1) Giuseppe Lavagetto: profile::conftool::client: allow overriding the user root can access [puppet] - https://gerrit.wikimedia.org/r/598415 (https://phabricator.wikimedia.org/T97972)
[08:18:16] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:21:29] !log filippo@cumin1001 conftool action : set/pooled=yes:weight=100; selector: service=thanos-swift
[08:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:29] (PS1) Gilles: Fix Python 3 compatibility [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/598416 (https://phabricator.wikimedia.org/T253375)
[08:23:52] (CR) jerkins-bot: [V: -1] Fix Python 3 compatibility [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/598416 (https://phabricator.wikimedia.org/T253375) (owner: Gilles)
[08:25:21] (PS1) Filippo Giunchedi: hieradata: add thanos-swift to thanos frontend pools [puppet] - https://gerrit.wikimedia.org/r/598417 (https://phabricator.wikimedia.org/T252186)
[08:29:36] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/597219 (owner: Elukey)
[08:39:27] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'termbox' for release 'production' .
[08:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:20] (CR) Gilles: "Any objections to merging this?" [mediawiki-config] - https://gerrit.wikimedia.org/r/594943 (https://phabricator.wikimedia.org/T252108) (owner: Gilles)
[08:47:51] (PS2) JMeybohm: zotero: enable TLS with chart defaults [deployment-charts] - https://gerrit.wikimedia.org/r/597036 (https://phabricator.wikimedia.org/T235411)
[08:48:26] Operations: Reprepro should bail if it can't read and sign using the root keys - https://phabricator.wikimedia.org/T116951 (MoritzMuehlenhoff) a:MoritzMuehlenhoff→None
[08:49:59] (CR) JMeybohm: [C: +2] zotero: enable TLS with chart defaults [deployment-charts] - https://gerrit.wikimedia.org/r/597036 (https://phabricator.wikimedia.org/T235411) (owner: JMeybohm)
[08:50:23] (Merged) jenkins-bot: zotero: enable TLS with chart defaults [deployment-charts] - https://gerrit.wikimedia.org/r/597036 (https://phabricator.wikimedia.org/T235411) (owner: JMeybohm)
[08:50:45] (PS1) Giuseppe Lavagetto: jobrunner: switch mw1337 to envoy [puppet] - https://gerrit.wikimedia.org/r/598418 (https://phabricator.wikimedia.org/T247389)
[08:51:39] (CR) Vgutierrez: [C: +1] icinga: add --sni to check_http --ssl invocations [puppet] - https://gerrit.wikimedia.org/r/597765 (https://phabricator.wikimedia.org/T253292) (owner: Filippo Giunchedi)
[08:52:23] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'termbox' for release 'production' .
[08:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:25] (CR) Vgutierrez: [C: +2] Release 8.0.7-1wm10 [debs/trafficserver] - https://gerrit.wikimedia.org/r/597552 (owner: Vgutierrez)
[08:58:37] (PS1) Ema: Move consumer/producer JSON files to fixture directory [software/prometheus-rdkafka-exporter] - https://gerrit.wikimedia.org/r/598420 (https://phabricator.wikimedia.org/T253197)
[08:58:39] (PS1) Ema: Test Update with actual producer/consumer stats [software/prometheus-rdkafka-exporter] - https://gerrit.wikimedia.org/r/598421 (https://phabricator.wikimedia.org/T253197)
[08:58:41] (PS1) Ema: Fix panic on prometheus.GaugeVec label cardinality mismatch [software/prometheus-rdkafka-exporter] - https://gerrit.wikimedia.org/r/598422 (https://phabricator.wikimedia.org/T253197)
[08:59:25] (CR) Ema: [V: +2 C: +2] Move consumer/producer JSON files to fixture directory [software/prometheus-rdkafka-exporter] - https://gerrit.wikimedia.org/r/598420 (https://phabricator.wikimedia.org/T253197) (owner: Ema)
[08:59:49] (CR) jerkins-bot: [V: -1] Fix panic on prometheus.GaugeVec label cardinality mismatch [software/prometheus-rdkafka-exporter] - https://gerrit.wikimedia.org/r/598422 (https://phabricator.wikimedia.org/T253197) (owner: Ema)
[08:59:59] (CR) Ema: [V: +2 C: +2] Test Update with actual producer/consumer stats [software/prometheus-rdkafka-exporter] - https://gerrit.wikimedia.org/r/598421 (https://phabricator.wikimedia.org/T253197) (owner: Ema)
[09:00:09] (CR) Filippo Giunchedi: [C: +2] icinga: add --sni to check_http --ssl invocations [puppet] - https://gerrit.wikimedia.org/r/597765 (https://phabricator.wikimedia.org/T253292) (owner: Filippo Giunchedi)
[09:00:17] (PS3) Filippo Giunchedi: icinga: add --sni to check_http --ssl invocations [puppet] - https://gerrit.wikimedia.org/r/597765 (https://phabricator.wikimedia.org/T253292)
[09:03:04] (CR) Muehlenhoff: "One comment inline, looks good to me" (1 comment) [debs/mcrouter] - https://gerrit.wikimedia.org/r/596779 (https://phabricator.wikimedia.org/T251574) (owner: Jbond)
[09:04:15] !log turn on sni by default for check_http --ssl icinga invocations - T253292
[09:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:18] T253292: check_http and SNI support - https://phabricator.wikimedia.org/T253292
[09:07:31] (CR) Giuseppe Lavagetto: [C: +2] jobrunner: switch mw1337 to envoy [puppet] - https://gerrit.wikimedia.org/r/598418 (https://phabricator.wikimedia.org/T247389) (owner: Giuseppe Lavagetto)
[09:08:12] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:09:28] (CR) Lucas Werkmeister (WMDE): "Update: we have confirmation that cawiki will be the first real wiki, but no announced deployment date yet." [mediawiki-config] - https://gerrit.wikimedia.org/r/595543 (https://phabricator.wikimedia.org/T232584) (owner: Lucas Werkmeister (WMDE))
[09:10:32] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'zotero' for release 'staging' .
[09:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:23] (PS6) Jbond: docker build: update the build process to us docker [debs/mcrouter] - https://gerrit.wikimedia.org/r/596779 (https://phabricator.wikimedia.org/T251574)
[09:11:47] (CR) Muehlenhoff: [V: +2 C: +2] Add debian/ directory to the build overlay [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/594718 (https://phabricator.wikimedia.org/T233947) (owner: Muehlenhoff)
[09:12:02] (CR) Jbond: docker build: update the build process to us docker (1 comment) [debs/mcrouter] - https://gerrit.wikimedia.org/r/596779 (https://phabricator.wikimedia.org/T251574) (owner: Jbond)
[09:12:55] Operations, Traffic: check_http and SNI support - https://phabricator.wikimedia.org/T253292 (fgiunchedi) Open→Resolved a:fgiunchedi Change deployed, resolving
[09:14:18] (CR) Filippo Giunchedi: [C: +2] hieradata: setup service::catalog entry for thanos-swift [puppet] - https://gerrit.wikimedia.org/r/598414 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[09:15:36] (PS1) Ema: Add minimal.json fixture [software/prometheus-rdkafka-exporter] - https://gerrit.wikimedia.org/r/598426
[09:16:24] (CR) Ema: [V: +2 C: +2] Add minimal.json fixture [software/prometheus-rdkafka-exporter] - https://gerrit.wikimedia.org/r/598426 (owner: Ema)
[09:16:58] (PS2) Filippo Giunchedi: hieradata: add thanos-swift to thanos frontend pools [puppet] - https://gerrit.wikimedia.org/r/598417 (https://phabricator.wikimedia.org/T252186)
[09:17:17] <_joe_> !log migrated mw1337 to use envoy for TLS termination T247389
[09:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:20] T247389: Use envoy for TLS termination on the appservers - https://phabricator.wikimedia.org/T247389
[09:17:37] (CR) Filippo Giunchedi: [C: +2] hieradata: add thanos-swift to thanos frontend pools [puppet] - https://gerrit.wikimedia.org/r/598417 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[09:22:39] (PS1) Filippo Giunchedi: hieradata: set thanos-swift to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/598428 (https://phabricator.wikimedia.org/T252186)
[09:24:05] <_joe_> jouncebot: next
[09:24:05] In 25 hour(s) and 35 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200526T1100)
[09:24:12] <_joe_> ok, good
[09:24:26] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:29:05] (PS2) Giuseppe Lavagetto: appserver: use envoy everywhere [puppet] - https://gerrit.wikimedia.org/r/597244 (https://phabricator.wikimedia.org/T247389)
[09:34:40] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:34:41] (PS1) Ema: Set DH_GOLANG_INSTALL_ALL [software/prometheus-rdkafka-exporter] - https://gerrit.wikimedia.org/r/598430
[09:35:48] (CR) jerkins-bot: [V: -1] Set DH_GOLANG_INSTALL_ALL [software/prometheus-rdkafka-exporter] - https://gerrit.wikimedia.org/r/598430 (owner: Ema)
[09:36:17] (CR) Ema: [V: +2 C: +2] "This failed the right way." [software/prometheus-rdkafka-exporter] - https://gerrit.wikimedia.org/r/598430 (owner: Ema)
[09:36:26] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:37:12] (CR) Filippo Giunchedi: [C: +2] hieradata: set thanos-swift to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/598428 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[09:38:19] (PS2) Ema: Fix panic on prometheus.GaugeVec label cardinality mismatch [software/prometheus-rdkafka-exporter] - https://gerrit.wikimedia.org/r/598422 (https://phabricator.wikimedia.org/T253197)
[09:39:45] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/22717/ says this DTRT." [puppet] - https://gerrit.wikimedia.org/r/597244 (https://phabricator.wikimedia.org/T247389) (owner: Giuseppe Lavagetto)
[09:42:50] <_joe_> !log converting mw1319-1333 to use envoy for TLS termination
[09:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:09] !log upload trafficserver 8.0.7-1wm10 to apt.wm.o (buster)
[09:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:54] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.54:443]) https://wikitech.wikimedia.org/wiki/PyBal
[09:46:10] that's me ^ known
[09:49:24] <_joe_> !log depooled mw1337, it was getting all traffic supposed to go to the jobrunners
[09:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:50] <_joe_> I will have to reduce the duration of persistent connections there
[09:53:17] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-swift_443: Servers thanos-fe2003.codfw.wmnet, thanos-fe2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:56:32] <_joe_> !log transition done
[09:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:49] (CR) Volans: "one comment inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/598415 (https://phabricator.wikimedia.org/T97972) (owner: Giuseppe Lavagetto)
[10:00:19] <_joe_> volans: yeah sorry I just fast-pushed what I left over on friday as I had to work on the envoy conversion
[10:00:34] np :)
[10:00:59] (PS1) Jbond: package_builder: add testing sources to the build hosts [puppet] - https://gerrit.wikimedia.org/r/598433 (https://phabricator.wikimedia.org/T253407)
[10:02:30] (PS1) Filippo Giunchedi: Revert "hieradata: set thanos-swift to lvs_setup" [puppet] - https://gerrit.wikimedia.org/r/598434 (https://phabricator.wikimedia.org/T252186)
[10:02:52] (CR) jerkins-bot: [V: -1] Revert "hieradata: set thanos-swift to lvs_setup" [puppet] - https://gerrit.wikimedia.org/r/598434 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[10:03:00] (CR) Jbond: "PCC https://puppet-compiler.wmflabs.org/compiler1001/22718/deneb.codfw.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/598433 (https://phabricator.wikimedia.org/T253407) (owner: Jbond)
[10:03:53] (CR) Filippo Giunchedi: [C: +2] "jenkins failures are re: commit message length" [puppet] - https://gerrit.wikimedia.org/r/598434 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[10:04:01] (PS1) JMeybohm: admin: Increase maximum Pod memory to 3Gi [deployment-charts] - https://gerrit.wikimedia.org/r/598435 (https://phabricator.wikimedia.org/T235411)
[10:04:16] (CR) Filippo Giunchedi: [V: +2 C: +2] Revert "hieradata: set thanos-swift to lvs_setup" [puppet] - https://gerrit.wikimedia.org/r/598434 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[10:04:49] (CR) Volans: "nits inline" (2 comments) [homer/public] - https://gerrit.wikimedia.org/r/592246 (owner: Ayounsi)
[10:08:56] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:09:01] (CR) Jbond: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/598153 (owner: Volans)
[10:11:50] Operations, Pybal, Traffic: PyBal ProxyFetch failure when talking to Envoy in SNI-only mode - https://phabricator.wikimedia.org/T253527 (fgiunchedi)
[10:12:14] <_joe_> godog: hah!
[10:12:28] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[10:12:58] (CR) Muehlenhoff: package_builder: add testing sources to the build hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/598433 (https://phabricator.wikimedia.org/T253407) (owner: Jbond)
[10:13:35] _joe_: indeed, deploying SNI-only is full of rabbith^Wsurprises
[10:15:37] (PS2) Jbond: package_builder: add unstable sources to the build hosts [puppet] - https://gerrit.wikimedia.org/r/598433 (https://phabricator.wikimedia.org/T253407)
[10:15:50] (CR) Jbond: "updated thanks" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/598433 (https://phabricator.wikimedia.org/T253407) (owner: Jbond)
[10:18:36] (CR) Muehlenhoff: "Looks good, one final bit inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/598433 (https://phabricator.wikimedia.org/T253407) (owner: Jbond)
[10:20:02] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'zotero' for release 'staging' .
[10:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:38] (03PS1) 10Muehlenhoff: Add proxy configuration for the build overlay [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/598437 (https://phabricator.wikimedia.org/T233947) [10:24:25] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598439 (https://phabricator.wikimedia.org/T128546) [10:24:49] (03PS2) 10Muehlenhoff: Add proxy configuration for the build overlay [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/598437 (https://phabricator.wikimedia.org/T233947) [10:26:43] (03CR) 10Jbond: "Looks good, some minor comments" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/598073 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [10:28:52] (03PS3) 10Jbond: package_builder: add unstable sources to the build hosts [puppet] - 10https://gerrit.wikimedia.org/r/598433 (https://phabricator.wikimedia.org/T253407) [10:28:54] (03CR) 10Jbond: "updated" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598433 (https://phabricator.wikimedia.org/T253407) (owner: 10Jbond) [10:30:18] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598439 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:31:02] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598439 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:31:07] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/598437 (https://phabricator.wikimedia.org/T233947) (owner: 10Muehlenhoff) [10:33:19] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:598439| Bumping portals to master (598439)]] (duration: 01m 06s) [10:33:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:34] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/598433 (https://phabricator.wikimedia.org/T253407) (owner: 10Jbond) [10:33:52] (03CR) 10Jbond: [C: 03+2] package_builder: add unstable sources to the build hosts [puppet] - 10https://gerrit.wikimedia.org/r/598433 (https://phabricator.wikimedia.org/T253407) (owner: 10Jbond) [10:33:55] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add proxy configuration for the build overlay [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/598437 (https://phabricator.wikimedia.org/T233947) (owner: 10Muehlenhoff) [10:34:24] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:598439| Bumping portals to master (598439)]] (duration: 01m 05s) [10:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:49] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.54:443]) https://wikitech.wikimedia.org/wiki/PyBal [10:45:55] (03PS2) 10Arturo Borrero Gonzalez: toolforge-kubeadm: kubeadm 1.16 requires docker 18.09 [puppet] - 10https://gerrit.wikimedia.org/r/598093 (https://phabricator.wikimedia.org/T250866) (owner: 10Bstorm) [10:46:53] (03PS1) 10JMeybohm: admin: update tiller in mathoid namespace to 2.16.7-wmf1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/598441 (https://phabricator.wikimedia.org/T252428) [10:47:07] (03CR) 10Arturo Borrero Gonzalez: toolforge-kubeadm: kubeadm 1.16 requires docker 18.09 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/598093 (https://phabricator.wikimedia.org/T250866) (owner: 10Bstorm) [10:49:25] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: Set correct port, fix config indentation. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/598074 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [10:49:36] (03CR) 10jerkins-bot: [V: 04-1] changeprop-jobqueue: Set correct port, fix config indentation. [deployment-charts] - 10https://gerrit.wikimedia.org/r/598074 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [10:53:49] (03CR) 10Ema: [C: 03+2] Fix panic on prometheus.GaugeVec label cardinality mismatch [software/prometheus-rdkafka-exporter] - 10https://gerrit.wikimedia.org/r/598422 (https://phabricator.wikimedia.org/T253197) (owner: 10Ema) [10:55:48] (03PS2) 10Hnowlan: changeprop-jobqueue: Set correct port, fix config indentation. [deployment-charts] - 10https://gerrit.wikimedia.org/r/598074 (https://phabricator.wikimedia.org/T220399) [11:01:36] !log upload prometheus-rdkafka-exporter to buster-wikimedia T253197 [11:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:39] T253197: Implement a prometheus exporter for rdkafka in golang - https://phabricator.wikimedia.org/T253197 [11:03:30] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: Set correct port, fix config indentation. [deployment-charts] - 10https://gerrit.wikimedia.org/r/598074 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [11:03:53] (03Merged) 10jenkins-bot: changeprop-jobqueue: Set correct port, fix config indentation. [deployment-charts] - 10https://gerrit.wikimedia.org/r/598074 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [11:04:35] (03CR) 10Jcrespo: "I am ok with the transfer.py change, but please send it as a different, first patch, and then send the documentation update in a second, d" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (owner: 10Privacybatm) [11:09:26] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . 
[11:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:55] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [11:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:36] (03PS35) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [11:21:48] !log Extend db1141's (temporary labsdb test host) /srv 1TB extra - T249188 [11:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:51] T249188: Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 [11:23:18] !log Extend /srv 1100G on db114[1-9] T252512 [11:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:29] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 [11:27:20] !log Extend /srv 1100G on db213[6-9] T252985 [11:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:25] T252985: Productionize db213[6-9] and db2140 - https://phabricator.wikimedia.org/T252985 [11:27:45] (03PS1) 10Ema: 0.7: use prometheus-rdkafka-exporter [software/atskafka] - 10https://gerrit.wikimedia.org/r/598444 (https://phabricator.wikimedia.org/T253197) [11:28:25] (03CR) 10jerkins-bot: [V: 04-1] 0.7: use prometheus-rdkafka-exporter [software/atskafka] - 10https://gerrit.wikimedia.org/r/598444 (https://phabricator.wikimedia.org/T253197) (owner: 10Ema) [11:28:28] (03PS36) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [11:28:40] (03PS2) 10Ema: 0.7: use prometheus-rdkafka-exporter [software/atskafka] - 10https://gerrit.wikimedia.org/r/598444 (https://phabricator.wikimedia.org/T253197) [11:29:27] (03PS3) 10Ema: 0.7: use prometheus-rdkafka-exporter [software/atskafka] - 10https://gerrit.wikimedia.org/r/598444 (https://phabricator.wikimedia.org/T253197) [11:36:51] <_joe_> 
!log switch mw[1349-1355,1364-1373].eqiad.wmnet to envoy [11:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:59] 10Operations, 10Traffic: atskafka: expose rdkafka metrics to prometheus - https://phabricator.wikimedia.org/T253551 (10ema) [11:39:05] 10Operations, 10Traffic: atskafka: expose rdkafka metrics to prometheus - https://phabricator.wikimedia.org/T253551 (10ema) p:05Triage→03Medium [11:39:43] (03PS4) 10Ema: 0.7: use prometheus-rdkafka-exporter [software/atskafka] - 10https://gerrit.wikimedia.org/r/598444 (https://phabricator.wikimedia.org/T253197) [11:43:49] 10Operations, 10Analytics, 10Analytics-Kanban: Create a profile to standardize the deployment of JVM packages and configurations - https://phabricator.wikimedia.org/T253553 (10elukey) p:05Triage→03Medium [11:46:23] !log uploaded CAS 6.1.5-1 to apt.wikimedia.org T233947 [11:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:26] T233947: CAS build as a deb - https://phabricator.wikimedia.org/T233947 [11:48:27] !log Stop event scheduler on db1115 (tendril) - T252331 [11:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:30] T252331: tendril_purge_global_status_log_5m and global_status_log needs more frequent purging - https://phabricator.wikimedia.org/T252331 [11:50:59] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [11:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:37] 10Operations, 10Analytics, 10Traffic: Remove ganglia leftovers from ops/puppet - https://phabricator.wikimedia.org/T253555 (10ema) [11:52:45] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . 
[11:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:51] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:53:59] 10Operations, 10Analytics, 10Traffic: Remove ganglia leftovers from ops/puppet - https://phabricator.wikimedia.org/T253555 (10ema) p:05Triage→03Low [11:54:38] !log Install a new tendril_purge_global_status_log event on db1115 (tendril) T252331 [11:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:41] T252331: tendril_purge_global_status_log_5m and global_status_log needs more frequent purging - https://phabricator.wikimedia.org/T252331 [11:54:52] 10Operations, 10Analytics, 10Traffic: Remove ganglia leftovers from ops/puppet - https://phabricator.wikimedia.org/T253555 (10ema) [11:57:15] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [11:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:03] (03PS1) 10Ema: Remove varnishkafka::monitor::ganglia [puppet] - 10https://gerrit.wikimedia.org/r/598445 (https://phabricator.wikimedia.org/T253555) [11:58:56] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "3G seems a bit generous but we can fine-tune it later if we want to." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/598435 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [12:00:05] (03PS1) 10Filippo Giunchedi: hieradata: disable SNI-only Envoy for thanos-swift [puppet] - 10https://gerrit.wikimedia.org/r/598446 (https://phabricator.wikimedia.org/T252186) [12:01:09] (03PS2) 10ArielGlenn: per job batches file with locking and methods for claiming jobs etc [dumps] - 10https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396) [12:01:50] <_joe_> !log converting the remaining appservers to use envoy for TLS termination [12:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:14] (03CR) 10Ema: [C: 03+2] "noop https://puppet-compiler.wmflabs.org/compiler1003/22720/" [puppet] - 10https://gerrit.wikimedia.org/r/598445 (https://phabricator.wikimedia.org/T253555) (owner: 10Ema) [12:05:55] 10Operations, 10Analytics, 10Traffic, 10Patch-For-Review: Remove ganglia leftovers from ops/puppet - https://phabricator.wikimedia.org/T253555 (10ema) [12:06:50] (03CR) 10Muehlenhoff: [C: 03+1] "One more comment inline, otherwise LGTM" (031 comment) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/596779 (https://phabricator.wikimedia.org/T251574) (owner: 10Jbond) [12:07:46] (03PS1) 10Elukey: Set BigTop repository config for the Hadoop Test cluster [puppet] - 10https://gerrit.wikimedia.org/r/598450 (https://phabricator.wikimedia.org/T244499) [12:09:18] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[12:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:50] (03PS1) 10DannyS712: Fix a number of typos and duplicated words [puppet] - 10https://gerrit.wikimedia.org/r/598451 (https://phabricator.wikimedia.org/T201491) [12:10:02] (03PS2) 10Elukey: Set BigTop repository config for the Hadoop Test cluster [puppet] - 10https://gerrit.wikimedia.org/r/598450 (https://phabricator.wikimedia.org/T244499) [12:11:19] (03PS2) 10Filippo Giunchedi: hieradata: disable SNI-only Envoy for thanos-swift [puppet] - 10https://gerrit.wikimedia.org/r/598446 (https://phabricator.wikimedia.org/T252186) [12:11:22] 10Operations, 10Core Platform Team, 10MediaWiki-General, 10serviceops, and 3 others: Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (10tstarling) To do: * Deploy the SwiftFileBackend config change, maintaining it at 900 * Reduce MultiHtt... [12:13:40] (03PS2) 10JMeybohm: admin: Increase maximum Pod memory to 3Gi [deployment-charts] - 10https://gerrit.wikimedia.org/r/598435 (https://phabricator.wikimedia.org/T235411) [12:14:12] (03PS2) 10DannyS712: Fix a number of typos and duplicated words [puppet] - 10https://gerrit.wikimedia.org/r/598451 (https://phabricator.wikimedia.org/T201491) [12:14:25] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [12:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:25] 10Operations, 10Traffic, 10Patch-For-Review: Implement a prometheus exporter for rdkafka in golang - https://phabricator.wikimedia.org/T253197 (10ema) 05Open→03Resolved a:03ema The package prometheus-rdkafka-exporter is now available in buster-wikimedia, closing. 
[12:18:25] (03PS1) 10Ema: atskafka: do not write stats to disk [puppet] - 10https://gerrit.wikimedia.org/r/598454 (https://phabricator.wikimedia.org/T253551) [12:18:29] (03CR) 10Ema: [C: 03+2] 0.7: use prometheus-rdkafka-exporter [software/atskafka] - 10https://gerrit.wikimedia.org/r/598444 (https://phabricator.wikimedia.org/T253197) (owner: 10Ema) [12:18:49] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/22722/thanos-fe2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/598446 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [12:24:45] (03PS3) 10Muehlenhoff: Make CAS deployment via a deb toggleable [puppet] - 10https://gerrit.wikimedia.org/r/597228 (https://phabricator.wikimedia.org/T233947) [12:25:22] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: disable SNI-only Envoy for thanos-swift [puppet] - 10https://gerrit.wikimedia.org/r/598446 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [12:25:56] (03CR) 10Elukey: [C: 03+2] Set BigTop repository config for the Hadoop Test cluster [puppet] - 10https://gerrit.wikimedia.org/r/598450 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey) [12:30:00] !log Deploy schema change on s5 directly on the master T253342 [12:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:03] T253342: Apply Babel schema change expanding babel_lang in Wikimedia production - https://phabricator.wikimedia.org/T253342 [12:34:23] (03CR) 10Ema: [C: 03+2] atskafka: do not write stats to disk [puppet] - 10https://gerrit.wikimedia.org/r/598454 (https://phabricator.wikimedia.org/T253551) (owner: 10Ema) [12:35:45] (03PS1) 10Privacybatm: transfer.py: Modularize option_parse function [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598458 [12:36:44] (03PS1) 10Filippo Giunchedi: hieradata: set thanos-swift to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/598459 
(https://phabricator.wikimedia.org/T252186) [12:37:00] (03CR) 10Filippo Giunchedi: [C: 03+1] hieradata: set thanos-swift to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/598459 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [12:37:19] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [12:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:26] PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:37:29] (03PS2) 10Privacybatm: Write documentation using Sphinx [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 [12:39:00] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:40:12] (03PS37) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [12:41:33] (03CR) 10Jbond: [C: 03+2] Fix a number of typos and duplicated words [puppet] - 10https://gerrit.wikimedia.org/r/598451 (https://phabricator.wikimedia.org/T201491) (owner: 10DannyS712) [12:42:21] (03CR) 10Jbond: "merged thanks" [puppet] - 10https://gerrit.wikimedia.org/r/598451 (https://phabricator.wikimedia.org/T201491) (owner: 10DannyS712) [12:43:12] PROBLEM - Check systemd state on an-tool1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:44:03] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: set thanos-swift to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/598459 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [12:44:38] !log upload atskafka 0.7 to buster-wikimedia, upgrade cp3050 T253551 [12:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:42] T253551: atskafka: expose rdkafka metrics to prometheus - https://phabricator.wikimedia.org/T253551 [12:47:16] (03PS1) 10Giuseppe Lavagetto: wmflib: fix the documentation of the role() function [puppet] - 10https://gerrit.wikimedia.org/r/598462 [12:47:18] (03PS1) 10Giuseppe Lavagetto: wmflib: remove deprecated $::_roles variable from the role() function [puppet] - 10https://gerrit.wikimedia.org/r/598463 [12:48:21] (03CR) 10JMeybohm: [C: 03+2] admin: Increase maximum Pod memory to 3Gi [deployment-charts] - 10https://gerrit.wikimedia.org/r/598435 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [12:48:43] (03Merged) 10jenkins-bot: admin: Increase maximum Pod memory to 3Gi [deployment-charts] - 10https://gerrit.wikimedia.org/r/598435 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [12:49:07] (03PS38) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [12:49:28] (03PS1) 10Ema: prometheus: job definition for atskafka [puppet] - 10https://gerrit.wikimedia.org/r/598464 (https://phabricator.wikimedia.org/T253551) [12:51:38] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 56 connections established with conf2001.codfw.wmnet:2379 (min=57) https://wikitech.wikimedia.org/wiki/PyBal [12:52:15] that's me ^ should resolve shortly [12:52:25] !log roll-restart pybal in low-traffic codfw [12:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:29] (03PS39) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 
10https://gerrit.wikimedia.org/r/566559 [12:54:52] PROBLEM - ganeti-noded running on ganeti1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [12:56:38] RECOVERY - ganeti-noded running on ganeti1003 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [12:57:24] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 57 connections established with conf2001.codfw.wmnet:2379 (min=57) https://wikitech.wikimedia.org/wiki/PyBal [12:59:02] (03CR) 10Muehlenhoff: [C: 03+2] Make CAS deployment via a deb toggleable [puppet] - 10https://gerrit.wikimedia.org/r/597228 (https://phabricator.wikimedia.org/T233947) (owner: 10Muehlenhoff) [12:59:27] (03CR) 10Privacybatm: "> Patch Set 1:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (owner: 10Privacybatm) [13:01:08] (03CR) 10Privacybatm: "To have a better look at docs (easily), You can check this: https://transferpydoc.imfast.io/index.html :D" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (owner: 10Privacybatm) [13:01:19] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Merge tag 'debian/1.8.17-1' into debian/buster-wikimedia [debs/doxygen] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/589416 (https://phabricator.wikimedia.org/T242155) (owner: 10Hashar) [13:02:22] PROBLEM - ganeti-noded running on ganeti1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [13:02:30] PROBLEM - ganeti-mond running on ganeti1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [13:02:58] PROBLEM - ganeti-mond running on ganeti1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [13:03:12] PROBLEM - ganeti-confd 
running on ganeti1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [13:03:28] 10Operations, 10DBA: In-place conversion from LVM to normal partition - https://phabricator.wikimedia.org/T252195 (10Kormat) a:03Kormat Here's a ~finished version of the PoC: {P11298} It has error handling, will silently exit if there is no LVM on the machine, and uses low-level lvm commands instead of need... [13:04:10] RECOVERY - ganeti-noded running on ganeti1002 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [13:04:18] RECOVERY - ganeti-mond running on ganeti1001 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [13:06:02] PROBLEM - ganeti-confd running on ganeti1006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [13:06:08] PROBLEM - ganeti-mond running on ganeti1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [13:06:20] PROBLEM - ganeti-noded running on ganeti1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [13:06:20] PROBLEM - ganeti-noded running on ganeti1008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [13:06:32] PROBLEM - ganeti-mond running on ganeti1007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [13:06:40] PROBLEM - ganeti-noded running on ganeti1006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [13:07:17] (03PS2) 10Jcrespo: transfer.py: 
Modularize option_parse function [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598458 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [13:07:18] PROBLEM - ganeti-confd running on ganeti1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [13:07:18] PROBLEM - ganeti-mond running on ganeti1006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [13:07:20] PROBLEM - ganeti-mond running on ganeti1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [13:07:26] PROBLEM - ganeti-noded running on ganeti1007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [13:07:32] PROBLEM - ganeti-confd running on ganeti1007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [13:07:32] 10Operations, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Use envoy for TLS termination on the appservers - https://phabricator.wikimedia.org/T247389 (10Joe) As of today, all appservers use envoy too. 
[13:07:49] (03CR) 10Jcrespo: [C: 03+2] transfer.py: Modularize option_parse function [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598458 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [13:08:06] RECOVERY - ganeti-noded running on ganeti1004 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [13:08:28] RECOVERY - ganeti-noded running on ganeti1006 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [13:08:38] RECOVERY - ganeti-confd running on ganeti1003 is OK: PROCS OK: 1 process with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [13:09:05] !log upgrade ATS to version 8.0.7-1wm11 on cp4026 and cp4032 [13:09:06] RECOVERY - ganeti-confd running on ganeti1004 is OK: PROCS OK: 1 process with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [13:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:08] RECOVERY - ganeti-mond running on ganeti1006 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [13:09:10] RECOVERY - ganeti-mond running on ganeti1002 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [13:09:16] RECOVERY - ganeti-noded running on ganeti1007 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [13:09:22] RECOVERY - ganeti-confd running on ganeti1007 is OK: PROCS OK: 1 process with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [13:09:24] (03Merged) 10jenkins-bot: Merge tag 'debian/1.8.17-1' into debian/buster-wikimedia [debs/doxygen] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/589416 
(https://phabricator.wikimedia.org/T242155) (owner: 10Hashar) [13:09:38] RECOVERY - ganeti-confd running on ganeti1006 is OK: PROCS OK: 1 process with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [13:09:42] RECOVERY - ganeti-mond running on ganeti1005 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [13:09:56] RECOVERY - ganeti-noded running on ganeti1008 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [13:10:06] RECOVERY - ganeti-mond running on ganeti1007 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [13:10:07] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [13:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:12] RECOVERY - ganeti-mond running on ganeti1003 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [13:10:30] (03CR) 10Jcrespo: "I am checking the output of the formatter on this first." 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597569 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [13:11:58] (03PS1) 10Filippo Giunchedi: hieradata: set thanos-swift to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/598468 (https://phabricator.wikimedia.org/T252186) [13:13:44] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: set thanos-swift to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/598468 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [13:14:49] (03PS2) 10Alexandros Kosiaris: kafka-dev: Drop redundant YAML doc starts [deployment-charts] - 10https://gerrit.wikimedia.org/r/598279 [13:14:51] (03PS2) 10Alexandros Kosiaris: rake: Add kubeyaml validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 [13:18:19] (03PS1) 10Muehlenhoff: Remove absented reprepro configs [puppet] - 10https://gerrit.wikimedia.org/r/598469 [13:18:25] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'zotero' for release 'staging' . [13:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:16] RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:22:09] 10Operations, 10DBA: In-place conversion from LVM to normal partition - https://phabricator.wikimedia.org/T252195 (10Marostegui) Thanks for working on this. This is probably not an issue, but worth checking that this works as expected with both HP and Dell controllers. Again, shouldn't be an issue, but worth... 
[13:29:33] (03CR) 10Giuseppe Lavagetto: [C: 03+1] rake: Add kubeyaml validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 (owner: 10Alexandros Kosiaris) [13:29:35] (03PS1) 10Ema: Expose rdkafka prometheus metrics using promrdkafka [software/purged] - 10https://gerrit.wikimedia.org/r/598472 [13:29:40] (03PS1) 10Filippo Giunchedi: hieradata: thanos-swift to production [puppet] - 10https://gerrit.wikimedia.org/r/598473 (https://phabricator.wikimedia.org/T252186) [13:30:29] (03CR) 10jerkins-bot: [V: 04-1] Expose rdkafka prometheus metrics using promrdkafka [software/purged] - 10https://gerrit.wikimedia.org/r/598472 (owner: 10Ema) [13:30:45] (03CR) 10Hashar: "I tried to setup some basic test and might have encountered a couple mistakes (there might be more). Then I am not familiar with puppet t" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [13:31:04] (03CR) 10Giuseppe Lavagetto: [C: 03+1] rake: Add kubeyaml validation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 (owner: 10Alexandros Kosiaris) [13:31:06] (03CR) 10Jcrespo: "Please check the space between 'Bug:' & the ticket number, so we are consistent on all commits." 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (owner: 10Privacybatm) [13:32:14] (03PS2) 10Filippo Giunchedi: hieradata: thanos-swift to production [puppet] - 10https://gerrit.wikimedia.org/r/598473 (https://phabricator.wikimedia.org/T252186) [13:34:33] (03PS1) 10Hashar: java: add some rspec-puppet tests [puppet] - 10https://gerrit.wikimedia.org/r/598474 [13:35:03] (03CR) 10jerkins-bot: [V: 04-1] java: add some rspec-puppet tests [puppet] - 10https://gerrit.wikimedia.org/r/598474 (owner: 10Hashar) [13:35:10] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: thanos-swift to production [puppet] - 10https://gerrit.wikimedia.org/r/598473 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [13:35:24] (03CR) 10Elukey: "Thanks a lot for the catches, sending an updated patch now" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [13:38:13] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/598462 (owner: 10Giuseppe Lavagetto) [13:38:47] (03PS4) 10DCausse: [WIP][wdqs] add a new streaming updater test role [puppet] - 10https://gerrit.wikimedia.org/r/597790 [13:38:54] (03PS1) 10Muehlenhoff: Also deploy production IDPs via deb [puppet] - 10https://gerrit.wikimedia.org/r/598475 (https://phabricator.wikimedia.org/T233947) [13:39:58] (03CR) 10jerkins-bot: [V: 04-1] [WIP][wdqs] add a new streaming updater test role [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse) [13:40:38] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'zotero' for release 'production' . 
[13:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:01] (03PS1) 10Jbond: controller: fix index page creation [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/598476 [13:42:03] (03PS1) 10Jbond: debug_host: add a script for debuging a specific host and change [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/598477 [13:42:05] (03PS1) 10Jbond: Prepare release: 0.7.6 [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/598478 [13:42:29] (03CR) 10jerkins-bot: [V: 04-1] controller: fix index page creation [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/598476 (owner: 10Jbond) [13:42:31] (03CR) 10jerkins-bot: [V: 04-1] debug_host: add a script for debuging a specific host and change [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/598477 (owner: 10Jbond) [13:42:35] (03CR) 10jerkins-bot: [V: 04-1] Prepare release: 0.7.6 [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/598478 (owner: 10Jbond) [13:43:44] (03PS13) 10Elukey: profile::java: one profile to rule them all (openjdk-x versions) [puppet] - 10https://gerrit.wikimedia.org/r/597219 [13:43:45] !log filippo@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=thanos-swift [13:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:57] (03PS1) 10Muehlenhoff: Remove jessie check in Docker registry [puppet] - 10https://gerrit.wikimedia.org/r/598479 [13:45:30] (03PS1) 10Filippo Giunchedi: Add thanos-swift discovery records [dns] - 10https://gerrit.wikimedia.org/r/598480 (https://phabricator.wikimedia.org/T252186) [13:46:14] (03CR) 10Filippo Giunchedi: [C: 03+2] Add thanos-swift discovery records [dns] - 10https://gerrit.wikimedia.org/r/598480 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [13:46:47] <_joe_> !log uploaded doxygen 1.8.17-1 to wikimedia-buster component/ci [13:46:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:49] (03PS2) 10Jbond: controller: fix index page creation [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/598476 [13:46:58] (03PS2) 10Jbond: debug_host: add a script for debuging a specific host and change [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/598477 [13:47:03] (03PS2) 10Jbond: Prepare release: 0.7.6 [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/598478 [13:48:00] (03CR) 10Jbond: [C: 03+2] Prepare release: 0.7.6 [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/598478 (owner: 10Jbond) [13:48:04] (03CR) 10Jbond: [C: 03+2] debug_host: add a script for debuging a specific host and change [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/598477 (owner: 10Jbond) [13:48:07] (03CR) 10Jbond: [C: 03+2] controller: fix index page creation [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/598476 (owner: 10Jbond) [13:48:53] (03PS1) 10Muehlenhoff: Unconditionally enable mod-crs [puppet] - 10https://gerrit.wikimedia.org/r/598482 [13:49:23] 10Operations, 10doxygen, 10Continuous-Integration-Config, 10Developer Productivity, and 3 others: Update Doxygen in CI to 1.8.17 or greater - https://phabricator.wikimedia.org/T242155 (10Joe) The package has been uploaded. 
[13:49:42] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:28] (03PS1) 10Muehlenhoff: role::postgres::common: Remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/598484 [13:53:04] (03PS2) 10Ema: Expose rdkafka prometheus metrics using promrdkafka [software/purged] - 10https://gerrit.wikimedia.org/r/598472 [13:54:48] (03PS14) 10Elukey: profile::java: one profile to rule them all (openjdk-x versions) [puppet] - 10https://gerrit.wikimedia.org/r/597219 [13:57:38] (03PS7) 10Jbond: docker build: update the build process to us docker [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/596779 (https://phabricator.wikimedia.org/T251574) [13:58:15] (03PS3) 10Privacybatm: Write documentation using Sphinx [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) [13:58:55] (03PS1) 10JMeybohm: Enable atomic helm upgrades for all service deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/598487 (https://phabricator.wikimedia.org/T252428) [13:59:09] (03CR) 10Privacybatm: "> Patch Set 2:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [14:00:17] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'zotero' for release 'production' . 
[14:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:26] (03PS3) 10Ema: Expose rdkafka prometheus metrics using promrdkafka [software/purged] - 10https://gerrit.wikimedia.org/r/598472 [14:00:38] (03CR) 10Privacybatm: "> Patch Set 3:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [14:03:04] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/598479 (owner: 10Muehlenhoff) [14:04:20] (03PS1) 10Muehlenhoff: Switch the IDPs to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/598488 (https://phabricator.wikimedia.org/T253553) [14:05:13] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/596779 (https://phabricator.wikimedia.org/T251574) (owner: 10Jbond) [14:06:09] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [14:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:39] (03PS40) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [14:08:17] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [14:09:20] 10Operations, 10doxygen, 10Continuous-Integration-Config, 10Developer Productivity, and 3 others: Update Doxygen in CI to 1.8.17 or greater - https://phabricator.wikimedia.org/T242155 (10hashar) 05Open→03Resolved I have updated all the Jenkins jobs to use the new container. [14:11:15] (03CR) 10Jcrespo: [C: 03+2] "Thanks for this." 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597569 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [14:11:31] (03PS3) 10Jcrespo: transfer.py: Modularize option_parse function [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598458 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [14:13:24] (03CR) 10Jcrespo: [C: 03+2] transfer.py: Modularize option_parse function (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598458 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [14:14:08] (03PS4) 10Jcrespo: Write documentation using Sphinx [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [14:15:09] (03CR) 10Jcrespo: "This patch is larger, so please give me more time for full review. By splitting the previous ones into smaller ones, I am able to merge th" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [14:19:49] (03PS41) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [14:22:23] (03PS42) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [14:29:06] (03PS43) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [14:29:14] (03CR) 10Muehlenhoff: [C: 03+2] Include unzip package on extdist cloud vps tool for composer [puppet] - 10https://gerrit.wikimedia.org/r/598284 (https://phabricator.wikimedia.org/T215713) (owner: 10Brian Wolff) [14:29:37] 10Operations, 10LDAP: Problems accesing superset and horizon.wikimedia.org - https://phabricator.wikimedia.org/T253414 (10Aklapper) [14:33:14] (03PS44) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [14:35:43] (03PS45) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [14:35:48] 10Operations, 
10LDAP: Problems accesing superset and horizon.wikimedia.org - https://phabricator.wikimedia.org/T253414 (10MoritzMuehlenhoff) Are Superset/Horizon the only services not working for you? https://wikitech.wikimedia.org/wiki/LDAP/Groups#wmf_group is a list of all services enabled by your "wmf" grou... [14:38:21] 10Operations: Integrate Buster 10.4 point update - https://phabricator.wikimedia.org/T252394 (10MoritzMuehlenhoff) [14:39:07] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [14:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:12] (03PS5) 10DCausse: [WIP][wdqs] add a new streaming updater test role [puppet] - 10https://gerrit.wikimedia.org/r/597790 [14:47:28] (03PS15) 10Hashar: profile::java: one profile to rule them all (openjdk-x versions) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [14:49:28] (03CR) 10Hashar: [C: 03+1] "I went to write a basic spec suite which caught a few tiny mistakes that are not easy to catch on a manual review which lead to some fixes" [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [14:49:44] (03Abandoned) 10Hashar: java: add some rspec-puppet tests [puppet] - 10https://gerrit.wikimedia.org/r/598474 (owner: 10Hashar) [14:52:17] (03PS46) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [14:52:40] PS46... that's gonna beat my record [14:53:09] (03PS1) 10Hnowlan: changeprop-jobqueue: Correct port used for liveness check. [deployment-charts] - 10https://gerrit.wikimedia.org/r/598493 (https://phabricator.wikimedia.org/T220399) [14:53:24] lol yes this one has taken a few iterations, took a while before i was persuaded to get something local to better test [14:53:57] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . 
[14:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:09] 10Operations, 10Release-Engineering-Team, 10Release-Engineering-Team-TODO, 10Epic, 10Performance Issue: [EPIC] Performance testing environment - https://phabricator.wikimedia.org/T67394 (10Aklapper) 05Stalled→03Open The previous comments don't explain what/who exactly this task is stalled on (["If a... [15:13:14] 10Operations: debian-installer: partman doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 (10Kormat) I have a solution. A script that re-uses the existing `/` (formatted) and `/srv` (retained as-is) partitions by hooking into the partman internals. {P11300} [15:15:08] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [15:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:35] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [15:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:50] (03PS6) 10DCausse: [WIP][wdqs] add a new streaming updater test role [puppet] - 10https://gerrit.wikimedia.org/r/597790 [15:28:29] (03PS2) 10Hnowlan: changeprop-jobqueue: Correct port used for liveness check. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/598493 (https://phabricator.wikimedia.org/T220399) [15:37:33] (03CR) 10Alexandros Kosiaris: [C: 03+1] admin: update tiller in mathoid namespace to 2.16.7-wmf1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/598441 (https://phabricator.wikimedia.org/T252428) (owner: 10JMeybohm) [15:39:52] (03CR) 10JMeybohm: [C: 03+2] admin: update tiller in mathoid namespace to 2.16.7-wmf1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/598441 (https://phabricator.wikimedia.org/T252428) (owner: 10JMeybohm) [15:41:27] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:41:44] (03PS2) 10JMeybohm: admin: update tiller in mathoid namespace to 2.16.7-wmf1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/598441 (https://phabricator.wikimedia.org/T252428) [15:42:52] (03CR) 10Alexandros Kosiaris: rake: Add kubeyaml validation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 (owner: 10Alexandros Kosiaris) [15:43:39] (03PS3) 10Hnowlan: changeprop-jobqueue: Correct port used for liveness check. [deployment-charts] - 10https://gerrit.wikimedia.org/r/598493 (https://phabricator.wikimedia.org/T220399) [15:44:29] (03CR) 10Gehel: [C: 04-1] "Minor comments inline about code duplication." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) (owner: 10EBernhardson) [15:48:43] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:50:42] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: Correct port used for liveness check. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/598493 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [15:51:04] (03Merged) 10jenkins-bot: changeprop-jobqueue: Correct port used for liveness check. [deployment-charts] - 10https://gerrit.wikimedia.org/r/598493 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [15:54:24] 10Operations, 10LDAP: Problems accesing superset and horizon.wikimedia.org - https://phabricator.wikimedia.org/T253414 (10diego) HI @MoritzMuehlenhoff, just tried logtash, but failed too. Can we try to reset my password? [15:55:18] !log disable IX4/6 BGP group on cr4-ulsfo - T237575 [15:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:17] (03PS47) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [16:00:19] (03PS1) 10Jbond: profile::ganeti: quote hiera key with dots [puppet] - 10https://gerrit.wikimedia.org/r/598498 [16:00:20] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [16:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:57] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/598498 (owner: 10Jbond) [16:16:07] !log enable IX4/6 BGP group on cr4-ulsfo - T237575 [16:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:37] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mathoid' for release 'staging' . 
[16:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:41] (03PS2) 10Jbond: profile::ganeti: quote hiera key with dots [puppet] - 10https://gerrit.wikimedia.org/r/598498 [16:23:00] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/598498 (owner: 10Jbond) [16:23:39] (03PS3) 10Jbond: profile::ganeti: use underscore instead of dots [puppet] - 10https://gerrit.wikimedia.org/r/598498 [16:24:05] (03PS48) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [16:24:17] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:24:49] (03PS4) 10Jbond: profile::ganeti: use underscore instead of dots [puppet] - 10https://gerrit.wikimedia.org/r/598498 [16:28:01] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring - https://phabricator.wikimedia.org/T148541 (10fgiunchedi) 05Open→03Resolved All the things are in place now: namely we're collecting SNMP data from the PDUs via `snmp_... 
[16:28:02] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/22736/" [puppet] - 10https://gerrit.wikimedia.org/r/598498 (owner: 10Jbond) [16:28:15] (03PS49) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [16:30:18] (03PS3) 10Ssingh: dnsdist: add a class to install and configure dnsdist [puppet] - 10https://gerrit.wikimedia.org/r/598073 (https://phabricator.wikimedia.org/T252132) [16:32:16] (03PS1) 10JMeybohm: admin: update tiller to 2.16.7-wmf1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/598501 (https://phabricator.wikimedia.org/T252428) [16:32:22] (03PS16) 10Elukey: profile::java: one profile to rule them all (openjdk-x versions) [puppet] - 10https://gerrit.wikimedia.org/r/597219 [16:33:29] (03CR) 10JMeybohm: "New tiller looks good for mathoid (testes deploy, rollback, status, history, list)." [deployment-charts] - 10https://gerrit.wikimedia.org/r/598501 (https://phabricator.wikimedia.org/T252428) (owner: 10JMeybohm) [16:33:36] (03CR) 10Ssingh: ">" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/598073 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:37:18] (03CR) 10Jbond: "lgtm some minor nits" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/598073 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:40:40] (03PS7) 10DCausse: [WIP][wdqs] add a new streaming updater test role [puppet] - 10https://gerrit.wikimedia.org/r/597790 [16:43:58] (03CR) 10Elukey: [C: 03+2] profile::java: one profile to rule them all (openjdk-x versions) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [16:44:57] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_eventgate_main_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:46:45] RECOVERY - Prometheus jobs reduced availability on icinga1001 is 
OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:49:31] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598245 (https://phabricator.wikimedia.org/T252986) (owner: 10RhinosF1) [16:57:10] (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: designate: update zone name for floating IP subnet [puppet] - 10https://gerrit.wikimedia.org/r/598502 (https://phabricator.wikimedia.org/T247972) [16:59:20] (03PS1) 10Arturo Borrero Gonzalez: 57.15.185-in-addr.arpa: refresh zone name of the delegation [dns] - 10https://gerrit.wikimedia.org/r/598503 (https://phabricator.wikimedia.org/T247972) [16:59:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: codfw1dev: designate: update zone name for floating IP subnet [puppet] - 10https://gerrit.wikimedia.org/r/598502 (https://phabricator.wikimedia.org/T247972) (owner: 10Arturo Borrero Gonzalez) [17:00:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] 57.15.185-in-addr.arpa: refresh zone name of the delegation [dns] - 10https://gerrit.wikimedia.org/r/598503 (https://phabricator.wikimedia.org/T247972) (owner: 10Arturo Borrero Gonzalez) [17:05:36] (03CR) 10Ssingh: dnsdist: add a class to install and configure dnsdist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598073 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [17:06:08] (03PS4) 10Ssingh: dnsdist: add a class to install and configure dnsdist [puppet] - 10https://gerrit.wikimedia.org/r/598073 (https://phabricator.wikimedia.org/T252132) [17:07:46] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/22741/" [puppet] - 10https://gerrit.wikimedia.org/r/598073 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [17:12:22] (03PS1) 10Elukey: profile::java: set extra_args with type Optional[String] [puppet] - 10https://gerrit.wikimedia.org/r/598505 
(https://phabricator.wikimedia.org/T253553) [17:13:54] (03CR) 10Elukey: [C: 03+2] profile::java: set extra_args with type Optional[String] [puppet] - 10https://gerrit.wikimedia.org/r/598505 (https://phabricator.wikimedia.org/T253553) (owner: 10Elukey) [17:15:19] RECOVERY - Check systemd state on an-tool1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:18:56] (03PS1) 10Elukey: profile::java: set defaults correctly accoring to their type [puppet] - 10https://gerrit.wikimedia.org/r/598507 (https://phabricator.wikimedia.org/T253553) [17:20:44] (03CR) 10Elukey: [C: 03+2] profile::java: set defaults correctly accoring to their type [puppet] - 10https://gerrit.wikimedia.org/r/598507 (https://phabricator.wikimedia.org/T253553) (owner: 10Elukey) [17:23:13] (03CR) 10Elukey: [C: 03+1] "Pcc looks good! https://puppet-compiler.wmflabs.org/compiler1003/22744/idp1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/598488 (https://phabricator.wikimedia.org/T253553) (owner: 10Muehlenhoff) [17:30:21] (03PS1) 10MarcoAurelio: [WIP][nnwiki] Change category collation to `uca-nn-u-kn` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598509 [17:32:27] (03PS2) 10MarcoAurelio: [nnwiki] Change category collation to `uca-nn-u-kn` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598509 (https://phabricator.wikimedia.org/T253559) [17:57:30] (03PS1) 10Jbond: ignore: testing [puppet] - 10https://gerrit.wikimedia.org/r/598513 [17:57:44] (03PS8) 10DCausse: [WIP][wdqs] add a new streaming updater test role [puppet] - 10https://gerrit.wikimedia.org/r/597790 [18:01:12] (03Abandoned) 10Jbond: ignore: testing [puppet] - 10https://gerrit.wikimedia.org/r/598513 (owner: 10Jbond) [18:12:51] (03PS1) 10Jbond: ignore: testing [puppet] - 10https://gerrit.wikimedia.org/r/598515 [18:16:36] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/598515 (owner: 10Jbond) [18:29:16] 
10Operations, 10Continuous-Integration-Infrastructure, 10Traffic: Caching of https://doc.wikimedia.org/cover/mediawiki-libs-IPUtils/IPUtils.php.html is inconsistent - https://phabricator.wikimedia.org/T252131 (10hashar) doc.wikimedia.org is mostly static files. In the Apache config there is: ` # Lower ca... [18:30:02] (03PS1) 10Ssingh: acme_chief: update configuration to generate a certificate for malmok [puppet] - 10https://gerrit.wikimedia.org/r/598519 (https://phabricator.wikimedia.org/T252132) [18:33:44] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/22750/" [puppet] - 10https://gerrit.wikimedia.org/r/598519 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [18:33:49] (03PS1) 10Jbond: wmcs::monitoring: add profile::grafana::ldap::bind_password: [labs/private] - 10https://gerrit.wikimedia.org/r/598521 [18:34:15] (03CR) 10Jbond: [V: 03+2 C: 03+2] wmcs::monitoring: add profile::grafana::ldap::bind_password: [labs/private] - 10https://gerrit.wikimedia.org/r/598521 (owner: 10Jbond) [19:05:45] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic: Caching of https://doc.wikimedia.org/cover/mediawiki-libs-IPUtils/IPUtils.php.html is inconsistent - https://phabricator.wikimedia.org/T252131 (10Krinkle) Did the inconsistency last for over an hour? If not, I think this is expected given muti... 
[19:41:21] (03Abandoned) 10Jbond: ignore: testing [puppet] - 10https://gerrit.wikimedia.org/r/598515 (owner: 10Jbond) [19:41:50] (03PS1) 10Wolfgang Kandek: Changes for Locust 1.0.1 - version tested before was 0.14 Syntax changes in spec script Different comandline options as well [software/locust] - 10https://gerrit.wikimedia.org/r/598529 [19:42:51] (03CR) 10Krinkle: "Given lack of visual preview both in Gerit, Gitiles and Phab; I've done some visual checks in the file explorer (macOS Finder) and switch " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594943 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [19:44:03] (03CR) 10Krinkle: arclamp: add svgs for some key entrypoint/singleton methods calls (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598292 (owner: 10Aaron Schulz) [19:46:17] RECOVERY - WDQS high update lag on wdqs1007 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.155e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:53:49] (03CR) 10Krinkle: arclamp: add svgs for some key entrypoint/singleton methods calls (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/598292 (owner: 10Aaron Schulz) [20:03:07] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:04:55] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:12:51] (03PS5) 10Jbond: profile::ganeti: refactor hiera [puppet] - 10https://gerrit.wikimedia.org/r/598498 [20:14:12] (03PS50) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [20:20:25] (03PS6) 10Jbond: profile::ganeti: refactor hiera [puppet] - 10https://gerrit.wikimedia.org/r/598498 [20:21:30] (03CR) 10jerkins-bot: [V: 04-1] profile::ganeti: refactor hiera [puppet] - 10https://gerrit.wikimedia.org/r/598498 (owner: 10Jbond) [20:24:05] (03PS7) 10Jbond: profile::ganeti: refactor hiera [puppet] - 10https://gerrit.wikimedia.org/r/598498 [20:25:41] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/598498 (owner: 10Jbond) [20:27:15] (03PS8) 10Jbond: profile::ganeti: refactor hiera [puppet] - 10https://gerrit.wikimedia.org/r/598498 [20:28:35] (03PS9) 10Jbond: profile::ganeti: refactor hiera [puppet] - 10https://gerrit.wikimedia.org/r/598498 [20:31:15] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:36:45] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:43:24] (03CR) 10Jbond: "ok really ready for review now :)" [puppet] - 10https://gerrit.wikimedia.org/r/598498 (owner: 10Jbond) [20:47:57] (03CR) 10Krinkle: arclamp: add svgs for some key entrypoint/singleton methods calls (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/598292 (owner: 10Aaron Schulz)