[00:04:03] (PS1) Catrope: [WIP] Config changes for Echo kask migration [mediawiki-config] - https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851)
[00:04:54] (CR) jerkins-bot: [V: -1] [WIP] Config changes for Echo kask migration [mediawiki-config] - https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: Catrope)
[00:05:09] (CR) Dzahn: [C: +1] DNS: Remove mgmt DNS for db2035 and db2054 [dns] - https://gerrit.wikimedia.org/r/540728 (owner: Papaul)
[00:08:16] (CR) Papaul: [C: +2] DNS: Remove mgmt DNS for db2035 and db2054 [dns] - https://gerrit.wikimedia.org/r/540728 (owner: Papaul)
[00:10:25] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2054.codfw.wmnet - https://phabricator.wikimedia.org/T232969 (Papaul)
[00:10:53] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Papaul)
[00:10:55] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2054.codfw.wmnet - https://phabricator.wikimedia.org/T232969 (Papaul) Open→Resolved Complete
[00:11:18] Operations, ops-codfw, DC-Ops, decommission: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (Papaul)
[00:11:34] Operations, ops-codfw, DC-Ops, decommission: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (Papaul) Open→Resolved complete
[00:19:14] (PS1) Papaul: DNS: Remove mgmt DNS for ms-be201[345] [dns] - https://gerrit.wikimedia.org/r/540734
[02:38:22] (PS2) Krinkle: Use GTIDs for master position queries for external DB when possible [mediawiki-config] - https://gerrit.wikimedia.org/r/525147 (owner: Aaron Schulz)
[02:38:52] (CR) Krinkle: "If this is good to go, I'd rather have a DBA roll it out, as I'm not able to monitor the impact well. We can be around together if you pre" [mediawiki-config] - https://gerrit.wikimedia.org/r/525147 (owner: Aaron Schulz)
[02:47:19] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 31940832 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:51:01] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 75309496 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:55:57] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2256 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:57:07] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 98680 and 67 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:00:43] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[03:03:59] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[04:19:22] (PS4) CRusnov: Add script to generate DNS records from Netbox [software/netbox-deploy] - https://gerrit.wikimedia.org/r/539013 (https://phabricator.wikimedia.org/T233183)
[05:02:41] (PS1) Marostegui: Revert "db-eqiad.php: Depool es1019" [mediawiki-config] - https://gerrit.wikimedia.org/r/540758
[05:03:07] (Abandoned) Marostegui: Revert "db-eqiad.php: Depool es1019" [mediawiki-config] - https://gerrit.wikimedia.org/r/540758 (owner: Marostegui)
[05:04:18] (PS1) Marostegui: db-eqiad.php: Slowly repool es1019 [mediawiki-config] - https://gerrit.wikimedia.org/r/540759
[05:05:27] (CR) Marostegui: [C: +1] "I am happy to deploy that (or be around if you do it) one day." [mediawiki-config] - https://gerrit.wikimedia.org/r/525147 (owner: Aaron Schulz)
[05:06:26] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool es1019 [mediawiki-config] - https://gerrit.wikimedia.org/r/540759 (owner: Marostegui)
[05:07:13] (Merged) jenkins-bot: db-eqiad.php: Slowly repool es1019 [mediawiki-config] - https://gerrit.wikimedia.org/r/540759 (owner: Marostegui)
[05:08:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool es1019 after on-site maintenance T233698 (duration: 00m 53s)
[05:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:31] T233698: es1019 IPMI and its management interface are unresponsive (again2) - https://phabricator.wikimedia.org/T233698
[05:11:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P9240 and previous config saved to /var/cache/conftool/dbconfig/20191004-051112-marostegui.json
[05:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:05] (PS1) Marostegui: db-eqiad.php: More weight to es1019 [mediawiki-config] - https://gerrit.wikimedia.org/r/540760
[05:47:41] (CR) Marostegui: [C: +2] db-eqiad.php: More weight to es1019 [mediawiki-config] - https://gerrit.wikimedia.org/r/540760 (owner: Marostegui)
[05:48:41] (Merged) jenkins-bot: db-eqiad.php: More weight to es1019 [mediawiki-config] - https://gerrit.wikimedia.org/r/540760 (owner: Marostegui)
[05:49:42] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to es1019 after on-site maintenance T233698 (duration: 00m 51s)
[05:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:46] T233698: es1019 IPMI and its management interface are unresponsive (again2) - https://phabricator.wikimedia.org/T233698
[05:50:33] <_joe_> !log uploading confd 0.16.0 on stretch T147204
[05:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:37] T147204: Update confd package - https://phabricator.wikimedia.org/T147204
[05:53:15] <_joe_> !log upgrading confd on puppetmaster1001 T147204
[05:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:35] (PS1) Marostegui: db-eqiad.php: Fully repool es1019 [mediawiki-config] - https://gerrit.wikimedia.org/r/540761
[05:57:31] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool es1019 [mediawiki-config] - https://gerrit.wikimedia.org/r/540761 (owner: Marostegui)
[05:58:14] (Merged) jenkins-bot: db-eqiad.php: Fully repool es1019 [mediawiki-config] - https://gerrit.wikimedia.org/r/540761 (owner: Marostegui)
[05:59:31] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool es1019 after on-site maintenance T233698 (duration: 00m 51s)
[05:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:46] T233698: es1019 IPMI and its management interface are unresponsive (again2) - https://phabricator.wikimedia.org/T233698
[06:04:54] (PS2) Marostegui: production-m5.sql.erb: Remove grants from designate_pool_manager [puppet] - https://gerrit.wikimedia.org/r/540534 (https://phabricator.wikimedia.org/T233978)
[06:06:55] (CR) Marostegui: [C: +2] production-m5.sql.erb: Remove grants from designate_pool_manager [puppet] - https://gerrit.wikimedia.org/r/540534 (https://phabricator.wikimedia.org/T233978) (owner: Marostegui)
[06:10:06] (PS1) Marostegui: mariadb: Promote db1100 to s5 master [puppet] - https://gerrit.wikimedia.org/r/540762 (https://phabricator.wikimedia.org/T234300)
[06:10:59] (CR) Marostegui: [C: -2] "Wait for the failover day" [puppet] - https://gerrit.wikimedia.org/r/540762 (https://phabricator.wikimedia.org/T234300) (owner: Marostegui)
[06:11:01] (PS1) Marostegui: wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/540763 (https://phabricator.wikimedia.org/T234300)
[06:11:24] (CR) Marostegui: [C: -2] "Wait for the failover day" [dns] - https://gerrit.wikimedia.org/r/540763 (https://phabricator.wikimedia.org/T234300) (owner: Marostegui)
[06:15:34] (PS2) Marostegui: mariadb: Promote db1100 to s5 master [puppet] - https://gerrit.wikimedia.org/r/540762 (https://phabricator.wikimedia.org/T234300)
[06:16:47] !log Deploy schema change on dbstore1005:3316 T233135 T234066
[06:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:53] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[06:16:54] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066
[06:18:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[06:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[06:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:25] Operations, ops-eqiad, DC-Ops, decommission: decommission db1071.eqiad.wmnet - https://phabricator.wikimedia.org/T229381 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1071.eqiad.wmnet` - db1071.eqiad.wmnet (**PASS**) - Downtimed host on Ic...
[06:20:32] (PS1) Marostegui: site.pp: Remove references to db1071 [puppet] - https://gerrit.wikimedia.org/r/540764 (https://phabricator.wikimedia.org/T229381)
[06:21:04] (PS1) Marostegui: wmnet: Remove production entries for db1071 [dns] - https://gerrit.wikimedia.org/r/540765 (https://phabricator.wikimedia.org/T229381)
[06:21:29] (CR) Marostegui: [C: +2] site.pp: Remove references to db1071 [puppet] - https://gerrit.wikimedia.org/r/540764 (https://phabricator.wikimedia.org/T229381) (owner: Marostegui)
[06:21:46] (CR) Marostegui: [C: +2] wmnet: Remove production entries for db1071 [dns] - https://gerrit.wikimedia.org/r/540765 (https://phabricator.wikimedia.org/T229381) (owner: Marostegui)
[06:22:00] <_joe_> !log downgrading confd back to 0.9.0 while some templates get fixed.
[06:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:00] Operations, ops-eqiad, DC-Ops, decommission: decommission db1071.eqiad.wmnet - https://phabricator.wikimedia.org/T229381 (Marostegui) a:RobH→Cmjohnson
[06:23:14] Operations, ops-eqiad, DC-Ops, decommission: decommission db1071.eqiad.wmnet - https://phabricator.wikimedia.org/T229381 (Marostegui) Host ready for on-site steps + switch port disablement
[06:40:58] !log Deploy schema change on db2114 T233135 T234066
[06:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:05] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[06:41:05] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066
[06:59:32] (PS12) Jcrespo: backups: Change file owner of bacula storage&director config [puppet] - https://gerrit.wikimedia.org/r/538239 (https://phabricator.wikimedia.org/T229209)
[07:10:25] (CR) Jcrespo: Add a commit message guide (2 comments) [puppet] - https://gerrit.wikimedia.org/r/540366 (owner: Alexandros Kosiaris)
[07:20:20] (CR) Jcrespo: "> Patch Set 3:" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/540366 (owner: Alexandros Kosiaris)
[07:20:33] (PS1) Elukey: Decommission kerberos1001 [puppet] - https://gerrit.wikimedia.org/r/540768 (https://phabricator.wikimedia.org/T234600)
[07:22:31] (CR) Elukey: [C: +2] Decommission kerberos1001 [puppet] - https://gerrit.wikimedia.org/r/540768 (https://phabricator.wikimedia.org/T234600) (owner: Elukey)
[07:24:00] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[07:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:26] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[07:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:09] uff
[07:26:44] !log execute gnt-instance remove kerberos1001 on ganeti1001 - T234600
[07:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:48] T234600: Decommission kerberos1001 - https://phabricator.wikimedia.org/T234600
[07:27:35] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:28:22] (PS1) Elukey: Remove A/PTR records for kerberos1001 [dns] - https://gerrit.wikimedia.org/r/540775 (https://phabricator.wikimedia.org/T234600)
[07:29:53] (PS2) Elukey: Remove A/PTR records for kerberos1001 [dns] - https://gerrit.wikimedia.org/r/540775 (https://phabricator.wikimedia.org/T234600)
[07:30:59] (CR) Elukey: [C: +2] Remove A/PTR records for kerberos1001 [dns] - https://gerrit.wikimedia.org/r/540775 (https://phabricator.wikimedia.org/T234600) (owner: Elukey)
[07:38:11] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:51:54] ah nice kerberos1001 got removed automagically from netbox
[07:51:57] good!
[08:06:29] Operations, Phabricator, Traffic: Phabricator inaccessible at WikiArabia 2019 - https://phabricator.wikimedia.org/T234598 (Aklapper) a:Aklapper→None Hi, please do not assign tasks to people without their agreement and if they cannot fix these tasks. :) I can imagine that this might be intent...
[08:24:23] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 51.13 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:27:35] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 82.84 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:32:10] <_joe_> !log reuploading the old confd package to stetch-wikimedia, some incompatibility detected
[08:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:38] !log Deploy schema change on db2076 (sanitarium master) with replication T233135 T234066
[08:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:43] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[08:41:44] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066
[08:48:52] Operations, ops-eqiad, Cloud-Services, cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (aborrero) p:Normal→High Raising priority of this ticket, since the ceph project is part of our Q2 goals.
[08:49:13] Operations, ops-eqiad, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (aborrero) p:Triage→High Raising priority of this ticket, since the ceph project is part of our Q2 goals.
[08:51:24] (PS1) Elukey: Release upstream version 1.4.7 [debs/python-kafka] (debian) - https://gerrit.wikimedia.org/r/540809
[08:52:32] Cc: gilles --^
[08:52:47] I am testing this for eventlogging in deployment-prep
[08:52:54] I hope it will fix the rebalance problems
[09:13:17] (CR) Jcrespo: [C: +1] dumps-misc.sh.erb: This script is no longer in use [puppet] - https://gerrit.wikimedia.org/r/540574 (owner: Marostegui)
[09:14:51] (PS2) Jcrespo: dumps-misc.sh.erb: This script is no longer in use [puppet] - https://gerrit.wikimedia.org/r/540574 (owner: Marostegui)
[09:16:01] (PS3) Jcrespo: dumps-misc.sh.erb: Remove old misc db backup script [puppet] - https://gerrit.wikimedia.org/r/540574 (owner: Marostegui)
[09:16:45] (CR) Jcrespo: [C: +1] "Changed commit msg to make it not fail style guide and remove questions." [puppet] - https://gerrit.wikimedia.org/r/540574 (owner: Marostegui)
[09:18:20] (CR) Mobrovac: "There's also some discussion on the linked ticket where Alex valuably points out that if we were to keep RESTBase around even after the sp" [deployment-charts] - https://gerrit.wikimedia.org/r/517557 (https://phabricator.wikimedia.org/T228910) (owner: Jeena Huneidi)
[09:21:45] (CR) Marostegui: "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/540574 (owner: Marostegui)
[09:21:52] (PS4) Marostegui: dumps-misc.sh.erb: Remove old misc db backup script [puppet] - https://gerrit.wikimedia.org/r/540574
[09:22:44] (CR) Marostegui: [C: +2] dumps-misc.sh.erb: Remove old misc db backup script [puppet] - https://gerrit.wikimedia.org/r/540574 (owner: Marostegui)
[09:24:43] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:45:23] (CR) Arturo Borrero Gonzalez: [C: +1] "PCC run seems to make sense https://puppet-compiler.wmflabs.org/compiler1001/18741/" [puppet] - https://gerrit.wikimedia.org/r/540643 (https://phabricator.wikimedia.org/T212302) (owner: Andrew Bogott)
[09:45:55] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:12:46] (PS1) Elukey: profile::kerberos::kdc: add support for bacula backups [puppet] - https://gerrit.wikimedia.org/r/540832 (https://phabricator.wikimedia.org/T226089)
[10:15:16] (CR) jerkins-bot: [V: -1] profile::kerberos::kdc: add support for bacula backups [puppet] - https://gerrit.wikimedia.org/r/540832 (https://phabricator.wikimedia.org/T226089) (owner: Elukey)
[10:16:44] Operations, serviceops, Beta-Cluster-reproducible, User-Joe: Update confd package - https://phabricator.wikimedia.org/T147204 (Joe) I had to roll back confd in reprepro and on puppetmaster1001 because I found a regression (or, a change in behaviour): the `prefix` key in the confd files isn't resp...
[10:19:44] (CR) Elukey: "recheck" [debs/python-kafka] (debian) - https://gerrit.wikimedia.org/r/540809 (owner: Elukey)
[10:39:53] (PS1) Jbond: cumin: add an alias for spare::system [puppet] - https://gerrit.wikimedia.org/r/540837
[10:51:28] (PS4) Mobrovac: restrouter: Revert the initialDelay seconds [deployment-charts] - https://gerrit.wikimedia.org/r/540365 (https://phabricator.wikimedia.org/T223953) (owner: Alexandros Kosiaris)
[10:51:36] (CR) Mobrovac: [V: +2 C: +2] restrouter: Revert the initialDelay seconds [deployment-charts] - https://gerrit.wikimedia.org/r/540365 (https://phabricator.wikimedia.org/T223953) (owner: Alexandros Kosiaris)
[11:01:17] (CR) Jcrespo: "This is stalled until we see how backups will be done finally for analytics dbs." [puppet] - https://gerrit.wikimedia.org/r/538885 (https://phabricator.wikimedia.org/T231208) (owner: Jcrespo)
[11:10:59] (PS1) Mobrovac: RESTRouter: Bump image tag to v1.1.2 and release v0.0.7 [deployment-charts] - https://gerrit.wikimedia.org/r/540841 (https://phabricator.wikimedia.org/T223953)
[11:14:22] (CR) Alexandros Kosiaris: [C: +1] RESTRouter: Bump image tag to v1.1.2 and release v0.0.7 [deployment-charts] - https://gerrit.wikimedia.org/r/540841 (https://phabricator.wikimedia.org/T223953) (owner: Mobrovac)
[11:15:03] (PS3) Jbond: refactor: Refactor script and use the PyYAML [debs/debdeploy] - https://gerrit.wikimedia.org/r/506188
[11:15:31] (CR) Jbond: "Sorry for the delay on this, it seems i address most things a while ago but did'nt hit reply" (7 comments) [debs/debdeploy] - https://gerrit.wikimedia.org/r/506188 (owner: Jbond)
[11:23:20] (PS1) Alexandros Kosiaris: Fully remove scap-helm [puppet] - https://gerrit.wikimedia.org/r/540843 (https://phabricator.wikimedia.org/T212130)
[11:36:26] (PS3) Phedenskog: Grafana: Add external Graphite for synthetic testing [puppet] - https://gerrit.wikimedia.org/r/540572 (https://phabricator.wikimedia.org/T231870)
[11:45:54] (PS1) Arturo Borrero Gonzalez: wmcs: monitoring: cleanup unused puppet code [puppet] - https://gerrit.wikimedia.org/r/540846
[11:48:57] Operations, CX-cxserver, Citoid, Core Platform Team, and 9 others: Make services swagger specs standard compliant - https://phabricator.wikimedia.org/T218217 (mobrovac)
[12:06:51] (PS2) Elukey: profile::kerberos::kdc: add support for bacula backups [puppet] - https://gerrit.wikimedia.org/r/540832 (https://phabricator.wikimedia.org/T226089)
[12:10:13] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1708 MB (3% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-coord1001&var-datasource=eqiad+prometheus/ops
[12:16:10] sigh checking --^
[12:21:27] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1612 MB (3% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-coord1001&var-datasource=eqiad+prometheus/ops
[12:23:05] RECOVERY - Disk space on an-coord1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-coord1001&var-datasource=eqiad+prometheus/ops
[12:23:34] !log cleaned up old files and apt-cache from an-coord1001
[12:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:39] (PS1) Jbond: debdeploy: Change the `--servers` flag to a global flag [debs/debdeploy] - https://gerrit.wikimedia.org/r/540849
[12:24:41] (PS1) Jbond: debdeploy: add support for raw cumin query strings [debs/debdeploy] - https://gerrit.wikimedia.org/r/540850
[12:24:43] (PS1) Jbond: debdeploy: refactor [debs/debdeploy] - https://gerrit.wikimedia.org/r/540851
[12:27:04] (CR) Jbond: debdeploy: refactor (1 comment) [debs/debdeploy] - https://gerrit.wikimedia.org/r/540851 (owner: Jbond)
[12:28:29] !log Deploy schema change on db2097:3316 T233135 T234066
[12:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:35] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[12:28:35] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066
[12:35:40] (CR) Ottomata: [C: +1] Release upstream version 1.4.7 [debs/python-kafka] (debian) - https://gerrit.wikimedia.org/r/540809 (owner: Elukey)
[12:37:49] (PS1) Marostegui: filtered_tables.txt: Add two new columns for abuse_filter_log [puppet] - https://gerrit.wikimedia.org/r/540855 (https://phabricator.wikimedia.org/T234052)
[12:39:38] (CR) Daimona Eaytoy: [C: +1] filtered_tables.txt: Add two new columns for abuse_filter_log [puppet] - https://gerrit.wikimedia.org/r/540855 (https://phabricator.wikimedia.org/T234052) (owner: Marostegui)
[12:42:05] (CR) Marostegui: [C: +2] filtered_tables.txt: Add two new columns for abuse_filter_log [puppet] - https://gerrit.wikimedia.org/r/540855 (https://phabricator.wikimedia.org/T234052) (owner: Marostegui)
[12:57:46] (PS8) Gehel: query_service: rename wdqs module to query_service [puppet] - https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[12:57:48] (PS12) Gehel: query_service: prepare query_service for reusbility [puppet] - https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[12:57:50] (PS9) Gehel: query_service: rename profile/wdqs to profile/query_service [puppet] - https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[12:57:52] (PS4) Gehel: query_service: separate categories from main blazegraph profile [puppet] - https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[12:57:54] (PS4) Gehel: query_service: properly adapt query_service profile [puppet] - https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[12:57:59] (PS4) Gehel: query_service: properly adapt hiera configs [puppet] - https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:01:42] (PS2) Arturo Borrero Gonzalez: wmcs: monitoring: cleanup unused puppet code [puppet] - https://gerrit.wikimedia.org/r/540846
[13:02:39] (PS9) Gehel: query_service: rename wdqs module to query_service [puppet] - https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:02:41] (PS13) Gehel: query_service: prepare query_service for reusbility [puppet] - https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:02:43] (PS10) Gehel: query_service: rename profile/wdqs to profile/query_service [puppet] - https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:02:45] (PS5) Gehel: query_service: separate categories from main blazegraph profile [puppet] - https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:02:47] (PS5) Gehel: query_service: properly adapt query_service profile [puppet] - https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:02:49] (PS5) Gehel: query_service: properly adapt hiera configs [puppet] - https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:08:24] (PS10) Gehel: query_service: rename wdqs module to query_service [puppet] - https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:08:26] (PS14) Gehel: query_service: prepare query_service for reusbility [puppet] - https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:08:28] (PS11) Gehel: query_service: rename profile/wdqs to profile/query_service [puppet] - https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:08:30] (PS6) Gehel: query_service: separate categories from main blazegraph profile [puppet] - https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:08:32] (PS6) Gehel: query_service: properly adapt query_service profile [puppet] - https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:08:34] (PS6) Gehel: query_service: properly adapt hiera configs [puppet] - https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:09:44] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: monitoring: cleanup unused puppet code [puppet] - https://gerrit.wikimedia.org/r/540846 (owner: Arturo Borrero Gonzalez)
[13:11:59] (CR) Giuseppe Lavagetto: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/540366 (owner: Alexandros Kosiaris)
[13:13:08] (PS1) Giuseppe Lavagetto: confd: move all prefix declarations to the files [puppet] - https://gerrit.wikimedia.org/r/540868 (https://phabricator.wikimedia.org/T147204)
[13:13:43] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[13:13:50] (CR) jerkins-bot: [V: -1] confd: move all prefix declarations to the files [puppet] - https://gerrit.wikimedia.org/r/540868 (https://phabricator.wikimedia.org/T147204) (owner: Giuseppe Lavagetto)
[13:15:29] (PS11) Gehel: query_service: rename wdqs module to query_service [puppet] - https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:15:31] (PS15) Gehel: query_service: prepare query_service for reusbility [puppet] - https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:15:33] (PS12) Gehel: query_service: rename profile/wdqs to profile/query_service [puppet] - https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:15:35] (PS7) Gehel: query_service: separate categories from main blazegraph profile [puppet] - https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:15:37] (PS7) Gehel: query_service: properly adapt query_service profile [puppet] - https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:15:39] (PS7) Gehel: query_service: properly adapt hiera configs [puppet] - https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:19:15] Operations, serviceops, Beta-Cluster-reproducible, Patch-For-Review, User-Joe: Update confd package - https://phabricator.wikimedia.org/T147204 (Joe) Templates to verify: [] Redis replication [] Authdns [] Varnish [] Puppetmasters (config-master, mostly unused in production) [] dsh groups
[13:22:39] (PS12) Gehel: query_service: rename wdqs module to query_service [puppet] - https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:22:41] (PS16) Gehel: query_service: prepare query_service for reusbility [puppet] - https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:22:43] (PS13) Gehel: query_service: rename profile/wdqs to profile/query_service [puppet] - https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:22:45] (PS8) Gehel: query_service: separate categories from main blazegraph profile [puppet] - https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:22:47] (PS8) Gehel: query_service: properly adapt query_service profile [puppet] - https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:22:49] (PS8) Gehel: query_service: properly adapt hiera configs [puppet] - https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:26:31] Operations, serviceops, Beta-Cluster-reproducible, Patch-For-Review, User-Joe: Update confd package - https://phabricator.wikimedia.org/T147204 (Joe) Confirmed that `redis::multidc_instance` works as expected with the change of prefix, and the output is the same between the two versions.
[13:28:22] (CR) Gehel: [C: +1] "This now seems to be a noop (https://puppet-compiler.wmflabs.org/compiler1001/18746/)." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:34:07] Operations, Core Platform Team, Editing-team, Parsing-Team, and 9 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (Johan) This has now been added to Tech News.
[13:34:13] (03CR) 10Mobrovac: [C: 03+2] RESTRouter: Bump image tag to v1.1.2 and release v0.0.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/540841 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [13:34:31] (03Merged) 10jenkins-bot: RESTRouter: Bump image tag to v1.1.2 and release v0.0.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/540841 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [13:35:54] 10Operations, 10DBA, 10MediaWiki-Logging, 10Wikimedia-Rdbms, and 5 others: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (10Anomie) [13:36:35] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'restrouter' for release 'staging' . [13:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:23] (03CR) 10Gehel: [C: 04-1] "See a few comments inline" (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [13:40:30] 10Operations, 10ChangeProp, 10DBA, 10MediaWiki-API, and 6 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079 (10Anomie) [13:47:31] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'restrouter' for release 'production' . [13:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:59] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'restrouter' for release 'production' . 
[13:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:03] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:03:21] !log Deploy schema change on db2117 T233135 T234066 [14:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:25] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 [14:03:26] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066 [14:10:15] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:12:37] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:32:29] PROBLEM - Host ms-be1020 is DOWN: PING CRITICAL - Packet loss = 100% [14:36:47] (03PS1) 10Jbond: admin: extend expiry for Shilad Sen [puppet] - 10https://gerrit.wikimedia.org/r/540883 [14:40:04] 10Operations, 10Release Pipeline, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10mobrovac) The start-up time is now pretty good: around 3-5s per worker. However, it seems... [14:43:22] 10Operations, 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) 
- https://phabricator.wikimedia.org/T224585 (10Phamhi) It looks like both grafana and python-whisper are not available on buster ` labmon1002$ apt-cache policy python-whisper python-whisper: Installed: 0.9.15-... [14:44:38] (03CR) 10Jbond: [C: 03+2] admin: extend expiry for Shilad Sen [puppet] - 10https://gerrit.wikimedia.org/r/540883 (owner: 10Jbond) [14:59:05] 10Operations, 10serviceops, 10Beta-Cluster-reproducible, 10Patch-For-Review, 10User-Joe: Update confd package - https://phabricator.wikimedia.org/T147204 (10Joe) Same for authdns - with the prefix change we obtain the same results with the two versions of confd. [15:00:13] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs [15:00:28] <_joe_> uhm lemme see [15:01:15] RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 687 days) https://wikitech.wikimedia.org/wiki/Logs [15:01:32] <_joe_> that is the timer fixing it, ugh :( [15:10:14] (03CR) 10Alexandros Kosiaris: Add a commit message guide (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/540366 (owner: 10Alexandros Kosiaris) [15:12:51] (03PS7) 10Alexandros Kosiaris: Add a commit message guide [puppet] - 10https://gerrit.wikimedia.org/r/540366 [15:14:24] (03CR) 10Alexandros Kosiaris: [C: 03+2] "I 'll be bold here and merge. I 've addressed the various comments. It's anyway just a template and we can always revisit." 
[puppet] - 10https://gerrit.wikimedia.org/r/540366 (owner: 10Alexandros Kosiaris) [15:14:43] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks @everyone for the input" [puppet] - 10https://gerrit.wikimedia.org/r/540366 (owner: 10Alexandros Kosiaris) [15:14:55] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Andrew) (nevermind, I think I see what's happening) [15:35:07] (03PS1) 10Andrew Bogott: cloudbackup2001: update raid config [puppet] - 10https://gerrit.wikimedia.org/r/540898 (https://phabricator.wikimedia.org/T224528) [15:36:18] 10Operations, 10SRE-Access-Requests: Requesting access to 'analytics-privatedata-users' and 'researchers' for Erin Yener - https://phabricator.wikimedia.org/T234529 (10jrobell) Thank you David. Anything that can be done to expedite this process would be most helpful. For clarity, we first asked for this acces... [15:36:30] (03CR) 10Andrew Bogott: [C: 03+2] cloudbackup2001: update raid config [puppet] - 10https://gerrit.wikimedia.org/r/540898 (https://phabricator.wikimedia.org/T224528) (owner: 10Andrew Bogott) [15:38:32] 10Operations, 10Wikimedia-Mailing-lists: disable WMFSF, keep archives - https://phabricator.wikimedia.org/T233883 (10eliza) Hello May we receive a status on this request? 
Thank you, Eliza [15:48:55] (03PS1) 10Andrew Bogott: cloudbackup partman: the second of many changes yet to come [puppet] - 10https://gerrit.wikimedia.org/r/540899 (https://phabricator.wikimedia.org/T224528) [15:50:14] (03CR) 10Andrew Bogott: [C: 03+2] cloudbackup partman: the second of many changes yet to come [puppet] - 10https://gerrit.wikimedia.org/r/540899 (https://phabricator.wikimedia.org/T224528) (owner: 10Andrew Bogott) [15:59:13] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:05:43] (03PS1) 10Andrew Bogott: cloudbackup: more partman tinkering [puppet] - 10https://gerrit.wikimedia.org/r/540906 (https://phabricator.wikimedia.org/T224528) [16:06:55] (03CR) 10Andrew Bogott: [C: 03+2] cloudbackup: more partman tinkering [puppet] - 10https://gerrit.wikimedia.org/r/540906 (https://phabricator.wikimedia.org/T224528) (owner: 10Andrew Bogott) [16:09:43] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:13:23] (03PS1) 10Alexandros Kosiaris: ORES: Make redis AOF configurable [puppet] - 10https://gerrit.wikimedia.org/r/540912 (https://phabricator.wikimedia.org/T233831) [16:19:50] (03PS1) 10Elukey: profile::eventlogging::analytics::server: allow to specify python-kafka version [puppet] - 10https://gerrit.wikimedia.org/r/540915 (https://phabricator.wikimedia.org/T222941) [16:20:34] ottomata: -^ [16:20:53] this is to unblock eventlog05 in deployment-prep (otherwise puppet fails etc..) 
[16:21:47] (03CR) 10Ottomata: [C: 03+1] profile::eventlogging::analytics::server: allow to specify python-kafka version [puppet] - 10https://gerrit.wikimedia.org/r/540915 (https://phabricator.wikimedia.org/T222941) (owner: 10Elukey) [16:22:13] (03PS2) 10Elukey: profile::eventlogging::analytics::server: allow to specify python-kafka version [puppet] - 10https://gerrit.wikimedia.org/r/540915 (https://phabricator.wikimedia.org/T222941) [16:24:02] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/18748/" [puppet] - 10https://gerrit.wikimedia.org/r/540915 (https://phabricator.wikimedia.org/T222941) (owner: 10Elukey) [16:24:51] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 39 probes of 462 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [16:26:24] mmm that one fires quite often [16:26:30] maybe the threshold should be raised [16:27:39] could it be some real problem connecting via ipv6 though? 
[16:30:09] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 24 probes of 462 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [16:36:40] (03PS1) 10Andrew Bogott: cloudbackup: add some comments to partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/540917 (https://phabricator.wikimedia.org/T224528) [16:37:56] (03CR) 10Andrew Bogott: [C: 03+2] cloudbackup: add some comments to partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/540917 (https://phabricator.wikimedia.org/T224528) (owner: 10Andrew Bogott) [16:50:18] (03PS1) 1020after4: Deployment servers: install pygerrit2 for python3 [puppet] - 10https://gerrit.wikimedia.org/r/540920 [16:50:45] (03CR) 10Thcipriani: [C: 03+1] Deployment servers: install pygerrit2 for python3 [puppet] - 10https://gerrit.wikimedia.org/r/540920 (owner: 1020after4) [16:53:08] (03PS1) 10Andrew Bogott: labstore backups: make backup interval configurable with hiera [puppet] - 10https://gerrit.wikimedia.org/r/540921 (https://phabricator.wikimedia.org/T224528) [16:54:22] 10Operations, 10SRE-Access-Requests: Requesting access to analytics cluster for Djellel Difallah - https://phabricator.wikimedia.org/T234473 (10DED) I have signed the Acknowledgement of Wikimedia Server Access Responsibilities Document (L3) [16:55:34] 10Operations, 10SRE-Access-Requests: Requesting access to analytics cluster for Djellel Difallah - https://phabricator.wikimedia.org/T234473 (10leila) [16:58:07] (03PS4) 10CDanis: Grafana: Add external Graphite for synthetic testing [puppet] - 10https://gerrit.wikimedia.org/r/540572 (https://phabricator.wikimedia.org/T231870) (owner: 10Phedenskog) [16:59:54] elukey: the steady-state is just below 35 -- i think 24-26 is about 'usual' [16:59:55] (03PS1) 10Andrew Bogott: cloudbackup2001: make a backup server [puppet] - 10https://gerrit.wikimedia.org/r/540923 (https://phabricator.wikimedia.org/T224528) 
[17:08:47] 10Operations, 10CPT Initiatives (PHP7 (TEC4)), 10HHVM, 10MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Ricordisamoa) I seem to recall //[[ https://wikimediafoundation.org/news/2014/12/29/how-we-made-editing-wikiped... [17:47:27] (03CR) 10Bstorm: [C: 03+1] "Looks like it should be fine, especially since it won't start over the weekend :-D" [puppet] - 10https://gerrit.wikimedia.org/r/540923 (https://phabricator.wikimedia.org/T224528) (owner: 10Andrew Bogott) [17:48:09] (03CR) 10Andrew Bogott: [C: 03+2] labstore backups: make backup interval configurable with hiera [puppet] - 10https://gerrit.wikimedia.org/r/540921 (https://phabricator.wikimedia.org/T224528) (owner: 10Andrew Bogott) [17:52:13] (03CR) 10Andrew Bogott: [C: 03+2] cloudbackup2001: make a backup server [puppet] - 10https://gerrit.wikimedia.org/r/540923 (https://phabricator.wikimedia.org/T224528) (owner: 10Andrew Bogott) [18:12:25] 10Operations, 10Research, 10SRE-Access-Requests: Requesting access to analytics cluster for Djellel Difallah - https://phabricator.wikimedia.org/T234473 (10leila) [18:23:37] 10Operations, 10ops-codfw, 10Cloud-Services: Update block_sync.py script to use rsync --copy-devices - https://phabricator.wikimedia.org/T234683 (10Andrew) [18:34:00] 10Operations, 10ops-codfw, 10Cloud-Services: Build bdsync for Buster, or update block_sync.py script to use rsync --copy-devices - https://phabricator.wikimedia.org/T234683 (10Andrew) [18:46:24] (03CR) 10Dzahn: [C: 04-1] "needs mcrouter certs https://puppet-compiler.wmflabs.org/compiler1001/18751/wtp1025.eqiad.wmnet/change.wtp1025.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/540680 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [18:51:18] 10Operations, 10ops-codfw, 10Cloud-Services: Build bdsync for Buster, or update block_sync.py script to use rsync --copy-devices - https://phabricator.wikimedia.org/T234683 
(10Andrew) @aborrero, do you have intuition about whether packaging bdsync for Buster is hard or easy? [19:00:34] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10Cmjohnson) [19:01:07] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10Cmjohnson) The network switch config is done, the main port is on public vlan and the 2nd port is on private [19:02:44] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (10Cmjohnson) @aborreo I need to know vlan requirements? Same as cephosd? 1 public 1 private? [19:10:57] (03PS1) 10Dzahn: add fake mcrouter certs for ALL parsoid wtp hosts [labs/private] - 10https://gerrit.wikimedia.org/r/540947 (https://phabricator.wikimedia.org/T233654) [19:11:16] 10Operations, 10CPT Initiatives (PHP7 (TEC4)), 10HHVM, 10MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) @Ricordisamoa The switch to Zend PHP 7.2 is not motivated by immediate speed gains. In 2015, we migra... 
[19:12:02] (03CR) 10Dzahn: [V: 03+2 C: 03+2] "so that we can compile puppet changes on any wtp host" [labs/private] - 10https://gerrit.wikimedia.org/r/540947 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [19:12:40] (03PS2) 10Dzahn: add fake mcrouter certs for ALL parsoid wtp hosts [labs/private] - 10https://gerrit.wikimedia.org/r/540947 (https://phabricator.wikimedia.org/T233654) [19:13:09] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake mcrouter certs for ALL parsoid wtp hosts [labs/private] - 10https://gerrit.wikimedia.org/r/540947 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [19:16:29] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1020 - https://phabricator.wikimedia.org/T214778 (10Dzahn) host is shown as down in Icinga [19:17:42] (03PS1) 10Eevans: [WIP]: cassandra config updates for 3.11.4 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/540948 (https://phabricator.wikimedia.org/T200803) [19:18:24] (03CR) 10Eevans: [C: 04-1] "Not yet ready to be merged." [puppet] - 10https://gerrit.wikimedia.org/r/540948 (https://phabricator.wikimedia.org/T200803) (owner: 10Eevans) [19:18:32] 10Operations, 10MediaWiki-Sites, 10Patch-For-Review, 10SEO: URLs for the same title without extra query parameters should have the same canonical link - https://phabricator.wikimedia.org/T67402 (10Nps1337) Thanks you'for you're Information sir i hopful it's make me great soon [[ https://www.vegus666.com/... 
[19:22:01] (03PS2) 10Dzahn: parsoid: turn wtp1025 into eqiad parsoid/php appserver [puppet] - 10https://gerrit.wikimedia.org/r/540680 (https://phabricator.wikimedia.org/T233654) [19:22:30] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18752/wtp1025.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/540680 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [19:23:20] (03PS3) 10Dzahn: parsoid: turn wtp1025 into eqiad parsoid/php appserver [puppet] - 10https://gerrit.wikimedia.org/r/540680 (https://phabricator.wikimedia.org/T233654) [19:23:58] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (10Bstorm) @Cmjohnson Yes, same. [19:27:28] !log wtp1025 - mediawiki appserver classes are being applied, install in progress will trigger some new icinga alerts [19:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:25] (03PS5) 10BryanDavis: sssd: Add a whole duplicate hierarchy of sssd images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/536692 (https://phabricator.wikimedia.org/T229058) (owner: 10Bstorm) [19:49:27] (03PS1) 10BryanDavis: Local testing improvements [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/540955 [19:57:06] PROBLEM - Nginx local proxy to apache on wtp1025 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:57:45] (03CR) 10Bstorm: "Just checking" (032 comments) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/536692 (https://phabricator.wikimedia.org/T229058) (owner: 10Bstorm) [19:58:22] RECOVERY - Nginx local proxy to apache on wtp1025 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 0.610 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:58:50] ACKNOWLEDGEMENT - 
PHP opcache health on wtp1025 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn WIP https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:00:50] (03CR) 10Bstorm: "> Patch Set 5:" (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/536692 (https://phabricator.wikimedia.org/T229058) (owner: 10Bstorm) [20:04:00] (03Abandoned) 10Bstorm: labstore: add visualeditor project to dumps mounts [puppet] - 10https://gerrit.wikimedia.org/r/539437 (https://phabricator.wikimedia.org/T164992) (owner: 10Bstorm) [20:21:40] 10Operations, 10serviceops, 10Patch-For-Review: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (10Dzahn) created mcrouter certs for ALL wtp eqiad and codfw hosts [20:24:19] (03PS1) 10EBernhardson: Repoint mjolnir daemons at deploy directory [puppet] - 10https://gerrit.wikimedia.org/r/540960 [20:25:26] (03PS1) 10Krinkle: static.php: Less visible distinction between HTTP 400 errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540961 (https://phabricator.wikimedia.org/T204186) [20:25:28] (03PS1) 10Krinkle: static.php: Increase cdn ttl of "nohash" urls from 5min to 24h [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540962 [20:25:30] (03PS1) 10Krinkle: static.php: Set "Cache-Control: immutable" on long-cache responses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540963 (https://phabricator.wikimedia.org/T149837) [20:25:32] (03PS1) 10Krinkle: static.php: Explicitly disallow dotfiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540964 (https://phabricator.wikimedia.org/T204186) [20:25:46] (03CR) 10Krinkle: [C: 03+2] static.php: Less visible distinction between HTTP 400 errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540961 (https://phabricator.wikimedia.org/T204186) (owner: 10Krinkle) [20:26:13] (03CR) 10BryanDavis: sssd: Add a whole duplicate hierarchy of sssd images (031 comment) 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/536692 (https://phabricator.wikimedia.org/T229058) (owner: 10Bstorm) [20:26:21] (03CR) 10BryanDavis: [C: 03+2] Local testing improvements [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/540955 (owner: 10BryanDavis) [20:26:53] (03Merged) 10jenkins-bot: Local testing improvements [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/540955 (owner: 10BryanDavis) [20:27:08] (03Merged) 10jenkins-bot: static.php: Less visible distinction between HTTP 400 errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540961 (https://phabricator.wikimedia.org/T204186) (owner: 10Krinkle) [20:28:35] (03CR) 10SBassett: [C: 03+1] "Untested locally, most of this is cleanup work and the strpos() check seems completely sane." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540964 (https://phabricator.wikimedia.org/T204186) (owner: 10Krinkle) [20:29:56] 10Operations, 10Core Platform Team, 10Editing-team, 10Parsing-Team, and 9 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Dcljr) >>! In T120085#5545615, @Johan wrote: > [...] For example https://www.wikidata.org/wiki/Wikidata:Main_Page wo... 
[20:30:46] (03PS2) 10Krinkle: static.php: Increase cdn ttl of "nohash" urls from 5min to 24h [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540962 [20:30:48] (03PS2) 10Krinkle: static.php: Set "Cache-Control: immutable" on long-cache responses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540963 (https://phabricator.wikimedia.org/T149837) [20:30:50] (03PS2) 10Krinkle: static.php: Explicitly disallow dotfiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540964 (https://phabricator.wikimedia.org/T204186) [20:31:28] (03CR) 10jerkins-bot: [V: 04-1] static.php: Increase cdn ttl of "nohash" urls from 5min to 24h [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540962 (owner: 10Krinkle) [20:31:43] (03CR) 10jerkins-bot: [V: 04-1] static.php: Set "Cache-Control: immutable" on long-cache responses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540963 (https://phabricator.wikimedia.org/T149837) (owner: 10Krinkle) [20:32:04] !log gerrit1001 - scp /usr/share/java/mysql-connector-java.jar from cobalt into /usr/share/java/ on gerrit1001 and then symlink into /var/lib/gerrit2/review_site/lib/ (T222391) [20:32:08] paladox: ^ [20:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:09] T222391: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 [20:32:17] (03PS3) 10Krinkle: static.php: Increase cdn ttl of "nohash" urls from 5min to 24h [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540962 [20:32:19] (03PS3) 10Krinkle: static.php: Set "Cache-Control: immutable" on long-cache responses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540963 (https://phabricator.wikimedia.org/T149837) [20:32:21] (03PS3) 10Krinkle: static.php: Explicitly disallow dotfiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540964 (https://phabricator.wikimedia.org/T204186) [20:33:55] (03CR) 10Krinkle: [C: 03+2] static.php: Increase cdn ttl of "nohash" urls 
from 5min to 24h [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540962 (owner: 10Krinkle) [20:34:15] (03PS2) 10Dzahn: DNS: Remove mgmt DNS for ms-be201[345] [dns] - 10https://gerrit.wikimedia.org/r/540734 (owner: 10Papaul) [20:34:33] * Krinkle staging on mwdebug1002 [20:34:47] (03Merged) 10jenkins-bot: static.php: Increase cdn ttl of "nohash" urls from 5min to 24h [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540962 (owner: 10Krinkle) [20:34:59] (03CR) 10Dzahn: [C: 03+2] DNS: Remove mgmt DNS for ms-be201[345] [dns] - 10https://gerrit.wikimedia.org/r/540734 (owner: 10Papaul) [20:35:13] 10Operations, 10Core Platform Team, 10Editing-team, 10Parsing-Team, and 9 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Krinkle) Yes. [20:37:37] (03CR) 10Krinkle: [C: 03+2] static.php: Set "Cache-Control: immutable" on long-cache responses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540963 (https://phabricator.wikimedia.org/T149837) (owner: 10Krinkle) [20:37:43] (03PS2) 10Dzahn: conftool: turn wtp1025 and wtp2001 into test servers [puppet] - 10https://gerrit.wikimedia.org/r/540684 (https://phabricator.wikimedia.org/T233654) [20:38:22] (03Merged) 10jenkins-bot: static.php: Set "Cache-Control: immutable" on long-cache responses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540963 (https://phabricator.wikimedia.org/T149837) (owner: 10Krinkle) [20:38:50] (03CR) 10Krinkle: [C: 03+2] static.php: Explicitly disallow dotfiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540964 (https://phabricator.wikimedia.org/T204186) (owner: 10Krinkle) [20:39:35] (03Merged) 10jenkins-bot: static.php: Explicitly disallow dotfiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540964 (https://phabricator.wikimedia.org/T204186) (owner: 10Krinkle) [20:39:43] (03CR) 10Dzahn: [C: 03+2] Deployment servers: install pygerrit2 for python3 [puppet] - 10https://gerrit.wikimedia.org/r/540920 (owner: 
1020after4) [20:39:52] (03PS2) 10Dzahn: Deployment servers: install pygerrit2 for python3 [puppet] - 10https://gerrit.wikimedia.org/r/540920 (owner: 1020after4) [20:41:11] Mutante: thanks! [20:41:53] !log deploy1001 / deploy2001 - remove python-pygerrit2 (version for python3 is needed instead) [20:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:00] !log krinkle@deploy1001 Synchronized w/static.php: 9648e03, 97d9384 (duration: 00m 53s) [20:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:24] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:44:18] (03PS1) 10Mholloway: [WiP] Update wikifeeds [deployment-charts] - 10https://gerrit.wikimedia.org/r/540967 (https://phabricator.wikimedia.org/T170455) [20:44:24] (03CR) 10Dzahn: "i removed the old package manually on both deploy1001 and deploy2001 and then let puppet install the new one" [puppet] - 10https://gerrit.wikimedia.org/r/540920 (owner: 1020after4) [20:45:03] (03PS1) 10Ottomata: Set up presto single node on analytics1030 in hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/540968 [20:47:02] (03PS2) 10Ottomata: Set up presto single node on analytics1030 in hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/540968 [20:53:58] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:57:43] (03CR) 10Thcipriani: [C: 03+1] "> i removed the old package manually on both deploy1001 and" [puppet] - 10https://gerrit.wikimedia.org/r/540920 (owner: 1020after4) [21:29:41] (03CR) 10Bstorm: sssd: Add a whole duplicate hierarchy of sssd images (031 comment) 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/536692 (https://phabricator.wikimedia.org/T229058) (owner: 10Bstorm) [21:32:31] (03CR) 10Bstorm: [C: 03+2] sssd: Add a whole duplicate hierarchy of sssd images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/536692 (https://phabricator.wikimedia.org/T229058) (owner: 10Bstorm) [21:39:09] (03PS1) 10Esanders: Remove defunct VisualEditorEnableNewMobileContext config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540972 [21:40:09] (03PS2) 10Esanders: Remove defunct VisualEditorEnableNewMobileContext config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540972 [21:50:26] (03PS2) 10Catrope: [WIP] Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) [21:51:15] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: 10Catrope) [22:06:56] !log ms-be1020 - power cycle via mgmt - host down [22:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:46] RECOVERY - Host ms-be1020 is UP: PING WARNING - Packet loss = 73%, RTA = 0.23 ms [22:09:49] (03CR) 10Thcipriani: "couple of nits inline." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/539180 (https://phabricator.wikimedia.org/T227509) (owner: 10Paladox) [22:10:26] PROBLEM - mediawiki-installation DSH group on wtp1025 is CRITICAL: Host wtp1025 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:11:45] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) (owner: 10Filippo Giunchedi) [22:15:35] 10Operations, 10media-storage: ms-be1020 - host went down - https://phabricator.wikimedia.org/T234698 (10Dzahn) [22:16:05] 10Operations, 10media-storage: ms-be1020 - host went down - https://phabricator.wikimedia.org/T234698 (10Dzahn) p:05Triage→03Normal [22:24:10] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 18.88 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:25:08] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 49.72 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:25:18] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 43.8 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:28:22] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 77.45 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:31:42] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: 
(C)60 le (W)70 le 70.93 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:35:26] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 93.49 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:43:28] 10Operations, 10serviceops, 10Patch-For-Review: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (10Dzahn) @ssastry @Joe **wtp1025.eqiad.wmnet** and **wtp2001.codfw.wmnet** are now the 2 hosts selected as the test/benchmarking hosts for parsoid/PHP. They are simply the... [23:00:45] 10Operations, 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) - https://phabricator.wikimedia.org/T224585 (10bd808) @fgiunchedi can you give @Phamhi any tips on trying to get our `role::wmcs::monitoring` working on a Buster host? Is it a lost cause or something that should b... [23:14:09] (03CR) 10Dzahn: [C: 03+1] "please try to find an mw deployer for it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528290 (owner: 10Viztor) [23:15:14] (03CR) 10Dzahn: "assigning to you because i am not sure when this should be merged" [puppet] - 10https://gerrit.wikimedia.org/r/532391 (owner: 10Paladox) [23:19:05] (03CR) 10Dzahn: [C: 03+1] "linked ticket says this is currently "blocked by others" but i don't know the details." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504502 (https://phabricator.wikimedia.org/T167375) (owner: 10Urbanecm) [23:35:44] 10Operations, 10Core Platform Team, 10Editing-team, 10Parsing-Team, and 9 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Bawolff) >>! In T120085#5548420, @Dcljr wrote: >>>! 
In T120085#5545615, @Johan wrote: >> [...] For example https://w... [23:40:18] (03CR) 10Dzahn: [C: 03+1] "works in compiler. lgtm. maybe remove the WIP if you think it's ready? https://puppet-compiler.wmflabs.org/compiler1002/18753/restbase-de" [puppet] - 10https://gerrit.wikimedia.org/r/529074 (https://phabricator.wikimedia.org/T226553) (owner: 10Holger Knust) [23:42:39] (03CR) 10Dzahn: [C: 03+1] "the linked ticket is closed as declined. does this mean this change is as well or does it still make sense, then let's get it approved." [puppet] - 10https://gerrit.wikimedia.org/r/518210 (https://phabricator.wikimedia.org/T220811) (owner: 10Muehlenhoff) [23:43:29] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:45:36] (03CR) 10Dzahn: "i am not sure if this should have been merged since July or is waiting for something" [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/522218 (https://phabricator.wikimedia.org/T222960) (owner: 10Eevans) [23:46:59] (03CR) 10Dzahn: "assigning back to paladox per Hashar's comments" [puppet] - 10https://gerrit.wikimedia.org/r/507072 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [23:49:56] (03CR) 10Dzahn: [C: 04-1] "per IRC. we should be able to fix this on the LDAP server side so that labs is like prod. the difference is somewhere in the openldap conf" [puppet] - 10https://gerrit.wikimedia.org/r/539211 (owner: 10Paladox) [23:54:05] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers