[00:05:51] Operations, ops-eqiad, Discovery-Search (Current work), Patch-For-Review: (Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 (Jclark-ctr) All host racked and powered, console and network finished for all. Will update netbox and finish id...
[00:07:11] (CR) BryanDavis: maintain-kubeusers: add ability to merge and update configs (3 comments) [puppet] - https://gerrit.wikimedia.org/r/545966 (https://phabricator.wikimedia.org/T236202) (owner: Bstorm)
[00:35:51] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2147.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[01:45:39] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[01:47:02] (CR) Andrew Bogott: [C: +2] Fix labsaliaser script to be executable [puppet] - https://gerrit.wikimedia.org/r/546756 (https://phabricator.wikimedia.org/T235627) (owner: Alex Monk)
[01:47:36] (PS4) Andrew Bogott: Revert "cloudvps: ignore stderr in labs-ip-alias-dump.py" [puppet] - https://gerrit.wikimedia.org/r/546755 (https://phabricator.wikimedia.org/T235627) (owner: Alex Monk)
[01:48:27] (CR) Andrew Bogott: [C: +2] Revert "cloudvps: ignore stderr in labs-ip-alias-dump.py" [puppet] - https://gerrit.wikimedia.org/r/546755 (https://phabricator.wikimedia.org/T235627) (owner: Alex Monk)
[01:49:19] (CR) Andrew Bogott: [C: +2] "I bet that it hasn't been broken since the change, only since we rebuilt the server. But, regardless... let's see what it does now!" [puppet] - https://gerrit.wikimedia.org/r/546756 (https://phabricator.wikimedia.org/T235627) (owner: Alex Monk)
[03:22:36] (PS1) Revi: Enable partial blocks on kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/546780 (https://phabricator.wikimedia.org/T236752)
[03:23:57] (CR) Revi: "WIP pending message translation." [mediawiki-config] - https://gerrit.wikimedia.org/r/546780 (https://phabricator.wikimedia.org/T236752) (owner: Revi)
[03:57:53] Operations, Traffic: varnish-fe is flooding the text backend caching layer with backend probe requests - https://phabricator.wikimedia.org/T236754 (Vgutierrez)
[04:57:11] (CR) Masumrezarock100: [C: +1] Enable partial blocks on kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/546780 (https://phabricator.wikimedia.org/T236752) (owner: Revi)
[05:54:25] Operations, Traffic: Enforce POST size limit on ats-tls - https://phabricator.wikimedia.org/T236755 (Vgutierrez)
[05:59:45] (PS1) Vgutierrez: ATS: Enforce POST size limit of 100mb on ats-tls [puppet] - https://gerrit.wikimedia.org/r/546790 (https://phabricator.wikimedia.org/T236755)
[06:02:34] (CR) Vgutierrez: "PCC looks happy: https://puppet-compiler.wmflabs.org/compiler1002/19116/" [puppet] - https://gerrit.wikimedia.org/r/546790 (https://phabricator.wikimedia.org/T236755) (owner: Vgutierrez)
[06:26:37] <_joe_> !log restart memcached on mc1023 T23518
[06:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:44] T23518: hide username box on Special:UserRights when user cannot change userrights from other users - https://phabricator.wikimedia.org/T23518
[06:27:12] <_joe_> uh what did I do wrong
[06:27:31] <_joe_> paste fail
[06:27:33] <_joe_> w/e
[06:29:15] (CR) Giuseppe Lavagetto: [C: +1] "The change is obviously correct; if we're not on a clock I'd try to work on merging it with @RLazarus and try to test our new tool." [puppet] - https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) (owner: MarcoAurelio)
[06:29:42] (PS7) Giuseppe Lavagetto: scaffold: only expose one port as a service by default [deployment-charts] - https://gerrit.wikimedia.org/r/544629
[06:52:14] Operations, Discovery-Search, vm-requests: setup/install an-airflow1001.eqiad.wmnet on ganeti - https://phabricator.wikimedia.org/T236181 (elukey) Open→Resolved a:elukey All right then, let's close this and re-open if necessary!
[07:01:58] <_joe_> !log restart memcached on mc1024-1036, 1 hour apart, via cumin (T235188)
[07:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:03] T235188: Preemptive refresh in getMultiWithSetCallback() and getMultiWithUnionSetCallback() pollutes cache - https://phabricator.wikimedia.org/T235188
[07:06:48] !log roll restart java daemons on analytics1042, druid1003 and aqs1004 to pick up new openjdk upgrades
[07:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:47] !log installing php7.3 security updates
[07:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:53] Operations, Traffic: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled - https://phabricator.wikimedia.org/T236458 (Vgutierrez) Resolved→Open Reopening cause the issue hasn't been solved as it can be seen here: https://grafana.wikimedia.org/d/ivPJtZAWz/t236458?or...
[07:44:55] FFS
[07:44:56] Operations, Traffic: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[07:48:11] Operations, Wikimedia-Logstash, observability, service-runner, and 2 others: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (elukey)
[07:48:15] Operations, Analytics, Analytics-Kanban, Wikimedia-Logstash, and 6 others: Move AQS logging to new logging pipeline - https://phabricator.wikimedia.org/T219928 (elukey) Open→Stalled Pending T236757
[07:58:36] (CR) Muehlenhoff: "deploy-design also needs to be added under access::groups for hieradata/role/common/deployment_server.yaml" [puppet] - https://gerrit.wikimedia.org/r/546303 (https://phabricator.wikimedia.org/T236518) (owner: Dzahn)
[08:06:05] !log restarting ats-tls on cp5007 with TCP FASTOPEN disabled - T236458
[08:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:11] T236458: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled - https://phabricator.wikimedia.org/T236458
[08:14:58] Operations, netops: cr3-esams crash - https://phabricator.wikimedia.org/T236598 (ayounsi) > We have found matching PR1179822, Chassisd might crash if lo0 filter is configured without allowing communication between RE and VM-host on RE. As a result,the internal interfaces are incorrectly examined by lo0 f...
[08:15:26] !log push term allow_vmhost ro cr3-esams loopback4 filter - T236598
[08:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:32] T236598: cr3-esams crash - https://phabricator.wikimedia.org/T236598
[08:17:09] seems like it's loading up the hardware now
[08:17:57] still a `warning: Could not connect to re1 : No route to host` on commit check
[08:19:36] I'll let all the hardware boot up
[08:20:51] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 47, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:21:03] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:21:06] RECOVERY - Host cr3-esams is UP: PING OK - Packet loss = 0%, RTA = 84.22 ms
[08:21:13] RECOVERY - OSPF status on mr1-esams is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:21:29] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:21:52] yas ^
[08:22:01] RECOVERY - OSPF status on cr2-knams is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:22:03] :D
[08:22:10] (PS1) Elukey: eventlogging::dependencies: absent python2 dependencies [puppet] - https://gerrit.wikimedia.org/r/546830 (https://phabricator.wikimedia.org/T233231)
[08:22:37] I saw that :-)
[08:23:21] (CR) Elukey: [C: +2] eventlogging::dependencies: absent python2 dependencies [puppet] - https://gerrit.wikimedia.org/r/546830 (https://phabricator.wikimedia.org/T233231) (owner: Elukey)
[08:24:02] Operations, netops: cr3-esams crash - https://phabricator.wikimedia.org/T236598 (ayounsi) All the interfaces are back up and cr3-esams is now reachable and in service. One issue persists, re0 can't reach re1: ` ayounsi@re0.cr3-esams# commit check warning: Could not connect to re1 : No route to host war...
[08:24:51] RECOVERY - Host cr3-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 86.07 ms
[08:25:17] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 88, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:26:53] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 95, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:27:44] (PS1) Elukey: eventlogging::dependencies: avoid absent for python-dateutil [puppet] - https://gerrit.wikimedia.org/r/546832
[08:28:05] (CR) Elukey: [C: +2] eventlogging::dependencies: avoid absent for python-dateutil [puppet] - https://gerrit.wikimedia.org/r/546832 (owner: Elukey)
[08:30:18] I am going to depool db1099 instances in preparation for later pdu work
[08:30:34] we still have one link down, but it's a redundant one: "et-1/0/0 up down Core: asw2-esams:et-6/0/50 {#20049} [40Gbps DF]"
[08:31:45] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 88, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:33:20] XioNoX: o/ is it a link to one onf the spines?
[08:33:25] *one of
[08:33:33] er, did everybody got paged to say that cr3-esams was back up?
[08:33:39] I was yes
[08:33:42] yeah :-D
[08:33:42] :(
[08:33:57] well a recovery page is not that bad :)
[08:33:58] maybe there should be a delay like if something recoevers after X amount of time it doesn't page
[08:34:29] XioNoX: that can be configured
[08:34:48] by increasing soft tries or time between them
[08:35:23] (PS1) Elukey: eventlogging::dependencies: remove py2 dependencies [puppet] - https://gerrit.wikimedia.org/r/546837 (https://phabricator.wikimedia.org/T233231)
[08:35:49] !log jynus@cumin1001 dbctl commit (dc=all): 'Depool db1099', diff saved to https://phabricator.wikimedia.org/P9492 and previous config saved to /var/cache/conftool/dbconfig/20191029-083547-jynus.json
[08:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:17] (CR) Elukey: [C: +2] eventlogging::dependencies: remove py2 dependencies [puppet] - https://gerrit.wikimedia.org/r/546837 (https://phabricator.wikimedia.org/T233231) (owner: Elukey)
[08:43:35] !log shutting down db1099 T227538
[08:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:40] T227538: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538
[08:53:23] (PS1) Daniel Kinzler: Re-apply: MCR: Set testwiki to use the new MCR-only schema [mediawiki-config] - https://gerrit.wikimedia.org/r/546875
[08:53:50] Operations, ops-eqiad, DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 (jcrespo) db1099 is depooled and down, please ping me on IRC when done so I can put it up. dbs area all ready for maintenance.
[08:54:07] (PS2) Daniel Kinzler: Re-apply: MCR: Set testwiki to use the new MCR-only schema [mediawiki-config] - https://gerrit.wikimedia.org/r/546875
[08:55:00] (PS3) Daniel Kinzler: Re-apply: MCR: Set testwiki to use the new MCR-only schema [mediawiki-config] - https://gerrit.wikimedia.org/r/546875 (https://phabricator.wikimedia.org/T198558)
[09:10:52] (CR) Jcrespo: "> Patch Set 5: Code-Review+1" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/546217 (https://phabricator.wikimedia.org/T234900) (owner: Jcrespo)
[09:19:00] (PS1) Filippo Giunchedi: install_server: use Buster for elastic 7 cluster [puppet] - https://gerrit.wikimedia.org/r/546876 (https://phabricator.wikimedia.org/T234854)
[09:21:38] (CR) Filippo Giunchedi: [C: +2] install_server: use Buster for elastic 7 cluster [puppet] - https://gerrit.wikimedia.org/r/546876 (https://phabricator.wikimedia.org/T234854) (owner: Filippo Giunchedi)
[09:27:28] !log restart ats-tls on cp5007 disabling TCP SO_LINGER - T236458
[09:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:34] T236458: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled - https://phabricator.wikimedia.org/T236458
[09:27:56] !log reimage elastic 7 hw with Buster
[09:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:59] !log plugin upgrade on relforge - T236123
[09:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:04] T236123: Deploy the experimental highlighter v6.5.4.1 to production - https://phabricator.wikimedia.org/T236123
[09:35:26] (PS1) Gehel: elasticsearch: elasticsearch package renamed to elasticsearch-oss [cookbooks] - https://gerrit.wikimedia.org/r/546877
[09:36:35] (CR) DCausse: [C: +1] elasticsearch: elasticsearch package renamed to elasticsearch-oss [cookbooks] - https://gerrit.wikimedia.org/r/546877 (owner: Gehel)
[09:37:16] (PS2) Filippo Giunchedi: Introduce Elastic 7 support [puppet] - https://gerrit.wikimedia.org/r/545867 (https://phabricator.wikimedia.org/T234854)
[09:39:11] (CR) Filippo Giunchedi: Introduce Elastic 7 support (2 comments) [puppet] - https://gerrit.wikimedia.org/r/545867 (https://phabricator.wikimedia.org/T234854) (owner: Filippo Giunchedi)
[09:41:17] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[09:48:13] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[09:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:17] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[09:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:24] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[09:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:17] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:39] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[09:50:39] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[09:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:37] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[09:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:21] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:23] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:40] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[09:55:40] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[09:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:26] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:11] Operations, Patch-For-Review: reinstall RT server with private IP and Buster - https://phabricator.wikimedia.org/T180641 (faidon) >>! In T180641#5612728, @Dzahn wrote: > @Faidon RT is fixed Login works again. Thank you!
[09:57:44] (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/546260 (https://phabricator.wikimedia.org/T236505) (owner: Cwhite)
[09:59:21] (PS1) Jbond: puppet_compiler: manage facts dir permissions [puppet] - https://gerrit.wikimedia.org/r/546883 (https://phabricator.wikimedia.org/T236717)
[10:00:02] (CR) Giuseppe Lavagetto: "AIUI volatile changes every day when we download the geoip databases, that are used on many different systems, not just the traffic ones." [puppet] - https://gerrit.wikimedia.org/r/542922 (https://phabricator.wikimedia.org/T235427) (owner: Jbond)
[10:01:21] (CR) jerkins-bot: [V: -1] puppet_compiler: manage facts dir permissions [puppet] - https://gerrit.wikimedia.org/r/546883 (https://phabricator.wikimedia.org/T236717) (owner: Jbond)
[10:01:23] (CR) Filippo Giunchedi: [C: +1] "> Patch Set 5:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/546217 (https://phabricator.wikimedia.org/T234900) (owner: Jcrespo)
[10:07:01] (CR) Filippo Giunchedi: [C: +1] ipsec: remove check_strongswan in favor of prometheus check [puppet] - https://gerrit.wikimedia.org/r/546666 (https://phabricator.wikimedia.org/T230236) (owner: Herron)
[10:07:08] !log disable cr3-esams:et-1/0/0 (flapping)
[10:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:15] RECOVERY - Check the Netbox report puppetdb for fail status. on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[10:09:20] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 88, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:10:01] (PS1) Urbanecm: Rename Author talk namespace at thwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/546884 (https://phabricator.wikimedia.org/T236640)
[10:11:03] (PS2) Giuseppe Lavagetto: Revert "Remove the portforward right from deploy role" [deployment-charts] - https://gerrit.wikimedia.org/r/544158 (https://phabricator.wikimedia.org/T235821) (owner: Alexandros Kosiaris)
[10:11:27] Operations, ops-esams: cr3-esams:et-1/0/0 flap - https://phabricator.wikimedia.org/T236767 (ayounsi) p:Triage→Normal
[10:11:30] (PS2) Jbond: puppet_compiler: manage facts dir permissions [puppet] - https://gerrit.wikimedia.org/r/546883 (https://phabricator.wikimedia.org/T236717)
[10:11:56] (CR) Giuseppe Lavagetto: [C: +2] Revert "Remove the portforward right from deploy role" [deployment-charts] - https://gerrit.wikimedia.org/r/544158 (https://phabricator.wikimedia.org/T235821) (owner: Alexandros Kosiaris)
[10:12:13] (Merged) jenkins-bot: Revert "Remove the portforward right from deploy role" [deployment-charts] - https://gerrit.wikimedia.org/r/544158 (https://phabricator.wikimedia.org/T235821) (owner: Alexandros Kosiaris)
[10:13:54] PROBLEM - Host logstash2021 is DOWN: PING CRITICAL - Packet loss = 100%
[10:14:06] oof, downtime expire
[10:14:50] RECOVERY - Host logstash2021 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms
[10:15:15] !log oblivian@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[10:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:05] (CR) Jbond: [C: +2] puppet_compiler: manage facts dir permissions [puppet] - https://gerrit.wikimedia.org/r/546883 (https://phabricator.wikimedia.org/T236717) (owner: Jbond)
[10:17:06] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (jcrespo) I noticed db2062 isn't set on m1, is that on purpose because it is going to be decommissioned? Or becuase we didn't want it to alert? Or something else? CC @Marostegui
[10:19:19] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/546638 (owner: Muehlenhoff)
[10:19:44] !log oblivian@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[10:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:56] !log oblivian@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[10:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:02] !log running import on m1-master, m1 replicas will lag for a whileT236406
[10:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:26] (CR) Jbond: "thanks" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/546639 (owner: Muehlenhoff)
[10:22:51] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/546664 (https://phabricator.wikimedia.org/T236487) (owner: BryanDavis)
[10:23:31] !log jakob@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'termbox' for release 'test' .
[10:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:58] (CR) Muehlenhoff: Fix attribute matching in Groovy script (1 comment) [puppet] - https://gerrit.wikimedia.org/r/546639 (owner: Muehlenhoff)
[10:28:11] (PS1) Jakob: Update termbox staging service to latest [deployment-charts] - https://gerrit.wikimedia.org/r/546889
[10:28:45] (CR) Jakob: "This change is ready for review." [deployment-charts] - https://gerrit.wikimedia.org/r/546889 (owner: Jakob)
[10:29:24] !log installing php5 security updates
[10:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:47] (CR) Jbond: [C: +2] puppet: manage localcacert in puppet [puppet] - https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) (owner: Jbond)
[10:31:49] (CR) Tarrow: [C: +2] "Confirmed available on the registry" [deployment-charts] - https://gerrit.wikimedia.org/r/546889 (owner: Jakob)
[10:31:59] (CR) Ladsgroup: [V: +2] Update termbox staging service to latest [deployment-charts] - https://gerrit.wikimedia.org/r/546889 (owner: Jakob)
[10:32:01] (Merged) jenkins-bot: Update termbox staging service to latest [deployment-charts] - https://gerrit.wikimedia.org/r/546889 (owner: Jakob)
[10:32:03] (PS1) Jakob: Update termbox codfw service to latest [deployment-charts] - https://gerrit.wikimedia.org/r/546890
[10:32:05] (PS1) Jakob: Update termbox eqiad service to latest [deployment-charts] - https://gerrit.wikimedia.org/r/546891
[10:32:14] (CR) jerkins-bot: [V: -1] Update termbox codfw service to latest [deployment-charts] - https://gerrit.wikimedia.org/r/546890 (owner: Jakob)
[10:32:33] (PS1) Giuseppe Lavagetto: Blubberoid: fix whitespace management [deployment-charts] - https://gerrit.wikimedia.org/r/546892
[10:33:49] !log jakob@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'termbox' for release 'staging' .
[10:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:53] (CR) Jakob: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/546890 (owner: Jakob)
[10:35:52] (CR) Jakob: "This change is ready for review." [deployment-charts] - https://gerrit.wikimedia.org/r/546890 (owner: Jakob)
[10:36:19] Critical Alert for device cr3-esams.wikimedia.org - Juniper alarm active
[10:36:34] PROBLEM - DPKG on contint2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[10:36:40] <_joe_> XioNoX: ^^ known?
[10:36:45] <_joe_> the cr3-esams alert
[10:36:49] contint2001 is me, on it
[10:37:09] <_joe_> yeah I was more worried about a core router in critical state on librenms
[10:37:18] (PS1) Jbond: base:puppet: update permissions on the ca [puppet] - https://gerrit.wikimedia.org/r/546893
[10:37:22] <_joe_> with all respect for the passive CI master
[10:37:52] (CR) Tarrow: [C: +2] Update termbox codfw service to latest [deployment-charts] - https://gerrit.wikimedia.org/r/546890 (owner: Jakob)
[10:38:05] (Merged) jenkins-bot: Update termbox codfw service to latest [deployment-charts] - https://gerrit.wikimedia.org/r/546890 (owner: Jakob)
[10:39:14] !log jakob@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'termbox' for release 'production' .
[10:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:36] RECOVERY - DPKG on contint2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[10:39:52] (CR) Jbond: [C: +2] base:puppet: update permissions on the ca [puppet] - https://gerrit.wikimedia.org/r/546893 (owner: Jbond)
[10:41:02] _joe_: thx, it's a known issue, only being surfaced now that cr3-esams is back online
[10:41:32] <_joe_> XioNoX: I expected that was the case but wanted to make sure I didn't overlook something important
[10:42:08] (PS1) Gergő Tisza: Add growthexperiments dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/546894 (https://phabricator.wikimedia.org/T208369)
[10:42:10] (PS1) Gergő Tisza: Use dblist for wmgUseGrowthExperiments [mediawiki-config] - https://gerrit.wikimedia.org/r/546895
[10:42:56] (CR) jerkins-bot: [V: -1] Add growthexperiments dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/546894 (https://phabricator.wikimedia.org/T208369) (owner: Gergő Tisza)
[10:43:05] (CR) jerkins-bot: [V: -1] Use dblist for wmgUseGrowthExperiments [mediawiki-config] - https://gerrit.wikimedia.org/r/546895 (owner: Gergő Tisza)
[10:43:07] (PS1) Gergő Tisza: mediawiki: maintenance script for purging old GrowthExperiments data [puppet] - https://gerrit.wikimedia.org/r/546896 (https://phabricator.wikimedia.org/T208369)
[10:43:20] (CR) Jakob: "This change is ready for review." [deployment-charts] - https://gerrit.wikimedia.org/r/546891 (owner: Jakob)
[10:44:18] (CR) Ladsgroup: [V: +2 C: +2] Update termbox eqiad service to latest [deployment-charts] - https://gerrit.wikimedia.org/r/546891 (owner: Jakob)
[10:44:24] (PS2) Muehlenhoff: Pass the Groovy script with a file URI [puppet] - https://gerrit.wikimedia.org/r/546638
[10:45:20] (PS1) Jbond: puppet_compiler: bump puppet-compiler version [puppet] - https://gerrit.wikimedia.org/r/546897 (https://phabricator.wikimedia.org/T236468)
[10:46:20] (PS2) Arturo Borrero Gonzalez: toolforge: k8s: ingress: use docker image from internal registry [puppet] - https://gerrit.wikimedia.org/r/546459 (https://phabricator.wikimedia.org/T236249)
[10:46:28] (PS2) Gergő Tisza: Add growthexperiments dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/546894 (https://phabricator.wikimedia.org/T208369)
[10:46:30] (PS2) Gergő Tisza: Use dblist for wmgUseGrowthExperiments [mediawiki-config] - https://gerrit.wikimedia.org/r/546895
[10:46:31] !log jakob@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'termbox' for release 'production' .
[10:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:42] Operations, Traffic: varnish-fe is flooding the text backend caching layer with backend probe requests - https://phabricator.wikimedia.org/T236754 (ema) On cache_text we have a fairly significant number of VCL files stuck in the "auto/busy" state after having been discarded by our reload script. As an ex...
[10:47:02] (CR) Jbond: [C: +2] puppet_compiler: bump puppet-compiler version [puppet] - https://gerrit.wikimedia.org/r/546897 (https://phabricator.wikimedia.org/T236468) (owner: Jbond)
[10:47:17] Operations, Traffic: varnish-fe is flooding the text backend caching layer with backend probe requests - https://phabricator.wikimedia.org/T236754 (ema) p:Triage→Normal
[10:48:07] Operations, Traffic: Discarded VCL files stuck in auto/busy state cause high number of backend probe requests - https://phabricator.wikimedia.org/T236754 (ema)
[10:48:25] (CR) Muehlenhoff: [C: +2] Pass the Groovy script with a file URI [puppet] - https://gerrit.wikimedia.org/r/546638 (owner: Muehlenhoff)
[10:49:20] (PS3) Muehlenhoff: Fix attribute matching in Groovy script [puppet] - https://gerrit.wikimedia.org/r/546639
[10:51:36] (PS4) Muehlenhoff: Fix attribute matching in Groovy script [puppet] - https://gerrit.wikimedia.org/r/546639
[10:51:38] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[10:51:38] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[10:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:45] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[10:51:46] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:45] Operations, ops-eqiad, DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 (Jclark-ctr) starting pdu refresh
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191029T1100).
[11:00:04] No GERRIT patches in the queue for this window AFAICS.
[11:02:26] Operations, DBA, serviceops, Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (jcrespo) I've created a copy of the bacula database on the bacula9 one, and then ran: ` sudo -u bacula ./update_mysql_tables -h m1-master.eqiad.wmnet bacul...
[11:03:59] there is lag on m1, which I had downtimed, this is expected
[11:05:06] (CR) Jcrespo: "> Previous art is tests in the same directory and anything that 'nose' can find to run. Nose gets run via tox.ini plus changing rake_modul" [puppet] - https://gerrit.wikimedia.org/r/546217 (https://phabricator.wikimedia.org/T234900) (owner: Jcrespo)
[11:05:48] RECOVERY - Check systemd state on elastic1039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:09:22] (CR) Kosta Harlan: [C: +1] Add growthexperiments dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/546894 (https://phabricator.wikimedia.org/T208369) (owner: Gergő Tisza)
[11:09:32] (CR) Muehlenhoff: [C: +2] Fix attribute matching in Groovy script [puppet] - https://gerrit.wikimedia.org/r/546639 (owner: Muehlenhoff)
[11:09:44] (CR) Kosta Harlan: [C: +1] Use dblist for wmgUseGrowthExperiments [mediawiki-config] - https://gerrit.wikimedia.org/r/546895 (owner: Gergő Tisza)
[11:10:13] (CR) Gehel: [C: +2] elasticsearch: elasticsearch package renamed to elasticsearch-oss [cookbooks] - https://gerrit.wikimedia.org/r/546877 (owner: Gehel)
[11:10:36] PROBLEM - Check systemd state on elastic1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:17:08] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-upgrade
[11:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:11] !log gehel@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=99)
[11:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:41] (PS1) Hashar: releases: expand list of packages [puppet] - https://gerrit.wikimedia.org/r/546905 (https://phabricator.wikimedia.org/T236774)
[11:20:43] (PS1) Hashar: releases: remove php-xdebug [puppet] - https://gerrit.wikimedia.org/r/546906 (https://phabricator.wikimedia.org/T236774)
[11:25:16] (PS2) Giuseppe Lavagetto: Blubberoid: fix whitespace management [deployment-charts] - https://gerrit.wikimedia.org/r/546892
[11:25:21] I'm going to sneak into the ongoing SWAT window
[11:25:28] (PS2) Urbanecm: Rename Author talk namespace at thwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/546884 (https://phabricator.wikimedia.org/T236640)
[11:25:36] (CR) Urbanecm: [C: +2] Rename Author talk namespace at thwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/546884 (https://phabricator.wikimedia.org/T236640) (owner: Urbanecm)
[11:25:52] Operations, Release-Engineering-Team-TODO, Patch-For-Review, Release-Engineering-Team (Development services): Remove php-xdebug from releases1001 / releases2001 - https://phabricator.wikimedia.org/T236774 (hashar)
[11:26:28] (Merged) jenkins-bot: Rename Author talk namespace at thwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/546884 (https://phabricator.wikimedia.org/T236640) (owner: Urbanecm)
[11:28:39] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: fc9920e: Rename Author talk namespace at thwikisource (T236640) (duration: 00m 56s)
[11:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:44] T236640: Rename namespace on Thai Wikisource - https://phabricator.wikimedia.org/T236640
[11:31:18] (CR) Urbanecm: [C: +2] Allow AbuseFilter to issue blocks on es.wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/546700 (https://phabricator.wikimedia.org/T236730) (owner: MarcoAurelio)
[11:31:54] (Merged) jenkins-bot: Allow AbuseFilter to issue blocks on es.wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/546700 (https://phabricator.wikimedia.org/T236730) (owner: MarcoAurelio)
[11:32:21] (CR) Jcrespo: "> Patch Set 1:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/544665 (https://phabricator.wikimedia.org/T229209) (owner: Jcrespo)
[11:33:11] (PS2) Urbanecm: Enable partial blocks on kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/546780 (https://phabricator.wikimedia.org/T236752) (owner: Revi)
[11:33:28] (CR) Urbanecm: [C: +2] Enable partial blocks on kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/546780 (https://phabricator.wikimedia.org/T236752) (owner: Revi)
[11:33:47] what the f
[11:33:56] it is WIP
[11:34:22] (Merged) jenkins-bot: Enable partial blocks on kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/546780 (https://phabricator.wikimedia.org/T236752) (owner: Revi)
[11:34:23] Urbanecm: hey
[11:34:51] !log urbanecm@deploy1001 Synchronized wmf-config/abusefilter.php: SWAT: faeb8f1: Allow AbuseFilter to issue blocks on es.wikinews (T236730) (duration: 00m 53s)
[11:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:55] T236730: Enable AbuseFilter blocker on es.wikinews - https://phabricator.wikimedia.org/T236730
[11:34:56] (PS1) Muehlenhoff: Also pass down mfa-additional-method and mfa-force-method attributes [puppet] - https://gerrit.wikimedia.org/r/546909
[11:35:33] Urbanecm: hello?
[11:35:37] Urbanecm: hi
[11:35:38] revi: opps, sorry 🙂.
[11:36:08] (03PS1) 10Urbanecm: Revert "Enable partial blocks on kowiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546910 (https://phabricator.wikimedia.org/T236752) [11:36:17] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Revert "Enable partial blocks on kowiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546910 (https://phabricator.wikimedia.org/T236752) (owner: 10Urbanecm) [11:37:07] !log EU SWAT done [11:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:14] it was clearly marked as WIP, pls take a look at the interface next time [11:37:54] revi: I'm afraid that state was losted after the review the patch got [11:38:12] It was lost because you uploaded PS2 [11:38:39] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/546780#message-39c7b3a966e134ef9a4ef55dcb65b7b3080a4d48 [11:39:20] revi: I think it was lost after https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/546780#message-aebc83fc472b46b2a12d30817a68237ececa0d39 [11:39:49] however, it still doesn't change that it was not on SWAT list [11:40:01] anyway, I'm sorry, will pay more attention next time 🙂 [11:40:07] ok :-) [11:40:28] Lesson: say DO NOT MERGE on commit log, don't trust web [11:43:30] (03PS3) 10Giuseppe Lavagetto: Blubberoid: fix whitespace management [deployment-charts] - 10https://gerrit.wikimedia.org/r/546892 [11:43:32] (03PS8) 10Giuseppe Lavagetto: scaffold: only expose one port as a service by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/544629 [11:44:10] (03PS1) 10Urbanecm: Revert "Milestone lobo for atjwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546912 (https://phabricator.wikimedia.org/T236777) [11:44:49] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Blubberoid: fix whitespace management [deployment-charts] - 10https://gerrit.wikimedia.org/r/546892 (owner: 10Giuseppe Lavagetto) [11:45:06] (03Merged) 10jenkins-bot: Blubberoid: fix whitespace management [deployment-charts] - 
10https://gerrit.wikimedia.org/r/546892 (owner: 10Giuseppe Lavagetto) [11:45:19] (03PS1) 10Revi: WIP, DO NOT MERGE: Enable partial blocks on kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546913 (https://phabricator.wikimedia.org/T236752) [11:48:11] !log oblivian@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' . [11:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:42] <_joe_> effie: do you want to do the prod releases? ^^ [11:49:27] no [11:49:29] :D [11:49:32] <_joe_> :D [11:50:10] jouncebot: now [11:50:11] For the next 0 hour(s) and 9 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191029T1100) [11:50:12] Urbanecm: ping? [11:50:20] hauskater: yup? [11:50:38] Urbanecm: still time to deploy an abusefilter.php patch? [11:51:18] hauskater: sure. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/546700 is already synced through ;) [11:51:18] oh wait, you SWATed it already [11:51:22] you're awesome [11:51:33] that's the one [11:52:59] enjoy your new feature hauskater [11:53:59] well I'm not really a Wikinews user [11:54:17] they have issues with... 
yeah you've guessed, spambots [11:54:28] PROBLEM - Host ms-be1020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:55:22] PROBLEM - Host ms-be1047 is DOWN: PING CRITICAL - Packet loss = 100% [11:56:28] RECOVERY - Host ms-be1047 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [11:57:21] (03PS1) 10Alexandros Kosiaris: filter_log: Avoid spaces in file name [puppet] - 10https://gerrit.wikimedia.org/r/546916 [11:58:47] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/19118/" [puppet] - 10https://gerrit.wikimedia.org/r/546909 (owner: 10Muehlenhoff) [12:00:08] RECOVERY - Host ms-be1020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [12:01:18] 10Operations, 10DBA, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10akosiaris) [12:02:58] PROBLEM - Host ms-be1047.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:04:00] PROBLEM - Host db1099.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:04:36] PROBLEM - Host cloudvirt1008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:04:54] PROBLEM - Host cloudvirt1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:05:30] PROBLEM - Host cloudvirt1024.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:05:30] PROBLEM - Host an-worker1084.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:05:30] PROBLEM - Host an-worker1083.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:05:30] PROBLEM - Host cloudelastic1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:05:52] 10Operations, 10ops-eqiad, 10DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 (10Jclark-ctr) Finished with swapping of pdu`s [12:06:20] PROBLEM - Host cp1080.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:06:28] PROBLEM - Host cloudvirt1015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:06:36] PROBLEM - Host cp1079.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:06:40] 
PROBLEM - Host analytics1072.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:06:40] PROBLEM - Host cloudvirt1018.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:06:41] PROBLEM - Host cloudvirt1009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:06:44] PROBLEM - Host cloudvirt1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:07:33] (03CR) 10Alexandros Kosiaris: [C: 03+1] backup: Migrate bacula director from helium to backup1001 [puppet] - 10https://gerrit.wikimedia.org/r/544665 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [12:07:36] PROBLEM - Host cloudvirt1012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:07:56] PROBLEM - Host ms-be1020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:08:46] RECOVERY - Host ms-be1047.mgmt is UP: PING WARNING - Packet loss = 50%, RTA = 1.11 ms [12:09:32] PROBLEM - Host asw2-b-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:09:54] RECOVERY - Host cloudvirt1018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.37 ms [12:10:20] RECOVERY - Host cloudvirt1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [12:10:30] RECOVERY - Host asw2-b-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [12:10:38] RECOVERY - Host cloudvirt1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [12:10:56] RECOVERY - Host db1099.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [12:11:14] RECOVERY - Host cloudelastic1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.23 ms [12:11:14] RECOVERY - Host an-worker1083.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms [12:11:14] RECOVERY - Host an-worker1084.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms [12:11:14] RECOVERY - Host cloudvirt1024.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms [12:12:00] RECOVERY - Host cp1080.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [12:12:10] RECOVERY - Host cloudvirt1015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [12:12:18] RECOVERY - Host cp1079.mgmt is UP: 
PING OK - Packet loss = 0%, RTA = 1.03 ms [12:12:22] RECOVERY - Host cloudvirt1009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [12:12:22] RECOVERY - Host analytics1072.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [12:12:26] RECOVERY - Host cloudvirt1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [12:13:08] (03CR) 10MarcoAurelio: "It would be good to have this reviewed, yes. I'm not sure how to work with Reuven to have this reviewed and tested on his end though. Shal" [puppet] - 10https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio) [12:13:18] RECOVERY - Host cloudvirt1012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [12:13:34] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:13:38] RECOVERY - Host ms-be1020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [12:13:54] (03PS2) 10Jcrespo: backup: Migrate bacula director from helium to backup1001 [puppet] - 10https://gerrit.wikimedia.org/r/544665 (https://phabricator.wikimedia.org/T229209) [12:15:09] (03CR) 10Jcrespo: [C: 03+2] backup: Migrate bacula director from helium to backup1001 [puppet] - 10https://gerrit.wikimedia.org/r/544665 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [12:16:46] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:26:19] (03CR) 10Jbond: [C: 03+1] "looks good, one nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/546909 (owner: 10Muehlenhoff) [12:29:06] !log delete all production00 volumes on backup1001 [12:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:26] !log Stopping Zuul / Jenkins for upgrade [12:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:40] !log Restarting Zuul / Jenkins [12:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:42] hauskater: ^^^ [12:33:57] oh [12:34:01] merci :) [12:34:19] (03PS1) 10Ayounsi: Revert "Smokeping, remove cr3-esams while it's not working" [puppet] - 10https://gerrit.wikimedia.org/r/546919 [12:39:56] (03PS2) 10Muehlenhoff: Also pass down mfa-additional-method and mfa-force-method attributes [puppet] - 10https://gerrit.wikimedia.org/r/546909 [12:43:05] PROBLEM - Check the Netbox report librenms for fail status. 
on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [12:45:51] (03CR) 10Ayounsi: [C: 03+2] Revert "Smokeping, remove cr3-esams while it's not working" [puppet] - 10https://gerrit.wikimedia.org/r/546919 (owner: 10Ayounsi) [12:47:09] 10Operations, 10netops: Configure conditional advertizing in eqdfw and knams - https://phabricator.wikimedia.org/T236785 (10ayounsi) [12:47:27] (03PS1) 10Hashar: Dummy build for CI [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/546922 [12:48:12] (03CR) 10Hashar: "recheck CI added by https://gerrit.wikimedia.org/r/#/c/integration/config/+/546921/" [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: 10Holger Knust) [12:57:26] (03CR) 10Masumrezarock100: [C: 03+1] Revert "Milestone lobo for atjwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546912 (https://phabricator.wikimedia.org/T236777) (owner: 10Urbanecm) [12:58:40] (03CR) 10Jcrespo: "Filippo, allow me to merge this early and do both your suggestions at a later patch so to deploy the latest version with a buster compatib" [puppet] - 10https://gerrit.wikimedia.org/r/546217 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [13:01:08] (03PS6) 10Jcrespo: bacula: Add verbose & single job modes for backup freshness check [puppet] - 10https://gerrit.wikimedia.org/r/546217 (https://phabricator.wikimedia.org/T234900) [13:01:39] (03CR) 10jerkins-bot: [V: 04-1] table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: 10Holger Knust) [13:05:49] (03CR) 10Jcrespo: [C: 03+2] bacula: Add verbose & single job modes for backup freshness check [puppet] - 10https://gerrit.wikimedia.org/r/546217 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [13:05:51] PROBLEM - MediaWiki memcached error rate on icinga1001 
is CRITICAL: 8995 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:07:25] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 3 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:07:53] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 92 (an-master1002, ...), No backups: 1 (webperf2002), Fresh: 1 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [13:08:57] ^that is expected [13:09:08] no new backups on the new host yet [13:11:33] (03CR) 10Hashar: "The repository lacks the upstream code / tags :D" [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/546922 (owner: 10Hashar) [13:12:14] 10Operations, 10hardware-requests: eqiad+codfw: 6x hardware request for swift backend (each site) - https://phabricator.wikimedia.org/T227314 (10fgiunchedi) [13:13:09] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10BBlack) [13:13:11] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10BBlack) [13:14:01] memcached on mc1033 was restarted when the mw memcached error rate happened FYI [13:14:23] 10Operations, 10ops-esams, 10DNS, 10Traffic: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10BBlack) [13:14:26] due to the mcrouter's TKO [13:14:35] 10Operations, 10ops-esams, 10DNS, 10Traffic: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10BBlack) 05Open→03Resolved [13:14:37] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10BBlack) [13:15:20] 10Operations, 
10ops-esams, 10Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10BBlack) [13:15:31] 10Operations, 10ops-esams, 10Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10BBlack) 05Open→03Resolved [13:15:44] (03CR) 10Hashar: "Should import tags and code from upstream repository https://github.com/dennisstritzke/ipsec_exporter" [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/546922 (owner: 10Hashar) [13:16:07] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [13:16:48] (03PS1) 10Jbond: puppet_compile: pass p9uppet version to the correct template [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/546926 [13:17:07] (03PS2) 10Jbond: puppet_compile: pass puppet version to the correct template [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/546926 [13:17:37] (03CR) 10Jbond: [C: 03+2] puppet_compile: pass puppet version to the correct template [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/546926 (owner: 10Jbond) [13:17:44] 10Operations, 10ops-esams, 10Traffic: cp3036 PS Redundancy Lost - https://phabricator.wikimedia.org/T202627 (10BBlack) [13:17:49] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (10BBlack) [13:18:03] 10Operations, 10ops-esams, 10Traffic: cp3032 PS Redundancy Lost - https://phabricator.wikimedia.org/T202046 (10BBlack) [13:18:06] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (10BBlack) [13:18:23] 10Operations, 10ops-eqiad, 10DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 (10Jclark-ctr) [13:18:43] 10Operations, 10ops-eqiad, 10DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - 
https://phabricator.wikimedia.org/T227538 (10Jclark-ctr) a:05Cmjohnson→03RobH [13:19:02] 10Operations, 10DBA, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) //TODO: Move log from /var/lib/bacula/log to /var/log/bacula/log [13:19:28] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (10BBlack) [13:19:32] 10Operations, 10ops-eqiad, 10DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 (10Jclark-ctr) netbox updated/ serial connected [13:22:01] 10Operations, 10DC-Ops, 10Traffic: poll power data for redeployment of esams/knams - https://phabricator.wikimedia.org/T225720 (10BBlack) 05Open→03Resolved [13:23:01] (03CR) 10Hashar: "The Debian packaging with git buildpackage fails with:" [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: 10Holger Knust) [13:24:15] 10Operations, 10ops-esams, 10Traffic, 10decommission: Decommission cp3007-cp3010 - https://phabricator.wikimedia.org/T208585 (10BBlack) [13:25:13] (03PS1) 10Alexandros Kosiaris: jessie: Downgrade to TLSv1 for backup [puppet] - 10https://gerrit.wikimedia.org/r/546928 (https://phabricator.wikimedia.org/T224549) [13:25:27] (03CR) 10Ottomata: "Wowee" [puppet] - 10https://gerrit.wikimedia.org/r/546830 (https://phabricator.wikimedia.org/T233231) (owner: 10Elukey) [13:28:36] 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10BBlack) [13:28:52] (03PS1) 10Jbond: puppet_compile: use full puppet version in html report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/546930 [13:29:18] 10Operations, 10serviceops, 10PHP 7.2 support: Mysterious, coordinated slowdowns every ~ 25 minutes on API servers - https://phabricator.wikimedia.org/T231011 (10jijiki) 
[13:29:48] (03CR) 10jerkins-bot: [V: 04-1] puppet_compile: use full puppet version in html report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/546930 (owner: 10Jbond) [13:30:12] 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10BBlack) 05Open→03Resolved As a batch these servers are complete in general. Note cp3056 had an early hardware issue that prevented progress, but this is tracked separately in:... [13:32:26] (03PS1) 10Hashar: Allow force push to upstream to populate repo [debs/prometheus-ipsec-exporter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/546931 [13:32:39] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: new deployment group and access for design site - Volker Eckl, Jan Drewniak, Amir Sarabadani - https://phabricator.wikimedia.org/T236518 (10Ottomata) @Dzahn I'm on clinic duty this week! Does this need any more waiting period time or SRE meeting discu... [13:32:57] (03CR) 10Hashar: "That is to push master of https://github.com/dennisstritzke/ipsec_exporter to 'upstream' in Gerrit." 
[debs/prometheus-ipsec-exporter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/546931 (owner: 10Hashar) [13:33:02] (03CR) 10Hashar: [V: 03+2 C: 03+2] Allow force push to upstream to populate repo [debs/prometheus-ipsec-exporter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/546931 (owner: 10Hashar) [13:37:19] (03PS1) 10Hashar: Revert "Allow force push to upstream to populate repo" [debs/prometheus-ipsec-exporter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/546932 [13:37:27] (03CR) 10Hashar: [V: 03+2 C: 03+2] Revert "Allow force push to upstream to populate repo" [debs/prometheus-ipsec-exporter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/546932 (owner: 10Hashar) [13:37:36] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10Ottomata) Those groups are correct, thanks Cole! @CGlenn If your manager can't approve, I suppose @nuria can? Also, for SS... 
[13:38:16] (03PS1) 10Andrew Bogott: horizon: add GITTILES_BASE_URL to config [puppet] - 10https://gerrit.wikimedia.org/r/546933 [13:39:01] (03CR) 10Andrew Bogott: [C: 03+2] horizon: add GITTILES_BASE_URL to config [puppet] - 10https://gerrit.wikimedia.org/r/546933 (owner: 10Andrew Bogott) [13:40:28] (03CR) 10DCausse: query_service: rename wdqs module to query_service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [13:42:08] (03PS2) 10Hashar: Dummy build for CI [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/546922 [13:42:10] (03PS1) 10Hashar: Merge tag 'v0.3.1' [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/546934 [13:44:58] (03PS3) 10Hashar: Dummy build for CI [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/546922 [13:46:08] (03PS1) 10Jbond: puppet_compiler: fix unit tests [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/546936 [13:46:16] (03Abandoned) 10BBlack: depool esams [dns] - 10https://gerrit.wikimedia.org/r/546328 (owner: 10BBlack) [13:46:47] (03PS2) 10BBlack: Dots have meaning in regexes :P [puppet] - 10https://gerrit.wikimedia.org/r/546180 [13:47:26] (03CR) 10Jbond: [C: 03+2] puppet_compiler: fix unit tests [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/546936 (owner: 10Jbond) [13:48:23] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Kind of quick and dirty, but should be ok for now. 
We can then revert backup to TLSv1.2 and remove this" [puppet] - 10https://gerrit.wikimedia.org/r/546928 (https://phabricator.wikimedia.org/T224549) (owner: 10Alexandros Kosiaris) [13:48:25] (03CR) 10BBlack: [C: 03+2] Dots have meaning in regexes :P [puppet] - 10https://gerrit.wikimedia.org/r/546180 (owner: 10BBlack) [13:50:38] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/546928 (https://phabricator.wikimedia.org/T224549) (owner: 10Alexandros Kosiaris) [13:52:07] (03CR) 10Alexandros Kosiaris: [C: 03+2] jessie: Downgrade to TLSv1 for backup [puppet] - 10https://gerrit.wikimedia.org/r/546928 (https://phabricator.wikimedia.org/T224549) (owner: 10Alexandros Kosiaris) [13:52:14] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/546928 (https://phabricator.wikimedia.org/T224549) (owner: 10Alexandros Kosiaris) [13:52:26] (03PS4) 10Hashar: git buildpackage and packaging tweaks [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/546922 [13:55:31] (03PS1) 10Alexandros Kosiaris: Remove an extraneous 's' character [puppet] - 10https://gerrit.wikimedia.org/r/546938 [13:55:46] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove an extraneous 's' character [puppet] - 10https://gerrit.wikimedia.org/r/546938 (owner: 10Alexandros Kosiaris) [13:56:43] (03CR) 10Alexandros Kosiaris: "> It could be a moving a target (more keys could be added/used). 
There are discussions going around about how to fix it (this pain is fel" [puppet] - 10https://gerrit.wikimedia.org/r/544964 (https://phabricator.wikimedia.org/T235675) (owner: 10Alexandros Kosiaris) [14:03:07] (03PS1) 10Herron: prometheus-ipsec-exporter: extend description [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/546939 [14:04:10] (03CR) 10Hashar: "I switched the debian/changelog to "buster" and:" [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/546922 (owner: 10Hashar) [14:05:41] (03CR) 10Herron: [C: 03+2] prometheus-ipsec-exporter: extend description [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/546939 (owner: 10Herron) [14:05:47] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/546219 (https://phabricator.wikimedia.org/T234283) (owner: 10Effie Mouzeli) [14:10:23] (03CR) 10Herron: [C: 03+2] prometheus-ipsec-exporter: extend description [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/546939 (owner: 10Herron) [14:13:33] elukey: pong [14:14:33] here I am :) [14:14:54] so should I just upgrade python-kafka on webperf nodes? Or do you have a different procedure? [14:14:56] (03CR) 10Krinkle: logstash: send PHP7 fatal-error messages type:mediawiki channel:fatal (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/546219 (https://phabricator.wikimedia.org/T234283) (owner: 10Effie Mouzeli) [14:15:06] elukey: I'm not sure, yeah, I guess? 
[14:15:31] Then to make sure the navtiming and other py services are restarted and I'll monitor graphs and journal [14:16:28] we can start from 2001 if you want [14:16:48] not sure if it works on real traffic or only waiting as standby [14:18:01] elukey: it's active-active for the most part [14:18:08] this is the last task for the record https://phabricator.wikimedia.org/T221848 [14:18:28] I think for one of them (navtiming.py) we use etcd to coordinate only one being active at once, but should generally be considered active-active [14:18:42] in any event starting with 2001 is gine [14:18:43] fin [14:18:44] e [14:18:59] all right, updating [14:19:21] (03PS1) 10Muehlenhoff: Bump to 6.1.0 which got released yesterday [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/546945 [14:19:23] done on 2001 Krinkle [14:20:05] OK. .figure out now how to restart the service or whether that would happen automatically by systemd [14:20:10] (03PS2) 10Hashar: Merge tag 'v0.3.1' [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/546934 [14:20:12] (03PS5) 10Hashar: git buildpackage and packaging tweaks [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/546922 [14:21:11] Krinkle: namely I have to do it or you are looking into it? 
[14:21:45] (03PS1) 10Jbond: puppetmaster1003: change puppetmaster to canary [puppet] - 10https://gerrit.wikimedia.org/r/546948 (https://phabricator.wikimedia.org/T235655) [14:21:47] (03PS1) 10Jbond: puppetmaster1003: switch puppetdbs so puppetdb1002 is prefered [puppet] - 10https://gerrit.wikimedia.org/r/546949 (https://phabricator.wikimedia.org/T235655) [14:21:50] (03PS1) 10Jbond: puppetdb: move icinga2001 to the new puppetdb server [puppet] - 10https://gerrit.wikimedia.org/r/546950 (https://phabricator.wikimedia.org/T235655) [14:23:30] elukey: he, may be better for you to do it :) [14:23:39] statsv, coal and navtiming [14:24:00] I'll document it on https://wikitech.wikimedia.org/wiki/Performance/Runbook/Webperf-processor_services for future referencce then [14:24:15] they are systemd units so I can restart them yes [14:24:28] I have sudo there but don't want to copy random commands from the internet. [14:24:38] okok restarting them then [14:24:39] What's the appropriate way in our infra to restart a systemd service? 
[14:24:48] systemctl restart nameoftheservice [14:24:55] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/546948 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [14:24:55] (with sudo) [14:26:14] I can do it if you want, no problem for me [14:26:32] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/546949 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [14:26:44] OK, I'll do it now [14:27:00] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/546950 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [14:27:02] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/542922 (https://phabricator.wikimedia.org/T235427) (owner: 10Jbond) [14:27:23] (03CR) 10Jbond: [C: 03+2] puppetmaster1003: change puppetmaster to canary [puppet] - 10https://gerrit.wikimedia.org/r/546948 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [14:27:34] !log krinkle@webperf2001 Restart navtiming, coal and statsv services [14:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:52] 10Operations, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (CI & Testing services): cowbuilder cron update fails for buster - https://phabricator.wikimedia.org/T236796 (10hashar) [14:28:21] coal and navtiming look good [14:28:23] 10Operations, 10LDAP-Access-Requests: Add bawolff to either NDA or WMF ldap group - https://phabricator.wikimedia.org/T236636 (10Ottomata) I have added bawolff to the wmf LDAP group. @bawolff can you confirm you can access what you need? 
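Note on the restart step discussed above: the exchange settles on `systemctl restart nameoftheservice` (with sudo), run for navtiming, coal and statsv on one webperf host at a time. A minimal sketch of that sequence follows. It assumes the systemd unit names match the service names used in the chat, which the log does not confirm, and the `restart_services` helper and `DRY_RUN` switch are hypothetical, added only for illustration. By default it prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the log: restart a list of systemd
# units one by one and stop at the first failure.
# DRY_RUN=1 (the default) only prints the commands that would be run.
DRY_RUN="${DRY_RUN:-1}"

restart_services() {
    for svc in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "sudo systemctl restart $svc"
        else
            # Restart the unit, then confirm systemd reports it as active.
            sudo systemctl restart "$svc" || return 1
            if ! systemctl is-active --quiet "$svc"; then
                echo "$svc is not active after restart" >&2
                return 1
            fi
        fi
    done
}

# Unit names are assumed to match the service names mentioned in the chat.
restart_services navtiming coal statsv
```

Running it with `DRY_RUN=0` on a single host first, and checking graphs and the journal before moving to the next host, would mirror the order followed in the conversation (webperf2001, then webperf1001).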
[14:28:30] 10Operations, 10LDAP-Access-Requests: Add bawolff to either NDA or WMF ldap group - https://phabricator.wikimedia.org/T236636 (10Ottomata) a:03Ottomata [14:28:51] statsv as well [14:28:55] elukey: next node :) [14:29:02] ack :) [14:29:45] !log upgrade python-kafka on webperf[12]001 - T234808 [14:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:50] T234808: Upgrade python-kafka to 1.4.7 - https://phabricator.wikimedia.org/T234808 [14:30:03] elukey: also webperf11 in beta cluster [14:31:07] 10Operations, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (CI & Testing services): cowbuilder cron update fails for buster - https://phabricator.wikimedia.org/T236796 (10hashar) ` # head /var/cache/pbuilder/base-buster-amd64.cow/var/lib/apt/lists/mirrors.wikimedia.org_debian_dists_buster_InRe... [14:31:59] Krinkle: 1001 done [14:32:33] !log krinkle@webperf1001.eqiad Restart navtiming, coal and statsv services [14:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:39] (03CR) 10Jbond: [C: 03+2] puppetmaster1003: switch puppetdbs so puppetdb1002 is prefered [puppet] - 10https://gerrit.wikimedia.org/r/546949 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [14:32:41] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (10fgiunchedi) [14:33:22] Krinkle: webperf11 is already upgrade (cloud vms do auto upgrade packages) [14:33:31] *upgraded [14:33:57] (03PS20) 10Holger Knust: table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) [14:35:08] 10Operations, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (CI & Testing services): cowbuilder cron update fails for buster - https://phabricator.wikimedia.org/T236796 (10hashar) 05Open→03Resolved a:03hashar I have just moved the file: ` # mv 
/var/cache/pbuilder/base-buster-amd64.cow/var... [14:35:10] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Bump to 6.1.0 which got released yesterday [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/546945 (owner: 10Muehlenhoff) [14:35:42] (03CR) 10Hashar: "recheck buster cow images were out of date due to T236796" [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/546922 (owner: 10Hashar) [14:35:45] LGTM [14:35:47] elukey: OK [14:37:07] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog (Kanban): Configure Google Cloud Vision credentials in production - https://phabricator.wikimedia.org/T236426 (10Mholloway) [14:37:25] Krinkle: super, upgrade completed then :) [14:38:51] (03CR) 10Holger Knust: "The build issues are the result of newer versions of Pylint and Mypy. Pylint 2.3.1 which I used found not issues. Pylint 2.4.3 reported er" [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: 10Holger Knust) [14:40:03] elukey: Thanks! [14:41:01] (03CR) 10Jbond: [C: 03+2] puppetdb: move icinga2001 to the new puppetdb server [puppet] - 10https://gerrit.wikimedia.org/r/546950 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [14:41:12] (03PS2) 10Jbond: puppetdb: move icinga2001 to the new puppetdb server [puppet] - 10https://gerrit.wikimedia.org/r/546950 (https://phabricator.wikimedia.org/T235655) [14:42:01] 10Operations, 10LDAP-Access-Requests: Add bawolff to either NDA or WMF ldap group - https://phabricator.wikimedia.org/T236636 (10Ottomata) Ah, sorry @bawolff, I missed that you were no longer staff. Can you update your Phab and MW profile pages accordingly? I think we don't yet have a volunteer NDA on file f... 
[14:42:16] (03CR) 10Holger Knust: "As for the Debian build, I am still on a learning curve when it comes to the build process so take this with a grain of salt but my unders" [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: 10Holger Knust) [14:44:16] 10Operations, 10ops-ulsfo, 10Traffic, 10Wikidata, and 2 others: large number of 504 errors from ulsfo - https://phabricator.wikimedia.org/T236500 (10Ottomata) a:03ema Ema looks like you are working on this. Assigning to you as part of clinic duty. Feel free to resolve if done. [14:44:17] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:39] ^ that's me, silencing [14:45:46] 10Operations, 10ops-ulsfo, 10Traffic, 10Wikidata, and 2 others: large number of 504 errors from ulsfo - https://phabricator.wikimedia.org/T236500 (10ema) 05Open→03Resolved It is done, yes. Thanks @ottomata! 
[14:45:56] !log setting up ps1-b2-eqiad, librenms will output a couple reboots from it T227538 [14:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:02] T227538: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 [14:46:45] (03PS1) 10Filippo Giunchedi: swift-add-machine: use per-port devices [software/swift-ring] - 10https://gerrit.wikimedia.org/r/546952 (https://phabricator.wikimedia.org/T222366) [14:48:04] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog (Kanban): Configure Google Cloud Vision credentials in production - https://phabricator.wikimedia.org/T236426 (10Mholloway) [14:48:25] 10Operations, 10Machine vision, 10serviceops, 10Product-Infrastructure-Team-Backlog (Kanban): Configure Google Cloud Vision credentials in production - https://phabricator.wikimedia.org/T236426 (10Mholloway) [14:48:36] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10serviceops: How should the MachineVision extension interact with external APIs from production? 
- https://phabricator.wikimedia.org/T236797 (10Mholloway) [14:50:14] (03PS1) 10CDanis: net_driver fact: add firmware_version [puppet] - 10https://gerrit.wikimedia.org/r/546953 (https://phabricator.wikimedia.org/T236744) [14:50:39] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:51:27] (03PS4) 10Daniel Kinzler: Re-apply: MCR: Set testwiki to use the new MCR-only schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546875 (https://phabricator.wikimedia.org/T198558) [14:52:14] (03CR) 10jerkins-bot: [V: 04-1] net_driver fact: add firmware_version [puppet] - 10https://gerrit.wikimedia.org/r/546953 (https://phabricator.wikimedia.org/T236744) (owner: 10CDanis) [14:52:51] (03PS1) 10Filippo Giunchedi: swift: default to multiple object servers per port [puppet] - 10https://gerrit.wikimedia.org/r/546956 (https://phabricator.wikimedia.org/T222366) [14:53:13] (03PS1) 10Vgutierrez: ATS: Track total client connections for HTTP/2 clients [puppet] - 10https://gerrit.wikimedia.org/r/546957 (https://phabricator.wikimedia.org/T236458) [14:53:38] (03PS1) 10RobH: setting ps1-b2-eqiad model info [puppet] - 10https://gerrit.wikimedia.org/r/546959 (https://phabricator.wikimedia.org/T227538) [14:53:53] (03PS2) 10CDanis: net_driver fact: add firmware_version [puppet] - 10https://gerrit.wikimedia.org/r/546953 (https://phabricator.wikimedia.org/T236744) [14:54:31] (03CR) 10RobH: [C: 03+2] setting ps1-b2-eqiad model info [puppet] - 10https://gerrit.wikimedia.org/r/546959 (https://phabricator.wikimedia.org/T227538) (owner: 10RobH) [14:55:25] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:36] (03CR) 10CDanis: [C: 03+2] swift-add-machine: use per-port devices [software/swift-ring] - 10https://gerrit.wikimedia.org/r/546952 (https://phabricator.wikimedia.org/T222366) (owner: 10Filippo Giunchedi) [14:55:45] (03CR) 10CDanis: [V: 03+2 C: 03+2] swift-add-machine: use per-port devices [software/swift-ring] - 10https://gerrit.wikimedia.org/r/546952 (https://phabricator.wikimedia.org/T222366) (owner: 10Filippo Giunchedi) [14:56:29] (03PS2) 10Filippo Giunchedi: swift: default to multiple object servers per port [puppet] - 10https://gerrit.wikimedia.org/r/546956 (https://phabricator.wikimedia.org/T222366) [14:57:41] 10Operations, 10serviceops, 10User-jijiki: mw2225 keeps sending cronspam for hhvm-needs-restart - https://phabricator.wikimedia.org/T236799 (10jijiki) [14:57:57] 10Operations, 10serviceops, 10User-jijiki: mw2225 keeps sending cronspam for hhvm-needs-restart - https://phabricator.wikimedia.org/T236799 (10jijiki) p:05Triage→03Low [14:58:49] !log eevans@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'echostore' for release 'staging' . [14:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:35] !log eevans@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'echostore' for release 'production' . [15:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:02] 10Operations, 10Puppet, 10Release-Engineering-Team, 10puppet-compiler: add compiler1003 to jenkins - https://phabricator.wikimedia.org/T236468 (10hashar) The instance has been added to Jenkins https://integration.wikimedia.org/ci/computer/compiler1003.puppet-diffs.eqiad.wmflabs/ Seems there are some foll... 
[15:03:18] 10Operations, 10ops-eqiad, 10DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 (10RobH) [15:04:34] !log eevans@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'echostore' for release 'production' . [15:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:26] (03PS1) 10Muehlenhoff: Sync a few gradle.properties from upstream cas-overlay-template repo [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/546961 [15:05:59] idp1001, that is new [15:06:34] (03CR) 10Ema: [C: 03+1] ATS: Track total client connections for HTTP/2 clients [puppet] - 10https://gerrit.wikimedia.org/r/546957 (https://phabricator.wikimedia.org/T236458) (owner: 10Vgutierrez) [15:06:42] don't see 'idp' on https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions :) [15:06:47] identity provider? [15:07:07] 10Operations, 10ops-eqiad, 10DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 (10jcrespo) I will repool the db host, as I do not want to leave them off for a long time. Let me know if more operations on this row are planned at a later time. [15:07:26] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Sync a few gradle.properties from upstream cas-overlay-template repo [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/546961 (owner: 10Muehlenhoff) [15:07:42] (03CR) 10Vgutierrez: [C: 03+2] ATS: Track total client connections for HTTP/2 clients [puppet] - 10https://gerrit.wikimedia.org/r/546957 (https://phabricator.wikimedia.org/T236458) (owner: 10Vgutierrez) [15:08:26] cdanis: yep, I'll add it there [15:08:47] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10serviceops: How should the MachineVision extension interact with external APIs from production?
- https://phabricator.wikimedia.org/T236797 (10Mholloway) [15:10:34] 10Operations, 10ops-eqiad, 10DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 (10RobH) I've completed all setup work from the software side for the PDU at this time. [15:12:46] (03CR) 10Bstorm: maintain-kubeusers: add ability to merge and update configs (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/545966 (https://phabricator.wikimedia.org/T236202) (owner: 10Bstorm) [15:14:21] (03PS1) 10Muehlenhoff: Revert to 6.1.0-rc4 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/546962 [15:15:11] (03PS3) 10CDanis: net_driver fact: add firmware_version [puppet] - 10https://gerrit.wikimedia.org/r/546953 (https://phabricator.wikimedia.org/T236744) [15:15:45] (03CR) 10Jbond: "mostly fine, minor optimisation" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/546953 (https://phabricator.wikimedia.org/T236744) (owner: 10CDanis) [15:15:49] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10Nuria) @CGlenn I think the way to proceed here is to make you manager acquitted with phab and procedures arround data acces... 
[15:16:14] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Revert to 6.1.0-rc4 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/546962 (owner: 10Muehlenhoff) [15:17:00] (03CR) 10Bstorm: [C: 03+1] toolforge: k8s: ingress: use docker image from internal registry [puppet] - 10https://gerrit.wikimedia.org/r/546459 (https://phabricator.wikimedia.org/T236249) (owner: 10Arturo Borrero Gonzalez) [15:17:23] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:59] (03PS4) 10CDanis: net_driver fact: add firmware_version [puppet] - 10https://gerrit.wikimedia.org/r/546953 (https://phabricator.wikimedia.org/T236744) [15:21:03] (03CR) 10jerkins-bot: [V: 04-1] net_driver fact: add firmware_version [puppet] - 10https://gerrit.wikimedia.org/r/546953 (https://phabricator.wikimedia.org/T236744) (owner: 10CDanis) [15:22:01] 10Operations, 10SRE-Access-Requests, 10Security-Team, 10Patch-For-Review: (2019-09) Create secteam groups in admin.yaml and define permissions - https://phabricator.wikimedia.org/T223463 (10sbassett) [15:22:03] (03PS5) 10CDanis: net_driver fact: add firmware_version [puppet] - 10https://gerrit.wikimedia.org/r/546953 (https://phabricator.wikimedia.org/T236744) [15:22:06] (03PS1) 10Jbond: puppetboard: switch passive puppetboard to new puppetdbs [puppet] - 10https://gerrit.wikimedia.org/r/546965 (https://phabricator.wikimedia.org/T235655) [15:23:22] (03CR) 10Daniel Kinzler: "Up for SWAT on October 30, 11:00 UTC" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546875 (https://phabricator.wikimedia.org/T198558) (owner: 10Daniel Kinzler) [15:24:21] (03CR) 10CDanis: net_driver fact: add firmware_version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/546953 (https://phabricator.wikimedia.org/T236744) (owner: 10CDanis) [15:24:25] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/546953 
(https://phabricator.wikimedia.org/T236744) (owner: 10CDanis) [15:24:51] (03PS6) 10CDanis: net_driver fact: add firmware_version [puppet] - 10https://gerrit.wikimedia.org/r/546953 (https://phabricator.wikimedia.org/T236744) [15:25:37] thanks jbond42! [15:25:45] !log restarting ats-tls on cp5007 with a default inactivity timeout of 5 minutes and half open disabled - T236458 [15:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:49] T236458: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled - https://phabricator.wikimedia.org/T236458 [15:25:49] (03CR) 10CDanis: [C: 03+2] net_driver fact: add firmware_version [puppet] - 10https://gerrit.wikimedia.org/r/546953 (https://phabricator.wikimedia.org/T236744) (owner: 10CDanis) [15:26:31] 10Operations, 10Cassandra, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review, 10User-Eevans: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471 (10WDoranWMF) @Eevans Should we move this to the Icebox? 
[15:27:23] np cdanis [15:28:09] (03PS2) 10Ottomata: releases: expand list of packages [puppet] - 10https://gerrit.wikimedia.org/r/546905 (https://phabricator.wikimedia.org/T236774) (owner: 10Hashar) [15:28:18] (03PS2) 10Ottomata: releases: remove php-xdebug [puppet] - 10https://gerrit.wikimedia.org/r/546906 (https://phabricator.wikimedia.org/T236774) (owner: 10Hashar) [15:28:27] 10Operations, 10DBA, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) [15:29:12] (03PS1) 10Joal: [WIP] Refactor profile::analytics::refinery::job::import_mediawiki_dumps [puppet] - 10https://gerrit.wikimedia.org/r/546966 (https://phabricator.wikimedia.org/T234333) [15:29:32] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/546916 (owner: 10Alexandros Kosiaris) [15:29:54] haha I found a small bug [15:30:10] on vega.codfw here's the net_driver fact: {'ens5': {'speed': -1, 'driver': 'virtio_net', 'duplex': 'unknown', 'firmware_version': 'expansion-rom-version: '}} [15:30:15] (03CR) 10Ottomata: [C: 03+2] releases: expand list of packages [puppet] - 10https://gerrit.wikimedia.org/r/546905 (https://phabricator.wikimedia.org/T236774) (owner: 10Hashar) [15:30:23] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/546909 (owner: 10Muehlenhoff) [15:30:35] (03CR) 10Ottomata: [C: 03+2] releases: remove php-xdebug [puppet] - 10https://gerrit.wikimedia.org/r/546906 (https://phabricator.wikimedia.org/T236774) (owner: 10Hashar) [15:31:11] (03PS1) 10DCausse: [cirrus] remove cross_cluster_single_shard_search quirk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546970 [15:31:13] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Refactor profile::analytics::refinery::job::import_mediawiki_dumps [puppet] - 10https://gerrit.wikimedia.org/r/546966 (https://phabricator.wikimedia.org/T234333) (owner: 10Joal) [15:31:28] (03CR) 10Jcrespo: "I will s/?/*/ + tests on the next patch- 
I have not forgotten." [puppet] - 10https://gerrit.wikimedia.org/r/546217 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [15:32:03] 10Operations, 10Release-Engineering-Team-TODO, 10Patch-For-Review, 10Release-Engineering-Team (Development services): Remove php-xdebug from releases1001 / releases2001 - https://phabricator.wikimedia.org/T236774 (10Ottomata) a:03hashar Merged. Ran `apt-get purge php-xdebug` on releases1001 an releases2... [15:32:50] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: new deployment group and access for design site - Volker Eckl, Jan Drewniak, Amir Sarabadani - https://phabricator.wikimedia.org/T236518 (10Dzahn) @Ottomata Thank you! No, i don't think it needs more discussion. But i think it needs to be amended (see... [15:33:04] 10Operations, 10Puppet, 10Release-Engineering-Team, 10puppet-compiler: add compiler1003 to jenkins - https://phabricator.wikimedia.org/T236468 (10jbond) @hashar i think if we get no more issue reports this week then we can rebuild 1001 and 1002 on Monday? 
[15:33:12] 10Operations, 10Traffic, 10Patch-For-Review: track NIC firmware version numbers across the fleet - https://phabricator.wikimedia.org/T236744 (10Ottomata) p:05Triage→03Normal a:03CDanis CDanis: assigning to you as part of clinic duty [15:33:26] 10Operations, 10Release-Engineering-Team-TODO, 10Patch-For-Review, 10Release-Engineering-Team (Development services): Remove php-xdebug from releases1001 / releases2001 - https://phabricator.wikimedia.org/T236774 (10Ottomata) p:05Triage→03Normal [15:34:01] (03PS1) 10Alexandros Kosiaris: bacula: Move logs to /var/log/bacula [puppet] - 10https://gerrit.wikimedia.org/r/546972 (https://phabricator.wikimedia.org/T236406) [15:34:28] 10Operations, 10serviceops: Reimage mwdebug1002 and mw1317 - https://phabricator.wikimedia.org/T236806 (10jijiki) [15:34:38] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10Ottomata) a:03jcrespo Jaime, assigning to you, feel free to undo or reassign if this is not correct. [15:34:43] (03PS8) 10Cwhite: profile, prometheus, role: install swagger exporter on prometheus nodes [puppet] - 10https://gerrit.wikimedia.org/r/541619 (https://phabricator.wikimedia.org/T205870) [15:34:49] (03PS1) 10CDanis: net-driver fact: tweak regexp [puppet] - 10https://gerrit.wikimedia.org/r/546973 (https://phabricator.wikimedia.org/T236744) [15:34:51] (03CR) 10Joal: "Asking for confirmation the thing looks ok before adding purge." 
[puppet] - 10https://gerrit.wikimedia.org/r/546966 (https://phabricator.wikimedia.org/T234333) (owner: 10Joal) [15:34:53] (03CR) 10jerkins-bot: [V: 04-1] bacula: Move logs to /var/log/bacula [puppet] - 10https://gerrit.wikimedia.org/r/546972 (https://phabricator.wikimedia.org/T236406) (owner: 10Alexandros Kosiaris) [15:35:00] (03PS5) 10Jhedden: ceph: add k8s manifests for ceph deployment using rook [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) [15:35:09] 10Operations, 10DBA, 10serviceops, 10Patch-For-Review: Backups on buster hosts fail to run - https://phabricator.wikimedia.org/T235838 (10Ottomata) a:03jcrespo Jaime, assigning to you, feel free to undo or reassign if this is not correct. [15:35:54] (03CR) 10CDanis: [C: 03+2] net-driver fact: tweak regexp [puppet] - 10https://gerrit.wikimedia.org/r/546973 (https://phabricator.wikimedia.org/T236744) (owner: 10CDanis) [15:35:56] (03PS2) 10Joal: [WIP] Refactor profile::analytics::refinery::job::import_mediawiki_dumps [puppet] - 10https://gerrit.wikimedia.org/r/546966 (https://phabricator.wikimedia.org/T234333) [15:36:09] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) Thanks, it is indeed correct and this just happened today (even if alex did most of the work). Not closing because it is high... [15:37:01] (03CR) 10jerkins-bot: [V: 04-1] ceph: add k8s manifests for ceph deployment using rook [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) (owner: 10Jhedden) [15:38:12] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Legacy (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10kaldari) >I guess its not clear to me what exactly the decision to kill the server side component means. A... 
[15:38:22] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Refactor profile::analytics::refinery::job::import_mediawiki_dumps [puppet] - 10https://gerrit.wikimedia.org/r/546966 (https://phabricator.wikimedia.org/T234333) (owner: 10Joal) [15:38:43] 10Operations, 10MediaWiki-REST-API, 10Parsoid-PHP, 10Traffic, and 3 others: Varnish/ATS should not decode URIs for /w/rest.php - https://phabricator.wikimedia.org/T235478 (10Ottomata) a:03ema Ema this looks done, feel free to resolve. [15:39:24] 10Operations, 10observability, 10Availability, 10Goal, 10Patch-For-Review: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (10Ottomata) a:03Ottomata Jaime, assigning to you, feel free to undo or reassign if this is not correct. [15:40:18] 10Operations, 10serviceops: Set up LVS for parsoid/PHP - https://phabricator.wikimedia.org/T233722 (10Ottomata) a:03Dzahn @dzahn, assigning to you, feel free to undo or reassign if this is not correct. [15:41:14] 10Operations, 10serviceops: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (10Ottomata) a:03Dzahn @Dzahn, assigning to you, feel free to undo or reassign if this is not correct. [15:41:39] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/546916 (owner: 10Alexandros Kosiaris) [15:42:22] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Unassigned shards in eqiad - https://phabricator.wikimedia.org/T233403 (10Ottomata) a:03Gehel @Gehel, assigning to you, feel free to undo or reassign if this is not correct. 
[15:43:09] (03CR) 10Jbond: [C: 03+2] puppetboard: switch passive puppetboard to new puppetdbs [puppet] - 10https://gerrit.wikimedia.org/r/546965 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [15:43:41] 10Operations, 10Release Pipeline, 10serviceops, 10Goal, 10Release-Engineering-Team (Pipeline): Self-service Deployment Pipeline - https://phabricator.wikimedia.org/T228676 (10Ottomata) a:03akosiaris @akosiaris, assigning to you, feel free to undo or reassign if this is not correct. [15:43:43] (03PS3) 10Muehlenhoff: Also pass down mfa-additional-method and mfa-force-method attributes [puppet] - 10https://gerrit.wikimedia.org/r/546909 [15:44:44] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Unassigned shards in eqiad - https://phabricator.wikimedia.org/T233403 (10Gehel) p:05High→03Normal Since T230746 is almost there, we won't resize shards just now. Lowering priority. [15:46:13] 10Operations, 10DBA, 10serviceops, 10Patch-For-Review: Backups on buster hosts fail to run - https://phabricator.wikimedia.org/T235838 (10jcrespo) p:05High→03Low We believe this is fixed after T236406. Keeping it open until all hosts run at least once. [15:47:01] (03CR) 10Muehlenhoff: [C: 03+2] Also pass down mfa-additional-method and mfa-force-method attributes [puppet] - 10https://gerrit.wikimedia.org/r/546909 (owner: 10Muehlenhoff) [15:47:21] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/19121/" [puppet] - 10https://gerrit.wikimedia.org/r/546956 (https://phabricator.wikimedia.org/T222366) (owner: 10Filippo Giunchedi) [15:49:24] 10Operations, 10Core Platform Team, 10MediaWiki-extensions-CentralAuth, 10TimedMediaHandler, and 5 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10Ottomata) It sounds like this particular problem is fixed.... 
[15:49:31] PROBLEM - Check systemd state on labstore1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:21] (03CR) 10CDanis: [C: 03+1] swift: default to multiple object servers per port [puppet] - 10https://gerrit.wikimedia.org/r/546956 (https://phabricator.wikimedia.org/T222366) (owner: 10Filippo Giunchedi) [15:51:50] (03CR) 10CDanis: "ping :)" [puppet] - 10https://gerrit.wikimedia.org/r/546216 (owner: 10CDanis) [15:57:40] RECOVERY - Check systemd state on labstore1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:50] !log restarted ferm on labstore1006 -- it failed an external DNS lookup due to brief issues apparently on the other end [15:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:57] (03CR) 10Jbond: "Sorry i reviewed this but seems it never got sent. Anyway mostly look good to me but you said to be picky :)" (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/546216 (owner: 10CDanis) [16:00:04] godog and _joe_: Dear deployers, time to do the Puppet SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191029T1600). [16:00:05] bawolff: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:13] 10Operations, 10MediaWiki-REST-API, 10Parsoid-PHP, 10Traffic, and 3 others: Varnish/ATS should not decode URIs for /w/rest.php - https://phabricator.wikimedia.org/T235478 (10eprodromou) 05Open→03Resolved Tested; it's done [16:00:43] Pchelolo: regarding the SSL cert for parsoid-php. i can confirm it is in the SANs on wtp1025 just like on wtp1026. It was likely not updated because puppet was disabled. 
[16:01:23] i can take the puppet swat change [16:01:31] (03CR) 10Jbond: base: shared definitions for port numbers in /etc/services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/546216 (owner: 10CDanis) [16:02:33] mutante: oh, thank you. we're not really blocked by that right now, so it's not urgent. [16:02:41] but it would be neat to fix [16:04:15] Pchelolo: it should be fixed but let me know where you saw it [16:04:58] Pchelolo: also the error on deployment about restarts should hopefully be gone [16:04:58] mutante: I still can see it: 'curl https://parsoid-php.discovery.wmnet' from restbase1017 [16:05:27] Pchelolo: ok, hmm. on it [16:06:09] * bawolff_ waves [16:06:30] bawolff: oh yea.. i will merge it [16:06:36] Thanks :) [16:06:40] (03CR) 10Dzahn: [C: 03+2] Fix up CSP headers for doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/546378 (https://phabricator.wikimedia.org/T213223) (owner: 10Brian Wolff) [16:07:18] I'll probably have another one later this week or next week once I'm sure it doesn't break anything, to turn on enforce mode [16:07:19] <_joe_> bawolff: ah sorry I forgot about the DST confusion week [16:08:01] All of my clocks auto-change DST, I totally didn't realize it was happening until suddenly it started getting dark out really early [16:08:25] <_joe_> bawolff: yeah I just assumed puppet-swat will happen at 6 pm as usual [16:08:41] bawolff: deployed on doc1001 [16:09:07] <_joe_> Pchelolo: can I get the error you see pasted somewhere? 
[16:09:24] _joe_: "curl: (51) SSL: no alternative certificate subject name matches target host name 'parsoid-php.discovery.wmnet'" [16:09:27] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [16:09:28] Confirmed seeing new policy [16:09:43] <_joe_> wow [16:09:48] <_joe_> that's a big spike on api [16:10:07] i see the SAN error. confirmed. but the certs in the repo have it. i'll check why that is [16:10:50] And at first glance it looks like the new policy is working better than the old. Thanks everyone :D [16:11:01] (03PS1) 10Reedy: Enable WebAuthn on all beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546975 [16:11:03] bawolff: great [16:11:45] jouncebot: now [16:11:45] For the next 0 hour(s) and 48 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191029T1600) [16:12:39] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:18:11] <_joe_> mutante: you didn't reload nginx on the wtp* hosts [16:18:16] all the wtp hosts in eqiad have /etc/ssl/localcerts/parsoid.svc.eqiad.wmnet.crt with DNS:parsoid-php.discovery.wmnet in it. 
all the wtp hosts in codfw do not have the cert at all [16:18:17] <_joe_> that's why you don't see the new cert [16:18:34] i was going to say ..must be missing restart because it looks fine on disk on all ..via cumin [16:18:46] <_joe_> reload, not restart [16:19:01] _joe_: ack [16:19:02] i was about to deploy parsoid now .. should i wait? [16:20:08] mutante, _joe_ ^ [16:20:15] subbu: 1 minute [16:20:19] ok [16:20:22] <_joe_> yeah around 1 minute :D [16:20:39] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add parsoid-php to the discovery records to switchover [cookbooks] - 10https://gerrit.wikimedia.org/r/545167 (owner: 10Giuseppe Lavagetto) [16:20:43] (03PS2) 10Giuseppe Lavagetto: Add parsoid-php to the discovery records to switchover [cookbooks] - 10https://gerrit.wikimedia.org/r/545167 [16:20:52] !log reloading nginx on wtp* [16:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:19] <_joe_> Pchelolo: can you retry now? I'm pretty sure things will be fixed [16:21:32] curl from restbase1017 now working [16:21:37] _joe_: perfecto! [16:21:48] <_joe_> ok great [16:21:50] subbu: also let's see the restarts work i hope :) [16:21:52] <_joe_> subbu: green light [16:22:15] k [16:22:22] !log ssastry@deploy1001 Started deploy [parsoid/deploy@d932d6a]: Update parsoid to 089bf28d [16:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:40] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-upgrade [16:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:47] I'll update restbase to point to the discovery endpoint instead of eqiad now [16:23:07] [restbase1017:~] $ curl https://parsoid-php.discovery.wmnet [16:23:08] mutante, no, doesn't work.
[16:23:18] 2019-10-29 16:22:45,215 [ERROR] Executing command systemctl restart php7.2-fpm.service failed: Command '['systemctl', 'restart', 'php7.2-fpm.service']' returned non-zero exit status 1 [16:23:18] 2019-10-29 16:22:45,216 [WARNING] Service restart failed. NOT repooling [16:23:22] <_joe_> Pchelolo: sweet [16:23:53] subbu: hrmm, ok, looks like there was a second issue [16:24:01] <_joe_> uhm [16:24:11] <_joe_> no it does not [16:24:13] ok .. do you want to see the full stack race? [16:24:25] not stack trace .. but the error output. [16:24:48] <_joe_> Oct 29 11:34:45 wtp1025 sudo: mwdeploy : TTY=unknown ; PWD=/var/lib/mwdeploy ; USER=root ; COMMAND=/usr/local/sbin/check-and-restart-php php7.2-fpm 100 [16:24:52] <_joe_> I see this [16:24:54] <_joe_> this morning [16:25:01] <_joe_> but nothing now on wtp1025 [16:25:15] <_joe_> subbu: where did you get the error from? [16:25:25] one sec. [16:25:29] from scap [16:25:38] <_joe_> yeah I mean which server returned the error [16:25:52] https://gist.githubusercontent.com/subbuss/7298ea2cf4fe620499e238bc4ec84fc6/raw/7193025098a77b2370296afc596782966e69f278/gistfile1.txt [16:25:57] all canaries [16:26:20] <_joe_> ok so [16:26:46] <_joe_> mutante: we need to repool the servers that were depooled, can you take care of that? [16:27:18] <_joe_> ok I think I have a couple fixes in mind [16:27:25] should i roll back the deploy for now? [16:27:31] <_joe_> subbu: no reason [16:27:39] <_joe_> subbu: actually, yes [16:27:40] ok. so, continue to all hosts? [16:27:41] <_joe_> sorry [16:27:52] <_joe_> gimme 5 mins and I'll have another patch for you [16:28:00] ok .. so, for now, rollback? [16:28:10] <_joe_> yes [16:28:12] k [16:28:18] <_joe_> and we re-deploy with the subsequent patch [16:28:26] sounds good. 
[16:28:33] !log ssastry@deploy1001 Finished deploy [parsoid/deploy@d932d6a]: Update parsoid to 089bf28d (duration: 06m 11s) [16:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:02] ^^ scap should say: rolled back to ... instead of saying it finished deploy to the new version. [16:29:05] <_joe_> mutante: can you repool the hosts that were depooled [16:30:03] _joe_: ok! [16:30:40] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1026.eqiad.wmnet,service=parsoid-php [16:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:48] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1025.eqiad.wmnet,service=parsoid-php [16:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:09] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2001.codfw.wmnet,service=parsoid-php [16:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:19] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2002.codfw.wmnet,service=parsoid-php [16:31:21] so i tried to already debug this yesterday and i merged the change to sudoer privilege.. [16:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:23] <_joe_> mutante: this is the actual fix https://gerrit.wikimedia.org/r/#/c/mediawiki/services/parsoid/deploy/+/546976 [16:32:24] and that restart should be "user deploy-service running it as root" unless i get it wrong [16:32:37] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:32:50] _joe_: ah! there's the missing sudo. 
gotcha [16:33:48] _joe_, should i +2 https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/deploy/+/546976 ? [16:33:50] My watchlist is showing " [16:33:50] Due to high database server lag, changes newer than 92 seconds may not appear in this list. [16:33:50] " [16:33:51] everything is pooled in pybal [16:35:03] <_joe_> AntiComposite: we're looking into it [16:35:08] <_joe_> AntiComposite: which wiki? [16:35:20] enwiki, not commons [16:35:42] <_joe_> subbu: I guess so yes :) [16:35:46] <_joe_> and then retry a deploy [16:35:49] 10Operations: Upgrade to 6.1.0 - https://phabricator.wikimedia.org/T236815 (10MoritzMuehlenhoff) [16:35:51] lots of errors on 10.64.32.222:3311 [16:36:11] _joe_, got it. [16:36:32] db1105:3311 is lagging, probably due to load [16:36:52] <_joe_> subbu: lmk when you deploy again [16:36:58] 10Operations, 10Traffic, 10CPT Initiatives (Core REST API in PHP), 10Core Platform Team Workboards (Green), 10Patch-For-Review: Implement basic routing for rest.php - https://phabricator.wikimedia.org/T235779 (10eprodromou) 05Open→03Resolved a:03eprodromou [16:37:18] ya .. wiating on jenkins to merge. 
[16:37:23] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:38:28] <_joe_> jynus: lmk if I can help [16:38:49] well, we are also in reduced redundancy because db1099 was stopped [16:38:56] so I cannot pool it in [16:39:35] I can pool another server to see if that helps [16:39:58] !log gehel@cumin2001 END (ERROR) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=97) [16:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:35] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:40:52] <_joe_> jynus: can I help in any way? [16:40:56] 10Operations, 10LDAP-Access-Requests: Add bawolff to either NDA or WMF ldap group - https://phabricator.wikimedia.org/T236636 (10Bawolff) >>! In T236636#5615394, @Ottomata wrote: > Ah, sorry @bawolff, I missed that you were no longer staff. Can you update your Phab and MW profile pages accordingly? > > I thi... [16:42:01] cdanis, _joe_ I have an error Group "contributions" is not configured in section "s1 [16:42:10] for dbctl instance db1106 pool --section s1 --group contributions [16:42:33] _joe_, ok, going to deploy again now. 
/cc mutante [16:42:42] <_joe_> jynus: uhm lemme look at why [16:43:06] <_joe_> that's pretty strange [16:43:06] !log ssastry@deploy1001 Started deploy [parsoid/deploy@aa59ce3]: Update parsoid to 089bf28d [16:43:06] jynus: ah, i think you need to do dbctl instance db1106 edit [16:43:08] and add it [16:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:12] ok [16:43:13] <_joe_> oh ok [16:43:35] that is not fast [16:43:38] <_joe_> right, we might want to make the error message clearer [16:43:48] I have to add it to 4 groups [16:43:50] both the UI and the error message [16:43:53] ah :| [16:43:54] subbu: ok, let me know if the error is gone [16:43:56] yeah that could be a lot better [16:44:08] <_joe_> jynus: ftr, you can do them all at once [16:44:13] <_joe_> since it's an editor [16:44:16] _joe_, mutante yes, worked this time. [16:44:22] <_joe_> subbu: yeah I just saw [16:44:24] <_joe_> ok [16:44:31] <_joe_> can you check the code is updated too? [16:44:32] subbu: :) cool. closing https://phabricator.wikimedia.org/T236275 ? [16:44:47] we need it for beta cluster still. [16:44:54] not yet sure it works there. [16:44:58] will retry later today. [16:45:04] <_joe_> there it should be a different check [16:45:20] <_joe_> jynus: if you need a CR of your edit, I'm happy to [16:45:27] PROBLEM - MariaDB Slave Lag: s1 #page on db1105 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [16:45:49] is that for the host back from maintenance? [16:46:03] right. 
beta has the separate "file not found" issue (https://phabricator.wikimedia.org/T236275#5612915) [16:46:16] no [16:46:26] (03CR) 10EBernhardson: [C: 03+1] [cirrus] remove cross_cluster_single_shard_search quirk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546970 (owner: 10DCausse) [16:46:29] it is the pooled host that has issues [16:46:32] the other is depooled [16:46:41] !log jynus@cumin1001 dbctl commit (dc=all): 'pool db1106 into s1 rcs', diff saved to https://phabricator.wikimedia.org/P9497 and previous config saved to /var/cache/conftool/dbconfig/20191029-164640-jynus.json [16:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:05] let's see if that fixes it or just moves the problem [16:47:14] <_joe_> you now have both pooled [16:47:26] <_joe_> so the mw db load balancer should pick the one without lag, right? [16:47:27] yeah, but the other is depooled synamically [16:47:28] on code [16:47:30] yeah [16:47:41] rc seems back [16:47:47] we will see if tmp only [16:47:56] <_joe_> so if it heals, we can hope both will be available [16:48:14] or we win time until db1099, the maintenance one comes back [16:48:30] if it heals, we depool the damaged one [16:48:41] RECOVERY - MariaDB Slave Lag: s1 #page on db1105 is OK: OK slave_sql_lag Replication lag: 11.55 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [16:48:41] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:49:19] things look better, but we will see [16:49:26] if it is oveload, it will be temporary [16:49:39] if it is host failure, it will be fixed [16:49:44] <_joe_> it's a bit strange, being db1105 multi-instance [16:49:52] <_joe_> if it was a server issue [16:50:08] "server" as in [16:50:19] only for that instance, could be anything really [16:50:20] <_joe_> as in physical host [16:50:27] yeah, I got you [16:50:31] (03CR) 10Elukey: [WIP] Refactor profile::analytics::refinery::job::import_mediawiki_dumps (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/546966 (https://phabricator.wikimedia.org/T234333) (owner: 10Joal) [16:50:33] I was explaining what I meant [16:50:49] <_joe_> so I would guess it's something specific, even some query-induced trouble [16:50:55] so db1105 keeps lagging after being dynamicayy repooled [16:51:05] let's see db1106 [16:51:21] <_joe_> tendril says lag is 0 [16:51:30] yeah, it seems fine [16:51:39] db1099:3311 was in these groups, do we think that just depooling it resulted in too much load on db1105? 
[16:51:40] so I will depool db1105 [16:51:50] oh i see [16:52:01] cdanis: no, see that db1106 was completely cols [16:52:03] <_joe_> cdanis: no, our working hypothesis is that db1105's instance of s1 has $something wrong [16:52:03] cold [16:52:08] and it maanged well [16:52:11] got it [16:52:12] and without partitioning [16:52:15] <_joe_> maybe some statistics are garbled up [16:52:17] early to say [16:52:23] <_joe_> so bad query plans [16:52:26] <_joe_> or anything really [16:52:31] don't know yet, no analysis [16:52:41] will fix ongoing issue first to go to a stabler status [16:52:41] !log ssastry@deploy1001 Finished deploy [parsoid/deploy@aa59ce3]: Update parsoid to 089bf28d (duration: 09m 35s) [16:52:43] <_joe_> yeah I was exemplifying the kind of issues [16:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:47] later analysys [16:52:49] yep [16:52:59] will depool db1105 [16:52:59] <_joe_> sorry it wasn't meant to be distracting you jynus [16:53:04] <_joe_> +1 [16:53:09] not you are not, just clarifying [16:53:11] subbu: looking at deployment-prep. deployment-mediawiki-parsoid10. it _does_ have the puppet roles that includes the php::restarts class, so it's not that. [16:53:13] thanks for the help [16:53:23] being alone, it is helpful you have you around [16:53:49] depooling the other host so it doesn't flop dynamically [16:54:16] <_joe_> mutante: the problem is there is no conftool so some scripts don't exist [16:54:25] <_joe_> mutante: just modify the check in beta [16:54:42] <_joe_> to do 'sudo systemctl restart php7.2-fpm' [16:54:55] _joe_: alright! [16:55:32] db1099 looks to be replicating fairly quickly at least [16:55:47] <_joe_> it's depooled though [16:55:58] <_joe_> db1105:3311 did replicate well while depooled :) [16:55:59] yeah, it happened just when we had no redundancy [16:56:12] that is bad luch [16:56:15] *luck [16:56:16] but when db1099 is caught up, we could repool it, right? 
[16:56:19] yes [16:56:35] !log jynus@cumin1001 dbctl commit (dc=all): 'depool fully db1105:3311, stability/lag issues', diff saved to https://phabricator.wikimedia.org/P9498 and previous config saved to /var/cache/conftool/dbconfig/20191029-165633-jynus.json [16:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:52] db1099 lag 0 [16:57:03] did db1105:s1 recovered its lag while depoiled? [16:57:08] I didn't see the message [16:57:13] db1105 lag 5 [16:57:19] lag 0 now [16:57:28] interesting [16:57:31] <_joe_> jynus: it did earlier when you pooled db1105 [16:57:35] <_joe_> err 06 [16:57:46] <_joe_> then it got probably repooled by the mw load balancer [16:57:50] <_joe_> and started lagging again [16:57:56] but if it flaps between 0-5 it means issues [16:58:08] it should always have 0-1 even under high stress [16:58:18] as in, it normally can handle that [16:58:31] 10Operations, 10Cassandra, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review, 10User-Eevans: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471 (10Eevans) >>! In T92471#5615626, @WDoranWMF wrote: > @Eevans Should we move this to the Icebox? ya. [16:58:38] cdanis: let me check db1099 [16:58:44] we normally pool it slowly [16:58:47] after a restart [16:58:57] but in this case it may make sense to be less gnetler [16:59:11] <_joe_> jynus: I see a high number of threads on that instance btw, and also innodb page IO [16:59:28] on db1099 or db1105? [16:59:28] <_joe_> a huge spike at the start of the troubles [16:59:32] 10Operations, 10LDAP-Access-Requests: Add bawolff to either NDA or WMF ldap group - https://phabricator.wikimedia.org/T236636 (10RStallman-legalteam) @Ottomata thanks for checking in. It appears that Brian is on a contract through 2/28/20 and signed a confidentiality agreement through T&C that should cover thi... 
[16:59:32] <_joe_> db1105 [16:59:35] maybe hw issues [16:59:45] but it also has s8 [16:59:52] which has an ever higher write rate [16:59:53] strange [17:00:02] <_joe_> lemme see how s8 is doing there [17:00:04] cscott, arlolra, subbu, halfak, accraze, and mdholloway: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191029T1700). [17:00:11] <_joe_> yeah s8 has no troubles [17:00:16] oh ,sorry [17:00:22] that would be db1099 [17:00:27] <_joe_> lol [17:00:29] <_joe_> ok [17:00:32] it should have another section [17:00:37] let me see which one [17:00:52] <_joe_> s2 [17:00:54] s2 [17:00:59] which is lower write rate [17:01:03] specially during my day [17:01:06] <_joe_> ok [17:01:32] so it coulds still be hw, but with a tipping point [17:01:45] we may want to depool it fully for investigation [17:02:01] also, db1099 being depooled wouldn't help [17:02:15] but I think it would only be a factor, and not a root cause [17:02:16] <_joe_> yeah let's focus on repooling db1099 if possible for now [17:02:27] <_joe_> yeah I concur that's probable [17:02:29] we over provision those host quite a lot [17:02:42] checking db1099 buffer status [17:02:42] <_joe_> but it can wait for us to restore redundancy on rc in s1 [17:02:44] no s1 repl lag on any instance since 16:59 btw [17:02:48] <_joe_> or even tomorrow morning [17:03:05] possible confounding factor, the large s8 write load stopped at that time as well [17:03:20] write load? [17:03:42] mmmmh... 
wikidate enters the fight :-D [17:03:52] s8 rows written was 5k wps before 16:22, and then was 15k-20k wps from 16:24-17:00 [17:04:02] and now is back to 'normal' ish, about 7.5k [17:04:07] yeah, not the first time those spikes happen [17:04:35] ok, I am going to repool fully db1099 on both instances [17:04:40] then reevaluate [17:04:43] +1 [17:05:28] <_joe_> +1 [17:05:31] <_joe_> :) [17:06:49] <_joe_> jynus: I would step afk now as I have meetings later, but if you need support I can stay around [17:06:50] !log jynus@cumin1001 dbctl commit (dc=all): 'repool db1099 both instances fully to increase redundancy', diff saved to https://phabricator.wikimedia.org/P9499 and previous config saved to /var/cache/conftool/dbconfig/20191029-170648-jynus.json [17:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:55] thanks [17:07:00] if cdanis works now [17:07:07] that would be enough support [17:07:08] I am around [17:07:15] <_joe_> if he doesn't, we can try to fix him [17:07:19] just in case I have more dbctl questions [17:07:25] <_joe_> :P [17:07:39] I am not really that familiar with it, manuel has been doing most of the work with that [17:08:39] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:09:44] yeah, I pooled a bit too quickly [17:09:44] db1099 briefly spiked to 4 lag on s1, but recovered quickly [17:09:51] it got 300 errors [17:09:59] <_joe_> that is consistent with a cold cache I guess [17:10:12] yeah, that is why we normally do a slower ramp up [17:10:26] even if the cache was alredy full [17:10:39] (03CR) 10Herron: "This LGTM for monitoring message send times from the mailman software itself." 
[puppet] - 10https://gerrit.wikimedia.org/r/546290 (https://phabricator.wikimedia.org/T236505) (owner: 10Cwhite) [17:10:48] <_joe_> jynus: on the positive side, if you had to deploy with scap, it would have taken years with zuul lagging behind :P [17:10:55] lol [17:11:33] there are things that I found myself slow with, some due to not being very familiar with the workflow [17:11:47] it should be easier to add sections to a host [17:11:48] some because are different than the 'ol times [17:11:54] would be easy to write a quick command for that [17:12:11] <_joe_> cdanis: it was an explicit request to make it impossible to add groups by accident btw [17:12:13] I was caough by surprise [17:12:20] _joe_: ofc it should be more explicit [17:12:32] because I didn't realize they needed to be in berore chaniging the pool weight [17:12:33] <_joe_> so either a confirmation [17:12:41] the error message on pool --section should be better, and there should be a.. addtosection command [17:12:49] <_joe_> or a better error message [17:12:49] something like that [17:12:50] and also happens that the rc group is in realy u like 5 groups [17:12:51] <_joe_> yeah [17:12:56] which will soon disappear [17:13:10] and just be server by general load [17:13:16] jynus: yeah, we had talked about group aliases / groups-of-groups but we decided agianst it [17:13:20] so lot of bad things happen at the same time [17:13:28] no, the only reason to do that would be rcs [17:13:35] and that will go away soon, I think [17:13:42] in fact, as a positive note [17:14:05] 6 month ago, if this had happened, serving db traffic with not special replicas [17:14:10] would have created lots of errors [17:14:22] this just confirmer those special groups are no longer needed [17:14:44] and this is the first large scale production test [17:14:48] proving that [17:15:14] cdanis: you can imagine how bad is it to maintain a special partitioning just on a few hosts [17:15:30] we just proved it was finally solved! 
[17:15:43] yeah that seems not fun :) [17:16:07] not only on pool/depools, on schema changes, agh [17:16:21] if things are stable [17:16:23] s1 seems happy still [17:16:25] I will check what actually happen [17:16:27] db1099 seems happy [17:17:08] (CR) Cwhite: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/546290 (https://phabricator.wikimedia.org/T236505) (owner: Cwhite) [17:17:29] mediawiki fatals look fine, there is a low background rate of timeouts [17:17:30] not sure if this is interesting, feel free to tell me off: https://tendril.wikimedia.org/report/slow_queries?host=^db&user=wikiuser&schema=wik&hours=1 [17:17:43] and the timeouts are in things like deep Parser invocations [17:17:45] but this slow query pattern fits perfectly whith the issue [17:17:55] yes, it does [17:21:48] <_joe_> cdanis: that's expected [17:21:57] _joe_: yep :) [17:22:07] <_joe_> some revisions fail to render within 60s [17:23:02] (PS13) Cwhite: lvs, prometheus, profile: add swagger exporter jobs [puppet] - https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [17:24:06] (CR) Bstorm: "Daimona is this ready to go at this point? That is, should I do local testing to see how it plays with the scripts?" [puppet] - https://gerrit.wikimedia.org/r/498773 (https://phabricator.wikimedia.org/T123978) (owner: Meshvogel) [17:24:30] (PS14) Cwhite: lvs, prometheus, profile: add swagger exporter jobs [puppet] - https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [17:24:46] (PS2) Dzahn: admins: create new deploy group for design, add 3 users [puppet] - https://gerrit.wikimedia.org/r/546303 (https://phabricator.wikimedia.org/T236518) [17:25:19] (CR) Daimona Eaytoy: [C: +1] "> Daimona is this ready to go at this point?
That is, should I do" [puppet] - https://gerrit.wikimedia.org/r/498773 (https://phabricator.wikimedia.org/T123978) (owner: Meshvogel) [17:26:21] (PS15) Cwhite: lvs, prometheus, profile: add swagger exporter jobs [puppet] - https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [17:28:36] (CR) jerkins-bot: [V: -1] lvs, prometheus, profile: add swagger exporter jobs [puppet] - https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite) [17:31:51] (PS16) Cwhite: lvs, prometheus, profile: add swagger exporter jobs [puppet] - https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [17:32:09] Operations, Release-Engineering-Team-TODO, Release-Engineering-Team (Development services): Remove php-xdebug from releases1001 / releases2001 - https://phabricator.wikimedia.org/T236774 (hashar) Open→Resolved Awesome thank and it is indeed totally gone. Thank you for the manual cleanup! Als... [17:33:52] (CR) jerkins-bot: [V: -1] lvs, prometheus, profile: add swagger exporter jobs [puppet] - https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite) [17:34:40] !log phab2001 - upgrading PHP packages [17:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:21] (PS17) Cwhite: lvs, prometheus, profile: add swagger exporter jobs [puppet] - https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [17:35:54] (PS1) Hashar: zuul: change gear queue monitoring threshold [puppet] - https://gerrit.wikimedia.org/r/546989 [17:37:29] (CR) jerkins-bot: [V: -1] lvs, prometheus, profile: add swagger exporter jobs [puppet] - https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite) [17:37:42] (CR) Hashar: "I don't who knows best how to configure those monitoring probe.
I have no idea why I had set it up at 30% previously, so it is really just" [puppet] - https://gerrit.wikimedia.org/r/546989 (owner: Hashar) [17:38:02] !log phab1001 - upgrading php7.3 packages [17:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:40] (PS18) Cwhite: lvs, prometheus, profile: add swagger exporter jobs [puppet] - https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [17:40:49] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:42:16] !log cutting branch for 1.35.0-wmf.4 [17:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:50] (CR) jerkins-bot: [V: -1] lvs, prometheus, profile: add swagger exporter jobs [puppet] - https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite) [17:45:25] Operations: Upgrade CAS to 6.1.0 - https://phabricator.wikimedia.org/T236815 (Dzahn) [17:46:00] Operations, serviceops: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (Dzahn) Open→Resolved [17:46:10] Operations, Puppet, Release-Engineering-Team, puppet-compiler: add compiler1003 to jenkins - https://phabricator.wikimedia.org/T236468 (hashar) Seems so yes :-] [17:48:39] (PS19) Cwhite: lvs, prometheus, profile: add swagger exporter jobs [puppet] - https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [17:56:17] (PS3) Dzahn: site/install_server: decom ununpentium [puppet] - https://gerrit.wikimedia.org/r/544080 (https://phabricator.wikimedia.org/T180641) [17:58:14] phab admin needed https://phabricator.wikimedia.org/p/LinkOsorio/ [17:59:56] (PS1) Cwhite: profile: get exim metrics from
lists [puppet] - https://gerrit.wikimedia.org/r/546992 (https://phabricator.wikimedia.org/T236505) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191029T1800) [18:01:08] twentyafterfour: could you or someone disable https://phabricator.wikimedia.org/p/LinkOsorio/ [18:01:42] sorry to ping you, not sure where to report this [18:01:57] musikanimal: what's up? are they spamming? [18:02:01] vandalism [18:02:12] it's the highway vandal, a long-term abuser [18:02:22] first time I've seen them on phabricator [18:02:40] https://phabricator.wikimedia.org/T236825 says "eolgi", that's what gives it away [18:02:59] musikanimal: disabled [18:03:03] I'll get a steward to lock https://www.mediawiki.org/wiki/Special:CentralAuth/Link_Smurf.76 [18:03:05] thanks! [18:03:38] musikanimal: i see https://phabricator.wikimedia.org/feed/query/LzfyADOE4aQQ/#R [18:04:33] yeah I only noticed this because they assigned that task to me [18:04:46] I just blocked a slew of their socks on enwiki [18:04:52] ack [18:05:57] (PS1) Arturo Borrero Gonzalez: toolforge: k8s: introduce switch for etcd role [puppet] - https://gerrit.wikimedia.org/r/546995 (https://phabricator.wikimedia.org/T236826) [18:07:37] !log ppchelko@deploy1001 Started deploy [restbase/deploy@cf80130]: Mirror 10% of /page/html/ traffic to Parsoid/PHP T235902 [18:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:42] T235902: Tracking: Shadow Parsoid/PHP deployment to production cluster to handle mirrored reparse traffic - https://phabricator.wikimedia.org/T235902 [18:07:50] (CR) Bstorm: "The thing we have to be careful of here is that the clusters must be separate and distinct.
I think this will fill in /etc/default/etcd wi" [puppet] - https://gerrit.wikimedia.org/r/546995 (https://phabricator.wikimedia.org/T236826) (owner: Arturo Borrero Gonzalez) [18:09:49] Operations, Wikimedia-Mailing-lists: Create mailing list for Wikidebate project - https://phabricator.wikimedia.org/T236829 (Sophivorus) [18:10:19] (CR) Bstorm: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/546995 (https://phabricator.wikimedia.org/T236826) (owner: Arturo Borrero Gonzalez) [18:10:53] Operations, Wikimedia-Mailing-lists: Create mailing list for Wikidebate project - https://phabricator.wikimedia.org/T236829 (Sophivorus) [18:14:33] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:16:02] (CR) Herron: "> Agreed, exim metrics would be good as well, but I think is a" [puppet] - https://gerrit.wikimedia.org/r/546290 (https://phabricator.wikimedia.org/T236505) (owner: Cwhite) [18:17:46] Pchelolo, looking good so far. [18:18:17] subbu: yeah. in RB logs the only new thing is "Missing Content-Language or Vary header in pb.body.html.headers" warning but I guess it's known? [18:18:37] (CR) Herron: [C: +2] "thanks for this!" [debs/prometheus-ipsec-exporter] - https://gerrit.wikimedia.org/r/546922 (owner: Hashar) [18:19:21] Pchelolo, is it one-off or a lot of those?
[18:19:26] <_joe_> i see a lot of 412 [18:19:30] (03CR) 10Herron: [V: 03+2] Merge tag 'v0.3.1' [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/546934 (owner: 10Hashar) [18:19:36] (03CR) 10Herron: [V: 03+2 C: 03+2] Merge tag 'v0.3.1' [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/546934 (owner: 10Hashar) [18:19:37] <_joe_> https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=parsoid&var-method=GET&var-code=200&from=now-30m&to=now [18:19:46] subbu: 46 so far [18:20:21] ok .. arlo says they shouldn't be there .. but we'll look at that. [18:20:28] <_joe_> subbu: when does parsoid/php responds with 412? [18:20:31] can you file a phab task with any ohter info you have? including titles, etc. [18:20:35] arlo is looking. [18:20:54] <_joe_> I see a lot of such responses here https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=22&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=parsoid&var-method=GET&var-code=200&from=now-30m&to=now [18:20:59] subbu: will file a task [18:21:04] ty [18:21:05] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10WDoranWMF) [18:21:29] <_joe_> precondition failed seems quite strange as a response [18:21:49] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@cf80130]: Mirror 10% of /page/html/ traffic to Parsoid/PHP T235902 (duration: 14m 13s) [18:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:55] T235902: Tracking: Shadow Parsoid/PHP deployment to production cluster to handle mirrored reparse traffic - https://phabricator.wikimedia.org/T235902 [18:21:55] <_joe_> oh maybe being read-only? 
[18:22:01] <_joe_> subbu: ^^ [18:22:10] also, the rate doesn't seem to be 10% https://grafana.wikimedia.org/d/000000048/parsoid-timing-wt2html?orgId=1&panelId=37&fullscreen&from=now-30m&to=now [18:22:29] (03CR) 10Revi: "@Masumrezarock100, (Urbanecm says) your CR+1 has reset WIP status of this patch and thus accidental merge of this patch, as you can see ab" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546780 (https://phabricator.wikimedia.org/T236752) (owner: 10Revi) [18:23:20] _joe_, Pchelolo arlo says there is no http 412 in parsoid codebase but there is one in core's rest handlers. so, we should look there. [18:25:00] <_joe_> subbu: lemme look into it [18:25:08] ok. [18:25:28] <_joe_> what is the task? [18:26:00] Pchelolo, yes .. it is a result of Parsoid/PHP being faster and changeprop restbase requests being a factor of response times. [18:26:49] <_joe_> subbu: the cpu barely noticed [18:26:53] subbu: no... that shouldn't be the case... we're mirroring traffic in restbase, not change-prop.. so it shouldn't matter [18:27:01] <_joe_> http://ca.wikipedia.org/w/rest.php/ca.wikipedia.org/v3/page/pagebundle/Vella_Gu\xc3\xa0rdia_(FET_y_de_las_JONS)/20962625 this is a request where we get a 412 [18:27:14] _joe_, ya .. small change in cpu / load. [18:27:28] <_joe_> but there are also ones with no special character [18:27:41] <_joe_> like http://it.wikipedia.org/w/rest.php/it.wikipedia.org/v3/page/pagebundle/Quinto_Servilio_Pudente/82702070 [18:28:08] hmm... i guess we need to see what restbase is posting to parsoid/php for those. [18:28:17] that is tripping the core rest handler [18:28:35] <_joe_> it seems pretty straightforward to me [18:29:22] Pchelolo, i see. reg. traffic rates. [18:29:28] <_joe_> there are a lot of OOMs though [18:29:44] <_joe_> well not so many [18:34:10] _joe_, ya .. see those in the parsoid-php kibana dashboard. 
[18:34:20] will have to see if our resource limits code is kicking in for those or not [18:34:45] Pchelolo, going back to your qn, maybe the mirroring code needs a fix? [18:34:51] <_joe_> subbu: we can raise the memory limit just on the parsoid cluster pretty easily [18:35:01] <_joe_> (also sorry in a meeting) [18:35:15] _joe_, no rush .. this is a test deploy .. so, nothing that needs fixing right away. [18:35:53] (03PS1) 10Brennen Bearnes: Group0 to 1.35.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547001 [18:36:58] subbu: by just looking at the code it seems correct.. Gimme a sec, I have an idea about what to check [18:37:11] k [18:42:38] subbu: ok I understand [18:43:47] we're mirroring all incoming traffic, both updates and normal traffic. but parsoid-php cassandra store is empty, thus 100% of mirrored requests to php are hitting parsoid, while for JS it's only update requests [18:44:16] thus the actual traffic to parsoid php is more than 10% of traffic to parsoid js [18:45:03] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:46:00] Pchelolo, ah .. ok. [18:46:24] and since you never store Parsoid/PHP output, this will not change. [18:46:50] but, at least good to know that even with 40 reqs/s, the cpu on eqiad barely bumps up.
[18:47:27] Pchelolo, but, this needs fixing before we mirror more traffic [18:48:18] ok, so we need 3 tickets: one for vary header warning, 1 for 412 investigation and one for traffic disparity [18:48:34] Pchelolo, and one for http 412 [18:49:58] (03PS3) 10Bstorm: maintain-kubeusers: add ability to merge and update configs [puppet] - 10https://gerrit.wikimedia.org/r/545966 (https://phabricator.wikimedia.org/T236202) [18:50:03] !log brennen@deploy1001 Pruned MediaWiki: 1.35.0-wmf.1 (duration: 08m 09s) [18:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:09] 10Operations, 10MediaWiki-General, 10serviceops-radar, 10Performance-Team (Radar), and 2 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10eprodromou) a:05EvanProdromou→03aaron So, I think with the work that @aaron has done on movin... [18:50:38] (03CR) 10Bstorm: maintain-kubeusers: add ability to merge and update configs (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/545966 (https://phabricator.wikimedia.org/T236202) (owner: 10Bstorm) [18:53:20] !log reindexing Slovak wikis on elastic@eqiad and elastic@codfw (T235654) [18:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:25] T235654: Re-index Slovak Wikis to enable folding of Slovak diacritics after stemming - https://phabricator.wikimedia.org/T235654 [18:53:32] !log brennen@deploy1001 Started scap: testwiki to php-1.35.0-wmf.4 and rebuild l10n cache [18:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:40] !log jynus@cumin1001 dbctl commit (dc=all): 'Revert state to before overload+maintenance', diff saved to https://phabricator.wikimedia.org/P9501 and previous config saved to /var/cache/conftool/dbconfig/20191029-185438-jynus.json [18:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:52] (03CR) 10Herron: mtail,profile: add 
smtp metrics collection with mtail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/546290 (https://phabricator.wikimedia.org/T236505) (owner: 10Cwhite) [18:58:05] (03CR) 10Herron: profile: get exim metrics from lists (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/546992 (https://phabricator.wikimedia.org/T236505) (owner: 10Cwhite) [18:58:19] (03PS2) 10Cwhite: profile: get exim metrics from lists [puppet] - 10https://gerrit.wikimedia.org/r/546992 (https://phabricator.wikimedia.org/T236505) [18:59:21] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: new deployment group and access for design site - Volker Eckl, Jan Drewniak, Amir Sarabadani - https://phabricator.wikimedia.org/T236518 (10greg) >>! In T236518#5615669, @Dzahn wrote: > @Ottomata Thank you! No, i don't think it needs more discussion. B... [19:00:04] brennen and twentyafterfour: How many deployers does it take to do Mediawiki train - American Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191029T1900). 
[19:00:47] (03PS3) 10Cwhite: profile: get exim metrics from lists [puppet] - 10https://gerrit.wikimedia.org/r/546992 (https://phabricator.wikimedia.org/T236505) [19:05:41] (03CR) 10Faidon Liambotis: New esams stuff (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/545660 (https://phabricator.wikimedia.org/T235805) (owner: 10Ayounsi) [19:05:58] (03CR) 10Joal: [WIP] Refactor profile::analytics::refinery::job::import_mediawiki_dumps (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/546966 (https://phabricator.wikimedia.org/T234333) (owner: 10Joal) [19:06:39] (03PS3) 10Joal: [WIP] Refactor profile::analytics::refinery::job::import_mediawiki_dumps [puppet] - 10https://gerrit.wikimedia.org/r/546966 (https://phabricator.wikimedia.org/T234333) [19:07:47] (03PS1) 10Jcrespo: prometheus: Add new prometheus mysqld exporter group: test [puppet] - 10https://gerrit.wikimedia.org/r/547009 [19:08:45] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Refactor profile::analytics::refinery::job::import_mediawiki_dumps [puppet] - 10https://gerrit.wikimedia.org/r/546966 (https://phabricator.wikimedia.org/T234333) (owner: 10Joal) [19:09:13] (03CR) 10Thcipriani: [C: 03+1] "lgtm to get folks a group on the deployment host." [puppet] - 10https://gerrit.wikimedia.org/r/546303 (https://phabricator.wikimedia.org/T236518) (owner: 10Dzahn) [19:09:46] jouncebot: apparently, 2 [19:09:53] (03PS2) 10Jcrespo: prometheus: Add new prometheus mysqld exporter group: test [puppet] - 10https://gerrit.wikimedia.org/r/547009 [19:11:46] Pchelolo, _joe_ filed https://phabricator.wikimedia.org/T236833 for the oom [19:11:54] (03PS1) 10Jgreen: add check-endpoints to payments role in nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/547011 [19:11:55] Pchelolo, you filing the other ones? [19:12:05] will get back to our offsite business now. 
[19:12:57] (03PS4) 10Joal: [WIP] Refactor profile::analytics::refinery::job::import_mediawiki_dumps [puppet] - 10https://gerrit.wikimedia.org/r/546966 (https://phabricator.wikimedia.org/T234333) [19:13:40] 10Operations, 10Parsoid-PHP, 10serviceops: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (10Joe) We can raise the memory limit for parsoid-php a bit. We first need to find out if such requests use a reasonable amount of memory or not. [19:13:50] (03PS2) 10Jgreen: add check-endpoints to payments role in nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/547011 (https://phabricator.wikimedia.org/T212252) [19:14:24] subbu: yeah, I'll file the rest [19:14:31] ty. [19:14:44] !log brennen@deploy1001 Finished scap: testwiki to php-1.35.0-wmf.4 and rebuild l10n cache (duration: 21m 11s) [19:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:59] (03CR) 10jerkins-bot: [V: 04-1] add check-endpoints to payments role in nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/547011 (https://phabricator.wikimedia.org/T212252) (owner: 10Jgreen) [19:15:04] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Refactor profile::analytics::refinery::job::import_mediawiki_dumps [puppet] - 10https://gerrit.wikimedia.org/r/546966 (https://phabricator.wikimedia.org/T234333) (owner: 10Joal) [19:15:37] _joe_, mutante Pchelolo but for that .. overall, i think the deploy gave us what we wanted .. the cluster is still up and we found a few bugs. [19:15:44] 10Operations, 10LDAP-Access-Requests: Add bawolff to either NDA or WMF ldap group - https://phabricator.wikimedia.org/T236636 (10Ottomata) @MoritzMuehlenhoff should Brian be in `wmf` or `nda` LDAP group? [19:15:49] going offline again now. 
[19:15:57] <_joe_> yeah I'm off as well [19:16:15] have a nice offsite subbu and a nice evening _joe_ [19:16:29] (03PS3) 10Jgreen: add check-endpoints to payments role in nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/547011 (https://phabricator.wikimedia.org/T212252) [19:17:23] (03PS3) 10Ottomata: admins: create new deploy group for design, add 3 users [puppet] - 10https://gerrit.wikimedia.org/r/546303 (https://phabricator.wikimedia.org/T236518) (owner: 10Dzahn) [19:17:49] (03PS1) 1020after4: Design microsite: Set the scap deploy_user to "deploy-design" [puppet] - 10https://gerrit.wikimedia.org/r/547014 [19:18:06] (03PS5) 10Joal: [WIP] Refactor profile::analytics::refinery::job::import_mediawiki_dumps [puppet] - 10https://gerrit.wikimedia.org/r/546966 (https://phabricator.wikimedia.org/T234333) [19:19:01] (03CR) 1020after4: "The change to use the deploy-design user is in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/547014" [puppet] - 10https://gerrit.wikimedia.org/r/546303 (https://phabricator.wikimedia.org/T236518) (owner: 10Dzahn) [19:19:05] (03CR) 10Brennen Bearnes: [C: 03+2] Group0 to 1.35.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547001 (owner: 10Brennen Bearnes) [19:19:39] (03PS2) 1020after4: Design microsite: Set the scap deploy_user to "deploy-design" [puppet] - 10https://gerrit.wikimedia.org/r/547014 [19:19:57] (03Merged) 10jenkins-bot: Group0 to 1.35.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547001 (owner: 10Brennen Bearnes) [19:20:09] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Refactor profile::analytics::refinery::job::import_mediawiki_dumps [puppet] - 10https://gerrit.wikimedia.org/r/546966 (https://phabricator.wikimedia.org/T234333) (owner: 10Joal) [19:20:15] (03CR) 10Ottomata: [C: 03+2] admins: create new deploy group for design, add 3 users [puppet] - 10https://gerrit.wikimedia.org/r/546303 (https://phabricator.wikimedia.org/T236518) (owner: 10Dzahn) [19:20:40] (03CR) 10Cwhite: logstash: 
send PHP7 fatal-error messages type:mediawiki channel:fatal (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/546219 (https://phabricator.wikimedia.org/T234283) (owner: 10Effie Mouzeli) [19:24:10] (03CR) 10Herron: [C: 03+1] Introduce Elastic 7 support [puppet] - 10https://gerrit.wikimedia.org/r/545867 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi) [19:25:03] (03PS6) 10Joal: [WIP] Refactor profile::analytics::refinery::job::import_mediawiki_dumps [puppet] - 10https://gerrit.wikimedia.org/r/546966 (https://phabricator.wikimedia.org/T234333) [19:25:26] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.35.0-wmf.4 [19:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:26] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Refactor profile::analytics::refinery::job::import_mediawiki_dumps [puppet] - 10https://gerrit.wikimedia.org/r/546966 (https://phabricator.wikimedia.org/T234333) (owner: 10Joal) [19:28:36] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) For the record: * https://gerrit.wikimedia.org/r/c/operations/puppet/+/546928 (and followup https://gerrit.wikimedia.org/r/c... [19:29:50] (03CR) 10Ottomata: "From Tyler's comment on https://gerrit.wikimedia.org/r/c/operations/puppet/+/546303" [puppet] - 10https://gerrit.wikimedia.org/r/547014 (owner: 1020after4) [19:30:02] !log andrew@deploy1001 Started deploy [horizon/deploy@bab5d37]: (no justification provided) [19:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:27] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) I thought this error was due to a backup attempt, pre-patch. 
However, after I ran it manually, it failed again: ` 29-Oct 19... [19:31:19] (03CR) 10Bartosz Dziewoński: [C: 03+1] Re-enable mobile editor A/B testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546724 (https://phabricator.wikimedia.org/T236337) (owner: 10DLynch) [19:31:36] !log andrew@deploy1001 Finished deploy [horizon/deploy@bab5d37]: (no justification provided) (duration: 01m 35s) [19:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:46] !log restarting bacula-fd on install1002 T236406 [19:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:50] T236406: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 [19:34:46] (03PS1) 10Eevans: Migrate to Kask for Echo seen-time storage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547022 (https://phabricator.wikimedia.org/T222851) [19:35:49] (03PS7) 10Joal: [WIP] Refactor profile::analytics::refinery::job::import_mediawiki_dumps [puppet] - 10https://gerrit.wikimedia.org/r/546966 (https://phabricator.wikimedia.org/T234333) [19:36:30] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) That wasn't enough: ` 158826 Full 0 0 Error 29-Oct-19 19:35 install1002.wikimedia.org-Monthly-1st-Wed... [19:36:39] 10Operations, 10Mobile-Content-Service, 10Page Content Service, 10Product-Infrastructure-Team-Backlog, and 4 others: New Service Request: wikifeeds - https://phabricator.wikimedia.org/T223469 (10Ottomata) @Joe can you assign this to someone if it is in the 'Doing' status? 
:) [19:37:12] (03CR) 1020after4: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/547014 (owner: 1020after4) [19:37:55] (03PS2) 10Eevans: Migrate to Kask for Echo seen-time storage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547022 (https://phabricator.wikimedia.org/T222851) [19:38:07] !log andrew@deploy1001 Started deploy [horizon/deploy@dbe892e]: (no justification provided) [19:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:39] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Ottomata) a:03Anomie @Anomie, assigning to you as it see... [19:41:43] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Icinga, and 2 others: Remove elasticsearch icinga checks from logstash collectors - https://phabricator.wikimedia.org/T218691 (10Ottomata) Is there a plan to work on this in the near term? It needs triaged and assigned! :) [19:42:06] !log andrew@deploy1001 Finished deploy [horizon/deploy@dbe892e]: (no justification provided) (duration: 03m 59s) [19:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:47] (03PS3) 10Cwhite: logstash: send PHP7 fatal-error messages type:mediawiki channel:fatal [puppet] - 10https://gerrit.wikimedia.org/r/546219 (https://phabricator.wikimedia.org/T234283) (owner: 10Effie Mouzeli) [19:44:31] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Ottomata) a:03Ottomata @Andrew, assigning to you as you seem to be leading this parent task. Feel free to undo or reassign as necessary. 
[19:44:40] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Ottomata) a:05Ottomata→03Andrew [19:47:26] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) I got it, it was the storage daemon that hadn't been restarted, not the clients (that is why the director could connect, but... [19:47:48] (03CR) 10Cwhite: [C: 03+2] "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/19122/" [puppet] - 10https://gerrit.wikimedia.org/r/546260 (https://phabricator.wikimedia.org/T236505) (owner: 10Cwhite) [19:47:57] (03PS5) 10Cwhite: prometheus, profile: add file count feature and enable lists queue tracking [puppet] - 10https://gerrit.wikimedia.org/r/546260 (https://phabricator.wikimedia.org/T236505) [19:50:46] 10Operations, 10LDAP-Access-Requests: Add bawolff to either NDA or WMF ldap group - https://phabricator.wikimedia.org/T236636 (10Ottomata) p:05Triage→03High [19:51:31] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: new deployment group and access for design site - Volker Eckl, Jan Drewniak, Amir Sarabadani - https://phabricator.wikimedia.org/T236518 (10Ottomata) p:05Triage→03High a:03Ottomata [19:52:01] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [19:53:28] 10Operations: Upgrade CAS to 6.1.0 - https://phabricator.wikimedia.org/T236815 (10Ottomata) p:05Triage→03Normal [19:54:15] 10Operations, 10serviceops: Reimage mwdebug1002 and mw1317 - https://phabricator.wikimedia.org/T236806 (10Ottomata) p:05Triage→03Normal a:03jijiki @jijiki assigning to you but feel free to undo or re-assign. 
[20:01:34] (03CR) 10Jgreen: [C: 03+2] add check-endpoints to payments role in nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/547011 (https://phabricator.wikimedia.org/T212252) (owner: 10Jgreen) [20:01:41] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: new deployment group and access for design site - Volker Eckl, Jan Drewniak, Amir Sarabadani - https://phabricator.wikimedia.org/T236518 (10Ottomata) @thcipriani can you help with https://gerrit.wikimedia.org/r/c/operations/puppet/+/547014? I'm not en... [20:02:28] 10Operations, 10Wikimedia-Mailing-lists: Create mailing list for Wikidebate project - https://phabricator.wikimedia.org/T236829 (10Ottomata) @Dzahn should I just create this mailing list or is there an approval process? [20:02:42] 10Operations, 10Wikimedia-Mailing-lists: Create mailing list for Wikidebate project - https://phabricator.wikimedia.org/T236829 (10Ottomata) p:05Triage→03High a:03Ottomata [20:06:43] 10Operations, 10ops-esams, 10Epic: Remove all decommissioned hardware - https://phabricator.wikimedia.org/T184063 (10Papaul) 05Open→03Resolved All the servers in the above list are wiped and out of the racks. This can be resolve. [20:06:45] 10Operations, 10ops-esams, 10Epic: SRE 2017-18 Q3 goal Cleanup esams and refresh servers and infrastructure (tracking) - https://phabricator.wikimedia.org/T184061 (10Papaul) [20:12:23] 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10Papaul) [20:14:36] (03PS6) 10Jhedden: ceph: add k8s manifests for ceph deployment using rook [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) [20:15:50] (03CR) 10Thcipriani: "We should add a keyholder::agents section to hieradata/role/common/deployment_server.yaml as well. 
That will cover the private half of the" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547014 (owner: 1020after4) [20:18:54] (03PS7) 10Jhedden: ceph: add k8s manifests for ceph deployment using rook [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) [20:18:56] 10Operations, 10SRE-Access-Requests: new deployment group and access for design site - Volker Eckl, Jan Drewniak, Amir Sarabadani - https://phabricator.wikimedia.org/T236518 (10thcipriani) >>! In T236518#5617222, @Ottomata wrote: > @thcipriani can you help with https://gerrit.wikimedia.org/r/c/operations/puppe... [20:19:03] !log reindexing Slovak wikis on elastic@eqiad and elastic@codfw complete (T235654) [20:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:09] T235654: Re-index Slovak Wikis to enable folding of Slovak diacritics after stemming - https://phabricator.wikimedia.org/T235654 [20:19:23] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: decommission cp3001 & cp3002 - https://phabricator.wikimedia.org/T94215 (10Papaul) 05Open→03Resolved This is complete [20:19:25] 10Operations, 10ops-esams, 10Epic: Remove all decommissioned hardware - https://phabricator.wikimedia.org/T184063 (10Papaul) [20:20:08] (03PS1) 10Cwhite: prometheus: update file count script to use single metric instance [puppet] - 10https://gerrit.wikimedia.org/r/547028 (https://phabricator.wikimedia.org/T236505) [20:24:34] (03PS8) 10Jhedden: ceph: add k8s manifests for ceph deployment using rook [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) [20:24:56] (03CR) 10Cwhite: [C: 03+2] prometheus: update file count script to use single metric instance [puppet] - 10https://gerrit.wikimedia.org/r/547028 (https://phabricator.wikimedia.org/T236505) (owner: 10Cwhite) [20:25:50] 10Operations, 10ops-esams, 10Traffic, 10decommission: Decommission cp3007-cp3010 - https://phabricator.wikimedia.org/T208585 
(10Papaul) [20:26:15] 10Operations, 10ops-esams, 10Traffic, 10decommission: Decommission cp3007-cp3010 - https://phabricator.wikimedia.org/T208585 (10Papaul) 05Open→03Resolved a:03Papaul Complete [20:26:19] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10Papaul) [20:27:36] (03PS1) 10CDanis: grafana: double-proxy for wpt-graphite [puppet] - 10https://gerrit.wikimedia.org/r/547030 (https://phabricator.wikimedia.org/T231870) [20:28:42] (03CR) 10Cwhite: "> Patch Set 1:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/546290 (https://phabricator.wikimedia.org/T236505) (owner: 10Cwhite) [20:31:21] (03PS9) 10Jhedden: ceph: add k8s manifests for ceph deployment using rook [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) [20:33:39] (03CR) 10CDanis: "PCC looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/547030 (https://phabricator.wikimedia.org/T231870) (owner: 10CDanis) [20:33:56] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10serviceops: How should the MachineVision extension interact with external APIs from production? - https://phabricator.wikimedia.org/T236797 (10Mholloway) It looks like what we need to do is ensure that our outbound requests can use an... 
[20:37:25] (03PS10) 10Jhedden: ceph: add k8s manifests for ceph deployment using rook [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) [20:43:43] !log rebooting cp3056 for HW check [20:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:56] (03PS11) 10Jhedden: ceph: add k8s manifests for ceph deployment using rook [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) [20:45:25] 10Operations, 10SRE-Access-Requests: new deployment group and access for design site - Volker Eckl, Jan Drewniak, Amir Sarabadani - https://phabricator.wikimedia.org/T236518 (10Dzahn) Created new keypair for design and committed in private repo on the puppetmaster. ` remote: modules/secret/secrets/keyholde... [20:46:09] 10Operations, 10Wikimedia-Mailing-lists: Enable CAPTCHA on mailman instances - https://phabricator.wikimedia.org/T194558 (10colewhite) It looks like recaptcha was built in recently and is available in buster https://bugs.launchpad.net/mailman/+bug/1774826 [20:46:55] 10Operations, 10SRE-Access-Requests: new deployment group and access for design site - Volker Eckl, Jan Drewniak, Amir Sarabadani - https://phabricator.wikimedia.org/T236518 (10Dzahn) The passphrase to arm the key is the same for all deployment keys (since T154943). It's stored in pwstore in the file deploym... [20:48:45] (03CR) 10Dzahn: Design microsite: Set the scap deploy_user to "deploy-design" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547014 (owner: 1020after4) [20:49:41] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10Papaul) @BBlack indeed I was getting an error on the PCIe card and I did removed/ insert it and was not getting the error anymore. Please try to re-image the server and let me know. Thanks. 
[20:50:09] (03PS1) 10Ottomata: statistics - rename published-datasets to just published [puppet] - 10https://gerrit.wikimedia.org/r/547041 (https://phabricator.wikimedia.org/T235494) [20:53:32] (03CR) 10Dzahn: "currently compiler fails with "invalid secret keyholder/deploy_design.pub". that means we have to add them in labs/private too but also i " [puppet] - 10https://gerrit.wikimedia.org/r/547014 (owner: 1020after4) [20:54:23] (03PS2) 10Ottomata: statistics - rename published-datasets to just published [puppet] - 10https://gerrit.wikimedia.org/r/547041 (https://phabricator.wikimedia.org/T235494) [20:54:56] (03PS12) 10Jhedden: ceph: add k8s manifests for ceph deployment using rook [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) [20:58:16] (03PS3) 10Ottomata: statistics - rename published-datasets to just published [puppet] - 10https://gerrit.wikimedia.org/r/547041 (https://phabricator.wikimedia.org/T235494) [20:59:02] 10Operations, 10Wikimedia-Mailing-lists: Close wikimediameta-l mailing list - https://phabricator.wikimedia.org/T233666 (10colewhite) a:03colewhite [20:59:14] 10Operations, 10Wikimedia-Mailing-lists: Close wikimediameta-l mailing list - https://phabricator.wikimedia.org/T233666 (10colewhite) Done. [20:59:20] 10Operations, 10Wikimedia-Mailing-lists: Close wikimediameta-l mailing list - https://phabricator.wikimedia.org/T233666 (10colewhite) 05Open→03Resolved [21:03:07] 10Operations, 10SRE-Access-Requests: new deployment group and access for design site - Volker Eckl, Jan Drewniak, Amir Sarabadani - https://phabricator.wikimedia.org/T236518 (10Dzahn) Renamed the keys to use the "deploy_" prefix. (Some keys are just called $service, some deploy_$services and some $service_depl... 
[21:06:21] 10Operations, 10ops-esams: Prepare racks OE14, OE15 and OE16 with new infrastructure - https://phabricator.wikimedia.org/T184064 (10Papaul) 05Open→03Resolved a:03Papaul This is complete [21:06:23] 10Operations, 10ops-esams, 10Epic: SRE 2017-18 Q3 goal Cleanup esams and refresh servers and infrastructure (tracking) - https://phabricator.wikimedia.org/T184061 (10Papaul) [21:08:19] (03CR) 10Gehel: "minor questions inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/545867 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi) [21:09:28] 10Operations, 10Parsoid-PHP, 10serviceops: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (10Izno) The Syrian civil war template is a known issue on wiki even with legacy parser. We've had to remove details at least once from the template before. Just fyi. [21:10:29] (03PS1) 10Dzahn: add fake deployment keys for new design group [labs/private] - 10https://gerrit.wikimedia.org/r/547044 (https://phabricator.wikimedia.org/T236518) [21:11:32] (03PS2) 10Dzahn: add fake deployment keys for new design group [labs/private] - 10https://gerrit.wikimedia.org/r/547044 (https://phabricator.wikimedia.org/T236518) [21:12:27] (03CR) 10Dzahn: [C: 03+2] add fake deployment keys for new design group [labs/private] - 10https://gerrit.wikimedia.org/r/547044 (https://phabricator.wikimedia.org/T236518) (owner: 10Dzahn) [21:12:36] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake deployment keys for new design group [labs/private] - 10https://gerrit.wikimedia.org/r/547044 (https://phabricator.wikimedia.org/T236518) (owner: 10Dzahn) [21:12:45] (03CR) 10Gehel: Introduce Elastic 7 support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545867 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi) [21:13:53] (03CR) 10Dzahn: Design microsite: Set the scap deploy_user to "deploy-design" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547014 (owner: 1020after4) 
[21:14:30] (CR) Dzahn: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1003/19127/bromine.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/547014 (owner: 20after4)
[21:19:25] (CR) Dzahn: [C: +2] site/install_server: decom ununpentium [puppet] - https://gerrit.wikimedia.org/r/544080 (https://phabricator.wikimedia.org/T180641) (owner: Dzahn)
[21:22:45] (PS1) Dzahn: mariadb::ferm_misc: remove ununpentium from allowed hosts [puppet] - https://gerrit.wikimedia.org/r/547045 (https://phabricator.wikimedia.org/T236748)
[21:23:29] (PS2) Dzahn: mariadb::ferm_misc: remove ununpentium from allowed hosts for RT [puppet] - https://gerrit.wikimedia.org/r/547045 (https://phabricator.wikimedia.org/T236748)
[21:26:55] (CR) Dzahn: [C: +2] mariadb::ferm_misc: remove ununpentium from allowed hosts for RT [puppet] - https://gerrit.wikimedia.org/r/547045 (https://phabricator.wikimedia.org/T236748) (owner: Dzahn)
[21:27:18] (CR) Dzahn: "this is a noop again" [puppet] - https://gerrit.wikimedia.org/r/547045 (https://phabricator.wikimedia.org/T236748) (owner: Dzahn)
[21:30:04] (CR) Dzahn: "nevermind, behaves as expected on dbproxy1001" [puppet] - https://gerrit.wikimedia.org/r/547045 (https://phabricator.wikimedia.org/T236748) (owner: Dzahn)
[21:32:01] (PS1) Dzahn: remove ununpentium.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/547047 (https://phabricator.wikimedia.org/T236748)
[21:35:41] (CR) Subramanya Sastry: [C: +1] Remove unused $hostName variable in CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/546641 (owner: Krinkle)
[21:43:04] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[21:46:19] (PS1) Dzahn: requesttracker: rsync the database dump to new server [puppet] - https://gerrit.wikimedia.org/r/547051 (https://phabricator.wikimedia.org/T180641)
[21:46:21] (PS1) Dzahn: requesttracker: remove rsync and special case for jessie [puppet] - https://gerrit.wikimedia.org/r/547052 (https://phabricator.wikimedia.org/T180641)
[21:48:22] (CR) jerkins-bot: [V: -1] requesttracker: rsync the database dump to new server [puppet] - https://gerrit.wikimedia.org/r/547051 (https://phabricator.wikimedia.org/T180641) (owner: Dzahn)
[21:49:54] (PS2) Dzahn: requesttracker: rsync the database dump to new server [puppet] - https://gerrit.wikimedia.org/r/547051 (https://phabricator.wikimedia.org/T180641)
[21:50:49] (CR) Alex Monk: "require_package doesn't produce duplicate resources - see modules/wmflib/lib/puppet/parser/functions/require_package.rb" [puppet] - https://gerrit.wikimedia.org/r/545081 (https://phabricator.wikimedia.org/T235252) (owner: Alex Monk)
[21:51:27] netbox report is me, having issues bringing a node online, so it's in a strange state
[21:52:00] (CR) jerkins-bot: [V: -1] requesttracker: rsync the database dump to new server [puppet] - https://gerrit.wikimedia.org/r/547051 (https://phabricator.wikimedia.org/T180641) (owner: Dzahn)
[21:54:57] Operations, Wikimedia-Mailing-lists, Space (Jan-Mar-2020): Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (Qgil)
[21:55:32] Operations, SRE-Access-Requests, Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (CGlenn)
[21:59:34] (PS3) Dzahn: requesttracker: rsync the database dump to new server [puppet] - https://gerrit.wikimedia.org/r/547051 (https://phabricator.wikimedia.org/T180641)
[21:59:49] Operations, SRE-Access-Requests, Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (CGlenn)
[22:00:18] (CR) jerkins-bot: [V: -1] requesttracker: rsync the database dump to new server [puppet] - https://gerrit.wikimedia.org/r/547051 (https://phabricator.wikimedia.org/T180641) (owner: Dzahn)
[22:01:30] (PS4) Dzahn: requesttracker: rsync the database dump to new server [puppet] - https://gerrit.wikimedia.org/r/547051 (https://phabricator.wikimedia.org/T180641)
[22:03:44] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/19130/moscovium.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/547051 (https://phabricator.wikimedia.org/T180641) (owner: Dzahn)
[22:03:47] (CR) jerkins-bot: [V: -1] requesttracker: rsync the database dump to new server [puppet] - https://gerrit.wikimedia.org/r/547051 (https://phabricator.wikimedia.org/T180641) (owner: Dzahn)
[22:03:58] (PS5) Dzahn: requesttracker: rsync the database dump to new server [puppet] - https://gerrit.wikimedia.org/r/547051 (https://phabricator.wikimedia.org/T180641)
[22:05:28] (PS6) Dzahn: requesttracker: rsync the database dump to new server [puppet] - https://gerrit.wikimedia.org/r/547051 (https://phabricator.wikimedia.org/T180641)
[22:05:34] Operations, SRE-Access-Requests, Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (CGlenn) I updated the SSH-key.
[22:10:39] (CR) Dzahn: [C: +2] requesttracker: rsync the database dump to new server [puppet] - https://gerrit.wikimedia.org/r/547051 (https://phabricator.wikimedia.org/T180641) (owner: Dzahn)
[22:11:02] RECOVERY - Check the Netbox report puppetdb for fail status. on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[22:14:18] !log rsynced data dump and config from ununpentium to moscovium in /srv/ before shutting down the old server (T180641)
[22:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:14:23] T180641: reinstall RT server with private IP and Buster - https://phabricator.wikimedia.org/T180641
[22:17:46] !log ununpentium - shutdown Ganeti VM - running decom script, schedule icinga downtime (T236748)
[22:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:51] T236748: decom ununpentium - https://phabricator.wikimedia.org/T236748
[22:19:55] Operations, SRE-Access-Requests, Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (Nuria) @CGlenn : see my comment above, we need approval from your manager in phab.
[22:20:33] PROBLEM - Host cp3056 is DOWN: PING CRITICAL - Packet loss = 100%
[22:21:05] ^ still me
[22:21:41] Operations, SRE-Access-Requests, Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (CGlenn) @Nuria Yes, absolutely. I am working on getting my manager to create an account in phab. He is new.
[22:24:37] RECOVERY - Host cp3056 is UP: PING OK - Packet loss = 0%, RTA = 83.37 ms
[22:24:42] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission
[22:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:06] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[22:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:12] Operations, Patch-For-Review: decom ununpentium - https://phabricator.wikimedia.org/T236748 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `ununpentium.wikimedia.org` - ununpentium.wikimedia.org (**FAIL**) - Downtimed host on Icinga - No management inter...
[22:32:31] Operations, Patch-For-Review: decom ununpentium - https://phabricator.wikimedia.org/T236748 (Dzahn) ^ Gone from Icinga. Status in Netbox was "offline" without manual intervention and there is no other status that would fit decom more. It was already shutdown in Ganeti hence the shutdown error.
[22:36:48] (PS2) Dzahn: requesttracker: remove rsync and special cases for jessie [puppet] - https://gerrit.wikimedia.org/r/547052 (https://phabricator.wikimedia.org/T180641)
[22:39:55] (CR) Dzahn: [C: +2] requesttracker: remove rsync and special cases for jessie [puppet] - https://gerrit.wikimedia.org/r/547052 (https://phabricator.wikimedia.org/T180641) (owner: Dzahn)
[22:40:06] (PS3) Dzahn: requesttracker: remove rsync and special cases for jessie [puppet] - https://gerrit.wikimedia.org/r/547052 (https://phabricator.wikimedia.org/T180641)
[22:43:00] Operations, Machine vision, Product-Infrastructure-Team-Backlog, serviceops: How should the MachineVision extension interact with external APIs from production? - https://phabricator.wikimedia.org/T236797 (Mholloway) I see an `http_proxy` value defined in hieradata/common.yaml, but is it exposed...
[22:53:57] (CR) Dzahn: [C: +2] remove ununpentium.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/547047 (https://phabricator.wikimedia.org/T236748) (owner: Dzahn)
[22:54:49] Operations, Patch-For-Review: decom ununpentium - https://phabricator.wikimedia.org/T236748 (Dzahn) Open→Resolved
[22:55:52] Operations, Patch-For-Review: reinstall RT server with private IP and Buster - https://phabricator.wikimedia.org/T180641 (Dzahn) RT is now on buster on moscovium and using a private IP. ununpentium the jessie machine has been decom'ed.
[22:56:54] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (Dzahn)
[22:56:58] Operations, Patch-For-Review: reinstall RT server with private IP and Buster - https://phabricator.wikimedia.org/T180641 (Dzahn)
[22:57:09] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (Dzahn)
[23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191029T2300).
[23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:01:18] (CR) Urbanecm: [C: +2] Revert "Milestone lobo for atjwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/546912 (https://phabricator.wikimedia.org/T236777) (owner: Urbanecm)
[23:01:24] * Urbanecm is deploying
[23:02:02] (Merged) jenkins-bot: Revert "Milestone lobo for atjwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/546912 (https://phabricator.wikimedia.org/T236777) (owner: Urbanecm)
[23:03:06] Operations, ops-eqiad, DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 (RobH) Open→Resolved all green in icinga, resolving task
[23:03:08] Operations, ops-eqiad, DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (RobH)
[23:03:36] (CR) Dzahn: "this is the usual problem of almost any monitoring::graphite_threshold check. always seems good in theory but finding the thresholds is pr" [puppet] - https://gerrit.wikimedia.org/r/546989 (owner: Hashar)
[23:04:31] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: SWAT: f7b9972: Revert "Milestone lobo for atjwiki" (T236777) (duration: 01m 01s)
[23:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:04:37] T236777: Revert celebration logo for atjwiki - https://phabricator.wikimedia.org/T236777
[23:04:51] (CR) Dzahn: [C: +2] zuul: change gear queue monitoring threshold [puppet] - https://gerrit.wikimedia.org/r/546989 (owner: Hashar)
[23:05:41] !log Purge https://en.wikipedia.org/static/images/project-logos/atjwiki* (T236777)
[23:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:05:54] RECOVERY - Check systemd state on elastic1039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:05:58] !log Evening SWAT done
[23:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:39] (CR) Dzahn: "> Patch Set 6:" [puppet] - https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) (owner: MarcoAurelio)
[23:06:56] (CR) Dzahn: [C: +1] mediawiki::web:prod_sites.pp: Apache config for ge.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) (owner: MarcoAurelio)
[23:07:39] Operations, ops-esams, DC-Ops, Traffic: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (BBlack) I've tried imaging, and things mostly work, but I have a hard time keeping it online long enough to get through an initial puppet agent run (or two or three), as the kernel keeps pan...
[23:09:04] !log ganeti1003 - gnt-instance remove ununpentium.wikimedia.org (T236748)
[23:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:09:08] T236748: decom ununpentium - https://phabricator.wikimedia.org/T236748
[23:10:42] PROBLEM - Check systemd state on elastic1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:11:34] Operations, serviceops: Set up LVS for parsoid/PHP - https://phabricator.wikimedia.org/T233722 (Dzahn) " lvs parsoid-php workaround " https://gerrit.wikimedia.org/r/c/operations/puppet/+/545619
[23:13:25] Operations, serviceops: Set up LVS for parsoid/PHP - https://phabricator.wikimedia.org/T233722 (Dzahn) Open→Resolved a:Dzahn→Joe
[23:13:28] Operations, serviceops: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (Dzahn)
[23:13:54] Operations, SRE-Access-Requests: new deployment group and access for design site - Volker Eckl, Jan Drewniak, Amir Sarabadani - https://phabricator.wikimedia.org/T236518 (Dzahn) a:Ottomata→Dzahn
[23:28:11] (PS13) Jhedden: ceph: add k8s manifests for ceph deployment using rook [puppet] - https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290)
[23:36:16] RECOVERY - Check systemd state on elastic1039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:39:55] (CR) Bstorm: "I also have now applied this all to Id2adcd74b7415648b3bafe17980 as well" [puppet] - https://gerrit.wikimedia.org/r/545966 (https://phabricator.wikimedia.org/T236202) (owner: Bstorm)
[23:41:02] PROBLEM - Check systemd state on elastic1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:42:04] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=wtp1025.eqiad.wmnet,service=parsoid-php
[23:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:33] (CR) BryanDavis: [C: +1] maintain-kubeusers: add ability to merge and update configs [puppet] - https://gerrit.wikimedia.org/r/545966 (https://phabricator.wikimedia.org/T236202) (owner: Bstorm)
[23:52:38] (PS4) Bstorm: maintain-kubeusers: add ability to merge and update configs [puppet] - https://gerrit.wikimedia.org/r/545966 (https://phabricator.wikimedia.org/T236202)
[23:53:43] (CR) Bstorm: [C: +2] maintain-kubeusers: add ability to merge and update configs [puppet] - https://gerrit.wikimedia.org/r/545966 (https://phabricator.wikimedia.org/T236202) (owner: Bstorm)