[00:17:07] 10Operations, 10LDAP-Access-Requests: Add Conny Kawohl to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T257038 (10KFrancis) @Dzahn Thanks!!! I'll confirm when the NDA is fully executed. [00:21:36] (03PS1) 10Dzahn: planet: removing a few more broken feed URLs [puppet] - 10https://gerrit.wikimedia.org/r/609871 (https://phabricator.wikimedia.org/T168459) [00:33:27] (03CR) 10Dzahn: planet: removing a few more broken feed URLs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609871 (https://phabricator.wikimedia.org/T168459) (owner: 10Dzahn) [00:34:35] (03CR) 10Dzahn: planet: removing a few more broken feed URLs (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/609871 (https://phabricator.wikimedia.org/T168459) (owner: 10Dzahn) [00:34:43] (03CR) 10Dzahn: [C: 03+2] planet: removing a few more broken feed URLs [puppet] - 10https://gerrit.wikimedia.org/r/609871 (https://phabricator.wikimedia.org/T168459) (owner: 10Dzahn) [00:42:48] (03PS1) 10Dzahn: zuul: remove gerrit-test connection and setup [puppet] - 10https://gerrit.wikimedia.org/r/609875 (https://phabricator.wikimedia.org/T239151) [00:51:19] (03PS1) 10Dzahn: acme_chief: remove gerrit-test [puppet] - 10https://gerrit.wikimedia.org/r/609878 (https://phabricator.wikimedia.org/T239151) [00:53:37] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 56 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:54:05] (03PS1) 10Dzahn: site/DHCP/partman: decom gerrit1002 [puppet] - 10https://gerrit.wikimedia.org/r/609879 (https://phabricator.wikimedia.org/T239151) [00:55:23] (03PS2) 10Dzahn: acme_chief: remove gerrit-test [puppet] - 10https://gerrit.wikimedia.org/r/609878 (https://phabricator.wikimedia.org/T239151) [00:57:40] (03PS1) 10Dzahn: gerrit: stop rsyncing to gerrit1002 [puppet] - 
10https://gerrit.wikimedia.org/r/609883 (https://phabricator.wikimedia.org/T239151) [00:58:56] 10Operations, 10Traffic: HTML Dumps 429 error on RESTBase endpoints - https://phabricator.wikimedia.org/T255524 (10RBrounley_WMF) 05Open→03Resolved [00:59:27] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 46 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:59:37] (03CR) 10Dzahn: "this will be the last step to go with running a cookbook to remove the VM" [puppet] - 10https://gerrit.wikimedia.org/r/609879 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [01:04:18] (03PS1) 10Dzahn: mariadb: remove ferm firewall hole for gerrit servers [puppet] - 10https://gerrit.wikimedia.org/r/609884 (https://phabricator.wikimedia.org/T239151) [01:07:56] (03PS1) 10Dzahn: remove gerrit-test.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/609886 (https://phabricator.wikimedia.org/T239151) [01:07:58] (03PS1) 10Dzahn: remove gerrit1002.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/609887 (https://phabricator.wikimedia.org/T239151) [01:14:35] (03PS1) 10Dzahn: annualreport: update redirect from 2018 to 2019 report [puppet] - 10https://gerrit.wikimedia.org/r/609888 (https://phabricator.wikimedia.org/T257257) [01:18:59] 10Operations, 10Patch-For-Review: Update annual.wikimedia.org redirect to point to latest Annual Report - https://phabricator.wikimedia.org/T257257 (10Dzahn) [01:19:02] 10Operations, 10WMF-Annual-Report, 10serviceops, 10Patch-For-Review: Redirects for 2018 Annual Report - https://phabricator.wikimedia.org/T226066 (10Dzahn) [01:20:09] 10Operations, 10Patch-For-Review: Update annual.wikimedia.org redirect to point to 2019 Annual Report - https://phabricator.wikimedia.org/T257257 (10Dzahn) [01:40:15] RECOVERY - WDQS high update lag on wdqs1007 is OK: (C)4.32e+04 ge 
(W)2.16e+04 ge 2.131e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [01:42:28] (03PS1) 10Dzahn: httpbb: update test case for annual.wikimedia.org to 2019 [puppet] - 10https://gerrit.wikimedia.org/r/609891 (https://phabricator.wikimedia.org/T257257) [02:06:10] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.35.0-wmf.40 [core] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/609892 [02:29:52] (03PS1) 10Catrope: Enable GrowthExperiments on Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609893 (https://phabricator.wikimedia.org/T253291) [02:32:13] * Krinkle stares at P3P http response headers from CentralAuth [02:38:29] (03PS1) 10Jeena Huneidi: Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) [02:38:36] (03CR) 10jerkins-bot: [V: 04-1] Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) (owner: 10Jeena Huneidi) [02:39:51] (03PS2) 10Jeena Huneidi: Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) [02:39:58] (03CR) 10jerkins-bot: [V: 04-1] Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) (owner: 10Jeena Huneidi) [02:41:55] (03PS3) 10Jeena Huneidi: Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) [02:50:25] (03PS4) 10Jeena Huneidi: Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) [02:57:01] (03PS5) 10Jeena Huneidi: Kask: Use Releng Cassandra Image [deployment-charts] - 
10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) [04:04:08] (03PS2) 10DannyS712: Branch commit for wmf/1.35.0-wmf.40 [core] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/609892 (https://phabricator.wikimedia.org/T256668) (owner: 10TrainBranchBot) [04:15:28] 10Operations, 10DBA: db1079 BBU crashed host rebooted - https://phabricator.wikimedia.org/T257216 (10Marostegui) Thank you, I can see it now! ` => controller all show status Smart Array P840 in Slot 1 Controller Status: OK Cache Status: Not Configured Battery/Capacitor Status: OK => ` I have start... [04:16:08] (03PS1) 10Marostegui: Revert "db1079: Add broken BBU status" [puppet] - 10https://gerrit.wikimedia.org/r/609638 [04:16:50] (03CR) 10Marostegui: [C: 03+2] Revert "db1079: Add broken BBU status" [puppet] - 10https://gerrit.wikimedia.org/r/609638 (owner: 10Marostegui) [04:17:45] (03PS3) 10Marostegui: mariadb: Promote es1024 to es5 master [puppet] - 10https://gerrit.wikimedia.org/r/607236 (https://phabricator.wikimedia.org/T255755) [04:18:01] (03PS4) 10Marostegui: db-eqiad.php: Depool cluster27 (es5) from writes. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755) [04:27:23] (03CR) 10Marostegui: mariadb: Promote es1024 to es5 master [puppet] - 10https://gerrit.wikimedia.org/r/607236 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui) [04:27:25] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote es1024 to es5 master [puppet] - 10https://gerrit.wikimedia.org/r/607236 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui) [04:36:04] (03CR) 10Marostegui: "I guess we should remove grants and/or rename its tables to make sure nothing really uses it, and if so, and after a few days, taking a fi" [puppet] - 10https://gerrit.wikimedia.org/r/606549 (https://phabricator.wikimedia.org/T254158) (owner: 10Dzahn) [04:38:36] 10Operations, 10MediaWiki-General, 10serviceops-radar, 10Performance-Team (Radar), and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Marostegui) The hardware has been budgeted. We are expecting to buy it in Q2, is that ok with you?... [04:50:26] we will disable es5 writes in 10 minutes to do a switchover on its primary master, if all goes well, it should be transparent for everyone [04:58:59] (03CR) 10Marostegui: db-eqiad.php: Depool cluster27 (es5) from writes. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui) [04:59:32] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool cluster27 (es5) from writes. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui) [05:00:04] marostegui, jynus, and kormat: I, the Bot under the Fountain, allow thee, The Deployer, to do es5 database master failover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200707T0500). [05:00:13] o/ [05:00:14] (03Merged) 10jenkins-bot: db-eqiad.php: Depool cluster27 (es5) from writes. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui)
[05:00:25] o/
[05:00:48] Let's go then kormat and jynus?
[05:00:53] ok
[05:01:03] Going to deploy the writes disablement
[05:01:29] !log "Starting es failover from es1023 to es1024 - https://phabricator.wikimedia.org/T255755"
[05:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:01] marostegui: +1
[05:02:36] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Disable es5 writes T255755 (duration: 00m 56s)
[05:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:40] T255755: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755
[05:02:45] Done, let's see if the master stops having activity
[05:02:56] Also monitoring errors
[05:03:34] the master has stopped having wikiuser connections
[05:03:42] no errors on es servers, can also confirm that
[05:03:47] let me check write stats
[05:04:04] Checking https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-5m&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=es1023&var-port=9104
[05:04:09] There is a drop there too
[05:04:20] only 1 insert and 1 replace per second
[05:04:26] probably heartbeat
[05:04:29] one is the heartbeat I guess
[05:04:37] let me check binlog
[05:04:51] 1 replace per second now only
[05:05:30] yeah, binlog looks clean
[05:05:34] only heartbeat
[05:05:49] Going to go ahead and do the switchover
[05:07:09] btw, once we are done, I will document this process, as it is a bit different from the normal sX switchover
[05:07:22] great :)
[05:07:23] replication changed
[05:08:00] I think tendril needs manual fixing
[05:08:33] doing
[05:08:37] thanks
[05:08:56] replication_tree looks good though
[05:09:10] going to start making dbctl changes
[05:10:38] I updated tendril, can you paste the script's output to debug at a later time?
[05:10:50] Updating tendril...
[05:10:50] [WARNING] Old master not found on tendril server list
[05:11:04] ok, thanks
[05:11:13] I am changing dbctl, as it looks like es1024 needs to have the candidate_master flag enabled to be able to be promoted to master
[05:11:16] zarcillo did not complain?
[05:11:48] nope
[05:12:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1024 to es5 master T255755', diff saved to https://phabricator.wikimedia.org/P11758 and previous config saved to /var/cache/conftool/dbconfig/20200707-051236-marostegui.json
[05:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:12:42] T255755: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755
[05:13:17] running puppet
[05:13:43] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool cluster27 (es5) from writes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609639
[05:15:20] Let's enable es5 back then?
[05:15:46] I am going to leave replication stopped on the old master (es1023 running 10.1)
[05:16:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1023 entirely T255755', diff saved to https://phabricator.wikimedia.org/P11759 and previous config saved to /var/cache/conftool/dbconfig/20200707-051620-marostegui.json
[05:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:58] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool cluster27 (es5) from writes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609639 (owner: 10Marostegui)
[05:17:31] I have issues accessing wmf network, but it is possibly just me
[05:17:39] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool cluster27 (es5) from writes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609639 (owner: 10Marostegui)
[05:17:42] which network?
[05:17:49] possible ddos [05:17:55] nice [05:17:58] (03PS1) 10Marostegui: wmnet: Update es5-master alias [dns] - 10https://gerrit.wikimedia.org/r/609899 (https://phabricator.wikimedia.org/T255755) [05:18:01] good timing [05:18:14] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:18:57] PROBLEM - LVS ncredir esams port 80/tcp - Non canonical redirect service IPv6 #page on ncredir-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:18:59] leaving es5 disabled for a few more minutes should be no issue [05:19:15] mmm? [05:19:25] (03CR) 10Kormat: [C: 03+1] wmnet: Update es5-master alias [dns] - 10https://gerrit.wikimedia.org/r/609899 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui) [05:19:31] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by ProtocolError(Connection aborted., ConnectionResetError(104, Connection reset by peer)): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [05:19:44] <_joe_> we have some issues in esams? 
[05:19:49] RECOVERY - LVS ncredir esams port 80/tcp - Non canonical redirect service IPv6 #page on ncredir-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 159 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:19:57] <_joe_> I can surf just fine [05:20:02] yeah, me too [05:20:17] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [05:20:17] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:20:27] recovered already [05:20:30] <_joe_> I would suppose a network hiccup [05:20:44] <_joe_> let's try to confirm once I've had coffee [05:24:15] see _security actually [05:26:37] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Enable es5 writes T255755 (duration: 00m 56s) [05:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:41] T255755: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 [05:26:44] es5 enabled again [05:27:05] checking [05:27:13] writes appearing on binlog [05:28:04] waiting for prometheus lag [05:28:44] writes back high for es1024/es5 [05:28:59] yeah, binlog looking good [05:29:46] (03CR) 10Marostegui: [C: 03+2] wmnet: Update es5-master alias [dns] - 10https://gerrit.wikimedia.org/r/609899 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui) [05:30:24] (03CR) 10Ayounsi: [C: 03+1] "> Patch Set 1:" [dns] - 10https://gerrit.wikimedia.org/r/607741 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [05:31:45] so far, no errors [05:34:21] (03PS1) 10Marostegui: es1023: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/609900 (https://phabricator.wikimedia.org/T255755) [05:35:08] (03CR) 
10Marostegui: [C: 03+2] es1023: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/609900 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui) [05:37:43] everything looks nominal for es [05:52:39] (03PS1) 10Marostegui: install_server: Reimage es1023 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/609904 (https://phabricator.wikimedia.org/T255755) [05:53:18] I will be checking what was the issue with tendril [05:54:43] thank you [05:54:55] I see what, tendril and zarcillo are expected to be on the same host [05:55:06] so last time zarcillo failed [05:55:10] and this time it was tendril [05:55:12] fixing [05:55:48] I wrongly merged https://gerrit.wikimedia.org/r/c/operations/software/wmfmariadbpy/+/607725 [05:58:28] (03PS1) 10Jcrespo: switchover.py: Split zarcillo and tendril reference [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/609905 [05:58:37] (03CR) 10jerkins-bot: [V: 04-1] switchover.py: Split zarcillo and tendril reference [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/609905 (owner: 10Jcrespo) [06:00:20] (03PS2) 10Jcrespo: switchover.py: Split zarcillo and tendril reference [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/609905 [06:05:36] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage es1023 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/609904 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui) [06:06:57] (03CR) 10Marostegui: [C: 03+1] switchover.py: Split zarcillo and tendril reference [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/609905 (owner: 10Jcrespo) [06:10:08] (03CR) 10Jcrespo: [C: 03+2] switchover.py: Split zarcillo and tendril reference [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/609905 (owner: 10Jcrespo) [06:10:48] ^ marostegui make sure to rebase where you use it so the bug is fixed for next run [06:12:22] done! 
[06:12:23] thanks [06:18:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1079 T257216', diff saved to https://phabricator.wikimedia.org/P11760 and previous config saved to /var/cache/conftool/dbconfig/20200707-061849-marostegui.json [06:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:56] T257216: db1079 BBU crashed host rebooted - https://phabricator.wikimedia.org/T257216 [06:20:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give db1136 some weight back into main traffic T257216', diff saved to https://phabricator.wikimedia.org/P11761 and previous config saved to /var/cache/conftool/dbconfig/20200707-062008-marostegui.json [06:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:19] PROBLEM - ores on ores2008 is CRITICAL: connect to address 10.192.48.89 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [06:29:28] !log Reimage es1023 to Buster T255755 [06:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:33] T255755: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 [06:30:16] (03PS1) 10ZPapierski: Correct url and path for nginx OAuth 1.0a [puppet] - 10https://gerrit.wikimedia.org/r/609909 (https://phabricator.wikimedia.org/T251498) [06:30:27] (03PS2) 10ZPapierski: Correct url and path for nginx OAuth 1.0a [puppet] - 10https://gerrit.wikimedia.org/r/609909 (https://phabricator.wikimedia.org/T251498) [06:30:30] (03CR) 10jerkins-bot: [V: 04-1] Correct url and path for nginx OAuth 1.0a [puppet] - 10https://gerrit.wikimedia.org/r/609909 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [06:33:20] (03PS3) 10ZPapierski: Correct url and path for nginx OAuth 1.0a [puppet] - 10https://gerrit.wikimedia.org/r/609909 (https://phabricator.wikimedia.org/T251498) [06:34:54] RECOVERY - ores on ores2008 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 
bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [06:35:53] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde group - https://phabricator.wikimedia.org/T256201 (10jcrespo) One question, @guergana.tzatchkova you are only asking access to the wmde group, not the nda one, correct (no access to logstash, only the gerrit repos)? U... [06:37:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1079 and give more main weight to db1136 T257216', diff saved to https://phabricator.wikimedia.org/P11762 and previous config saved to /var/cache/conftool/dbconfig/20200707-063737-marostegui.json [06:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:43] T257216: db1079 BBU crashed host rebooted - https://phabricator.wikimedia.org/T257216 [06:58:57] (03PS7) 10Giuseppe Lavagetto: restbase: switch to using get_url [puppet] - 10https://gerrit.wikimedia.org/r/609154 (https://phabricator.wikimedia.org/T255133) [07:02:08] (03CR) 10Giuseppe Lavagetto: [C: 03+2] restbase: switch to using get_url [puppet] - 10https://gerrit.wikimedia.org/r/609154 (https://phabricator.wikimedia.org/T255133) (owner: 10Giuseppe Lavagetto) [07:05:19] (03CR) 10Elukey: [C: 03+2] Remove apt::pin for python3-prometheus-client-package [puppet] - 10https://gerrit.wikimedia.org/r/609420 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [07:06:13] it looks like no one’s using the deployment server or mwdebug1001 at the moment, so I’ll be testing some config changes for T257266 there [07:06:14] T257266: Missing Wikidata sitelinks on Commons categories - https://phabricator.wikimedia.org/T257266 [07:06:16] hope that’s okay [07:06:26] (read: ping me if it’s not ^^) [07:07:40] (03PS1) 10Elukey: Move analytics1041 (Druid test node) to Buster [puppet] - 10https://gerrit.wikimedia.org/r/609913 [07:08:13] pulled four reverts to mwdebug1001, testing [07:08:27] (03CR) 
10Elukey: [C: 03+2] Move analytics1041 (Druid test node) to Buster [puppet] - 10https://gerrit.wikimedia.org/r/609913 (owner: 10Elukey) [07:09:20] (03PS1) 10Elukey: Revert "Set BigTop for Hadoop master/standby/worker nodes." [puppet] - 10https://gerrit.wikimedia.org/r/609640 [07:09:50] (03CR) 10Elukey: [C: 03+2] Revert "Set BigTop for Hadoop master/standby/worker nodes." [puppet] - 10https://gerrit.wikimedia.org/r/609640 (owner: 10Elukey) [07:10:05] test looks very promising, uploading the reverts to gerrit and then syncing them [07:10:24] <_joe_> !log restart restbase on restbase1025 to pick up the switch to https for cxserver [07:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:40] (03PS1) 10Lucas Werkmeister (WMDE): Revert "Wikibase: Remove config option wmgUseEntitySourceBasedFederation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609641 (https://phabricator.wikimedia.org/T241975) [07:11:53] (03PS1) 10Marostegui: install_server: Do not format es10[12]* and es20[12]* [puppet] - 10https://gerrit.wikimedia.org/r/609914 (https://phabricator.wikimedia.org/T255755) [07:12:13] (03PS2) 10Lucas Werkmeister (WMDE): Revert "Wikibase: Remove config option wmgUseEntitySourceBasedFederation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609641 (https://phabricator.wikimedia.org/T241975) [07:12:21] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Revert "Wikibase: Remove config option wmgUseEntitySourceBasedFederation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609641 (https://phabricator.wikimedia.org/T241975) (owner: 10Lucas Werkmeister (WMDE)) [07:13:13] (03Merged) 10jenkins-bot: Revert "Wikibase: Remove config option wmgUseEntitySourceBasedFederation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609641 (https://phabricator.wikimedia.org/T241975) (owner: 10Lucas Werkmeister (WMDE)) [07:13:26] (03CR) 10Elukey: systemd/slice: Install systemd 241 from component/systemd241 (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/609419 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [07:15:01] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:609641|Revert "Wikibase: Remove config option wmgUseEntitySourceBasedFederation" (T241975, T257266)]] (duration: 00m 57s) [07:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:07] T241975: entitysources: Remove old MultiRepository & PerRepository Service containers and config - https://phabricator.wikimedia.org/T241975 [07:15:07] T257266: Missing Wikidata sitelinks on Commons categories - https://phabricator.wikimedia.org/T257266 [07:15:41] (03PS1) 10Lucas Werkmeister (WMDE): Revert "Wikibase: stop using wmgUseEntitySourceBasedFederation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609642 (https://phabricator.wikimedia.org/T241975) [07:15:53] (03PS2) 10Lucas Werkmeister (WMDE): Revert "Wikibase: stop using wmgUseEntitySourceBasedFederation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609642 (https://phabricator.wikimedia.org/T241975) [07:15:58] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Revert "Wikibase: stop using wmgUseEntitySourceBasedFederation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609642 (https://phabricator.wikimedia.org/T241975) (owner: 10Lucas Werkmeister (WMDE)) [07:16:24] !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster [07:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:36] (03CR) 10Kormat: [C: 03+1] "LGTM, though i'd prefer to make this not try to be so specific." 
[puppet] - 10https://gerrit.wikimedia.org/r/609914 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui) [07:16:46] (03Merged) 10jenkins-bot: Revert "Wikibase: stop using wmgUseEntitySourceBasedFederation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609642 (https://phabricator.wikimedia.org/T241975) (owner: 10Lucas Werkmeister (WMDE)) [07:17:38] (03CR) 10Marostegui: [C: 03+2] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/609914 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui) [07:19:10] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:609642|Revert "Wikibase: stop using wmgUseEntitySourceBasedFederation" (T241975, T257266)]] (duration: 00m 55s) [07:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:48] (03PS1) 10Lucas Werkmeister (WMDE): Revert "Wikidata client wikis: Define entity sources configuration (take 2)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609643 (https://phabricator.wikimedia.org/T254315) [07:19:58] (03PS2) 10Lucas Werkmeister (WMDE): Revert "Wikidata client wikis: Define entity sources configuration (take 2)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609643 (https://phabricator.wikimedia.org/T254315) [07:20:03] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Revert "Wikidata client wikis: Define entity sources configuration (take 2)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609643 (https://phabricator.wikimedia.org/T254315) (owner: 10Lucas Werkmeister (WMDE)) [07:20:48] (03Merged) 10jenkins-bot: Revert "Wikidata client wikis: Define entity sources configuration (take 2)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609643 (https://phabricator.wikimedia.org/T254315) (owner: 10Lucas Werkmeister (WMDE)) [07:21:24] (03PS2) 10Muehlenhoff: systemd/slice: Install systemd 241 from component/systemd241 [puppet] - 10https://gerrit.wikimedia.org/r/609419 
(https://phabricator.wikimedia.org/T256877) [07:21:53] (03CR) 10Muehlenhoff: systemd/slice: Install systemd 241 from component/systemd241 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609419 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [07:23:13] !log lucaswerkmeister-wmde@deploy1001 Synchronized dblists/wikidataclient.dblist: Config: [[gerrit:609643|Revert "Wikidata client wikis: Define entity sources configuration (take 2)" (T254315, T257266)]] (duration: 00m 56s) [07:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:18] T254315: entitysources: Directly create entitySources config for WMF production wikis - https://phabricator.wikimedia.org/T254315 [07:23:19] T257266: Missing Wikidata sitelinks on Commons categories - https://phabricator.wikimedia.org/T257266 [07:24:14] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) [07:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:23] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/: Config: [[gerrit:609643|Revert "Wikidata client wikis: Define entity sources configuration (take 2)" (T254315, T257266)]] (duration: 00m 56s) [07:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:06] 10Operations, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10MoritzMuehlenhoff) [07:25:10] (03PS1) 10Lucas Werkmeister (WMDE): Revert "Commons: Define entity sources configuration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609644 (https://phabricator.wikimedia.org/T256906) [07:25:20] (03PS2) 10Lucas Werkmeister (WMDE): Revert "Commons: Define entity sources configuration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609644 (https://phabricator.wikimedia.org/T256906) [07:25:26] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Revert "Commons: Define entity sources configuration" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/609644 (https://phabricator.wikimedia.org/T256906) (owner: 10Lucas Werkmeister (WMDE)) [07:26:10] (03Merged) 10jenkins-bot: Revert "Commons: Define entity sources configuration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609644 (https://phabricator.wikimedia.org/T256906) (owner: 10Lucas Werkmeister (WMDE)) [07:27:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1079 and give more main weight to db1136 T257216', diff saved to https://phabricator.wikimedia.org/P11764 and previous config saved to /var/cache/conftool/dbconfig/20200707-072703-marostegui.json [07:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:14] T257216: db1079 BBU crashed host rebooted - https://phabricator.wikimedia.org/T257216 [07:27:45] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:609644|Revert "Commons: Define entity sources configuration" (T256906, T256907, T256909, T254315, T257266)]] (duration: 00m 53s) [07:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:53] T256909: Call to a member function getSourceName() on null, when deploying entity source config to wikidata clients - https://phabricator.wikimedia.org/T256909 [07:27:53] T256906: No namespace configured for MediaInfo entities, when deploying entity source config to wikidata clients - https://phabricator.wikimedia.org/T256906 [07:27:53] T256907: Call to a member function getDatabaseName() on null, when deploying entity source config to wikidata clients - https://phabricator.wikimedia.org/T256907 [07:27:58] !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro [07:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:57] <_joe_> jayme: it seems restbase fails to work with proton via https [07:29:10] <_joe_> I think you did set that up, correct? 
[07:30:43] proton is chromium-render, right?
[07:30:47] <_joe_> yes
[07:30:49] (03PS4) 10Muehlenhoff: ntp: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/605071
[07:30:49] (sorry, no coffee yet)
[07:31:05] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:609644|Revert "Commons: Define entity sources configuration" (T256906, T256907, T256909, T254315, T257266)]] (forgot to git rebase so the last sync was a no-op) (duration: 00m 56s)
[07:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:13] T257266: Missing Wikidata sitelinks on Commons categories - https://phabricator.wikimedia.org/T257266
[07:31:13] T254315: entitysources: Directly create entitySources config for WMF production wikis - https://phabricator.wikimedia.org/T254315
[07:31:56] _joe_: I don't think that it has a TLS LVS right now
[07:31:58] <_joe_> !log restarting restbase on restbase1025, reaching proton via envoy for now
[07:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:06] <_joe_> jayme: how can I reach it then?
[07:32:39] _joe_: good question https://phabricator.wikimedia.org/T255877 [07:32:42] <_joe_> curl https://proton.discovery.wmnet:4030/?spec works [07:33:34] _joe_: fbee4a768b10e7e405e6ecf64ace2062004c5c36 in puppet [07:34:12] looks like Alex set it up as part of T225680 [07:34:13] T225680: Migrate Proton to k8s - https://phabricator.wikimedia.org/T225680 [07:34:33] <_joe_> ipvsadm -Lt 10.2.2.21:4030 says it's up on lvs1015 indeed :) [07:35:10] <_joe_> so I get 500s via envoy too [07:35:21] <_joe_> I guess there is something "funny" [07:36:07] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (10JMeybohm) Proton has a TLS LVS already via fbee4a768b10e7e405e6ecf64ace2062004c5c36 / T225680 [07:36:20] <_joe_> possibly these time out with the default timeout we have to the backend? [07:36:40] <_joe_> I have no idea, but it needs to be debugged [07:36:52] <_joe_> I'm switching back to non-https for proton :/ [07:37:12] (03PS1) 10Kormat: es2021: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/609965 (https://phabricator.wikimedia.org/T257284) [07:37:32] (03CR) 10Marostegui: [C: 03+1] es2021: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/609965 (https://phabricator.wikimedia.org/T257284) (owner: 10Kormat) [07:38:01] _joe_: in T225680 the bulletpoint "Switchover traffic from proton hosts to kubernetes" is still unchecked... [07:38:02] 20s? yikes [07:38:22] marostegui: just how much caffeine have you had this morning? 
:) [07:38:31] (03CR) 10Kormat: [C: 03+2] es2021: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/609965 (https://phabricator.wikimedia.org/T257284) (owner: 10Kormat) [07:38:32] <_joe_> ohhh so the http endpoint still goes to the old machines [07:38:32] PROBLEM - Stale file for node-exporter textfile in eqiad on icinga1001 is CRITICAL: cluster=analytics file=nic_firmware.prom instance=analytics1030 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [07:38:41] kormat: none! [07:38:45] _joe_: guess so [07:38:48] marostegui: that's... even scarier [07:38:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [07:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:04] <_joe_> yeah jayme that's it [07:39:06] <_joe_> ok thanks [07:39:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1079 and db1136 T257216', diff saved to https://phabricator.wikimedia.org/P11765 and previous config saved to /var/cache/conftool/dbconfig/20200707-073918-marostegui.json [07:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:22] T257216: db1079 BBU crashed host rebooted - https://phabricator.wikimedia.org/T257216 [07:39:23] * jayme getting some coffee now [07:40:14] 10Operations, 10DBA: db1079 BBU crashed host rebooted - https://phabricator.wikimedia.org/T257216 (10Marostegui) 05Open→03Resolved db1079 fully repooled, db1136 also got its original weight restored. All done! Thank you, John, for replacing the BBU so fast! [07:40:17] (03PS1) 10Kormat: install_server: Switch es2021 to buster. 
[puppet] - 10https://gerrit.wikimedia.org/r/609966 (https://phabricator.wikimedia.org/T257284) [07:40:57] (03PS1) 10Giuseppe Lavagetto: restbase: still use non-encrypted proton endpoint [puppet] - 10https://gerrit.wikimedia.org/r/609967 [07:41:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:39] (03CR) 10Giuseppe Lavagetto: [C: 03+2] restbase: still use non-encrypted proton endpoint [puppet] - 10https://gerrit.wikimedia.org/r/609967 (owner: 10Giuseppe Lavagetto) [07:41:42] marostegui: it's been over a minute this time - are you ok? ;) [07:42:02] (03CR) 10Marostegui: [C: 03+1] install_server: Switch es2021 to buster. [puppet] - 10https://gerrit.wikimedia.org/r/609966 (https://phabricator.wikimedia.org/T257284) (owner: 10Kormat) [07:42:05] I was trying to behave [07:42:08] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] restbase: still use non-encrypted proton endpoint [puppet] - 10https://gerrit.wikimedia.org/r/609967 (owner: 10Giuseppe Lavagetto) [07:42:13] ah haha [07:42:27] (03CR) 10Kormat: [C: 03+2] install_server: Switch es2021 to buster. 
[puppet] - 10https://gerrit.wikimedia.org/r/609966 (https://phabricator.wikimedia.org/T257284) (owner: 10Kormat) [07:43:21] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 51 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:44:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1074 for schema change', diff saved to https://phabricator.wikimedia.org/P11766 and previous config saved to /var/cache/conftool/dbconfig/20200707-074435-marostegui.json [07:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:20] <_joe_> !log restarting restbase again on rb1025 [07:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:37] jouncebot: now [07:45:37] No deployments scheduled for the next 3 hour(s) and 14 minute(s) [07:48:29] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 44 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:50:21] !log Stop MySQL on db1074 to deploy schema change and remove triggers - T238966 [07:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:26] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. 
- https://phabricator.wikimedia.org/T238966 [07:51:24] I will upgrade and restart the CI Jenkins in 10 minutes [07:51:27] for a new LTS version [07:51:41] jobs will eventually restart automatically [07:51:52] (03CR) 10Privacybatm: [C: 04-1] Transferer.py: Calculate source checksum parallel to the data transfer (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/608640 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [07:52:34] (03PS1) 10Addshore: Wikibase: stop using wmgUseEntitySourceBasedFederation (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609645 (https://phabricator.wikimedia.org/T241975) [07:53:06] (03PS1) 10Addshore: Wikibase: Remove config option wmgUseEntitySourceBasedFederation (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609986 (https://phabricator.wikimedia.org/T241975) [07:53:13] (03PS2) 10Addshore: Wikibase: Remove config option wmgUseEntitySourceBasedFederation (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609986 (https://phabricator.wikimedia.org/T241975) [07:53:47] (03PS1) 10Addshore: Commons: Define entity sources configuration (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609987 [07:54:01] (03PS2) 10Addshore: Commons: Define entity sources configuration (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609987 (https://phabricator.wikimedia.org/T256906) [07:54:08] (03PS3) 10Addshore: Wikibase: Remove config option wmgUseEntitySourceBasedFederation (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609986 (https://phabricator.wikimedia.org/T241975) [07:54:11] (03PS2) 10Addshore: Wikibase: stop using wmgUseEntitySourceBasedFederation (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609645 (https://phabricator.wikimedia.org/T241975) [07:54:18] (03CR) 10Addshore: [V: 04-1 C: 04-2] Commons: Define entity sources configuration (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609987 
(https://phabricator.wikimedia.org/T256906) (owner: 10Addshore) [07:54:28] (03PS1) 10Marostegui: es1023: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/609969 (https://phabricator.wikimedia.org/T255755) [07:55:06] (03PS1) 10Addshore: Wikidata client wikis: Define entity sources configuration (take 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609988 (https://phabricator.wikimedia.org/T254315) [07:55:12] (03PS2) 10Addshore: Wikidata client wikis: Define entity sources configuration (take 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609988 (https://phabricator.wikimedia.org/T254315) [07:55:16] (03CR) 10jerkins-bot: [V: 04-1] Wikidata client wikis: Define entity sources configuration (take 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609988 (https://phabricator.wikimedia.org/T254315) (owner: 10Addshore) [07:55:20] (03CR) 10Addshore: [V: 04-1 C: 04-2] Wikidata client wikis: Define entity sources configuration (take 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609988 (https://phabricator.wikimedia.org/T254315) (owner: 10Addshore) [07:55:22] (03CR) 10jerkins-bot: [V: 04-1] Wikidata client wikis: Define entity sources configuration (take 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609988 (https://phabricator.wikimedia.org/T254315) (owner: 10Addshore) [07:55:36] (03PS3) 10Addshore: Wikibase: stop using wmgUseEntitySourceBasedFederation (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609645 (https://phabricator.wikimedia.org/T241975) [07:55:41] (03PS4) 10Addshore: Wikibase: Remove config option wmgUseEntitySourceBasedFederation (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609986 (https://phabricator.wikimedia.org/T241975) [07:55:43] (sorry for the spam) [07:58:01] (03PS3) 10Addshore: Commons: Define entity sources configuration (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609987 (https://phabricator.wikimedia.org/T256906) [07:58:29] (03PS3) 
10Addshore: Wikidata client wikis: Define entity sources configuration (take 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609988 (https://phabricator.wikimedia.org/T254315) [07:59:24] (03CR) 10Marostegui: [C: 03+2] es1023: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/609969 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui) [08:08:47] (03PS1) 10Addshore: Enable sitelinks to testcommons from test wikidata sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609971 (https://phabricator.wikimedia.org/T257266) [08:09:15] !log kormat@cumin1001 dbctl commit (dc=all): 'Depool es2021 for reimaging T257284', diff saved to https://phabricator.wikimedia.org/P11767 and previous config saved to /var/cache/conftool/dbconfig/20200707-080914-kormat.json [08:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:20] T257284: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284 [08:10:41] (03PS2) 10Addshore: Enable sitelinks to testcommons from test wikidata sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609971 (https://phabricator.wikimedia.org/T257266) [08:12:17] !log cr4-ulsfo> request vmhost snapshot - T257153 [08:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:22] T257153: Investigate Junos vmhost snapshot - https://phabricator.wikimedia.org/T257153 [08:13:02] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Enable sitelinks to testcommons from test wikidata sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609971 (https://phabricator.wikimedia.org/T257266) (owner: 10Addshore) [08:13:51] hashar: how did the jerkins restart go? :) [08:13:57] doing it ;) [08:14:35] cool! I'm looking to schedule an impromptu deploy slot to revert some reverts, but I'll wait until jenkins is all happy and green and not backlogged! 
[08:15:06] !log upgrading and restart CI Jenkins on contint2001 # T256978 [08:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:12] T256978: Upgrade Jenkins instances to 2.235.1 - https://phabricator.wikimedia.org/T256978 [08:15:22] !log cr3-knams> request vmhost snapshot - T257153 [08:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:39] (03CR) 10Addshore: [C: 03+1] "diff looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609971 (https://phabricator.wikimedia.org/T257266) (owner: 10Addshore) [08:16:09] Jul 07 08:15:42 contint2001 systemd[1]: jenkins.service: Current command vanished from the unit file, execution of the command list won't be resumed. [08:16:12] fun message [08:17:26] !log cr2-eqdfw> request vmhost snapshot - T257153 [08:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:31] T257153: Investigate Junos vmhost snapshot - https://phabricator.wikimedia.org/T257153 [08:18:45] addshore: surprisingly Jenkins seems to still be working ;] [08:18:48] !log elukey@cumin1001 END (ERROR) - Cookbook sre.hadoop.change-distro (exit_code=97) [08:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1023 after reimage T255755', diff saved to https://phabricator.wikimedia.org/P11768 and previous config saved to /var/cache/conftool/dbconfig/20200707-081909-marostegui.json [08:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:16] T255755: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 [08:19:52] !log cr2-eqord> request vmhost snapshot - T257153 [08:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:52] 10Operations, 10Graphoid, 10Code-Stewardship-Reviews, 10Release-Engineering-Team (Code Health), and 2 others: graphoid: 
Code stewardship request - https://phabricator.wikimedia.org/T211881 (10Legoktm) For reference, all of the current discussion seems to be taking place in {T249419}. [08:22:36] (03PS56) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 (https://phabricator.wikimedia.org/T254248) [08:22:46] !log cr2-eqsin> request vmhost snapshot - T257153 [08:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:51] T257153: Investigate Junos vmhost snapshot - https://phabricator.wikimedia.org/T257153 [08:23:22] (03PS7) 10Giuseppe Lavagetto: restbase: use the services proxy for everything but parsoid-php [puppet] - 10https://gerrit.wikimedia.org/r/609155 (https://phabricator.wikimedia.org/T255133) [08:26:13] !log cr2-codfw> request vmhost snapshot routing-engine both - T257153 [08:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:25] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde group - https://phabricator.wikimedia.org/T256201 (10guergana.tzatchkova) >>! In T256201#6284285, @jcrespo wrote: > One question, @guergana.tzatchkova you are only asking access to the wmde group, not the nda one, cor... [08:30:34] !log kormat@cumin2001 START - Cookbook sre.hosts.downtime [08:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:48] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T256201 (10Addshore) [08:30:50] (03PS1) 10Elukey: sre.hadoop.change-distro: improve procedure and logging [cookbooks] - 10https://gerrit.wikimedia.org/r/609975 (https://phabricator.wikimedia.org/T244499) [08:30:59] 10Operations, 10netops: Investigate Junos vmhost snapshot - https://phabricator.wikimedia.org/T257153 (10ayounsi) Left to do: cr1/2-eqiad. 
[08:31:02] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T256201 (10Addshore) [08:31:10] 10Operations, 10LDAP-Access-Requests: Add Conny Kawohl to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T257038 (10Addshore) [08:31:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1023 after reimage T255755', diff saved to https://phabricator.wikimedia.org/P11769 and previous config saved to /var/cache/conftool/dbconfig/20200707-083144-marostegui.json [08:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:49] T255755: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 [08:33:08] !log kormat@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:18] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T256201 (10Addshore) [08:34:06] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T256201 (10Addshore) Indeed this should be both the nda group and the wmde group. (from memory we need to sign an and to be in the wmde group anyway). Confirmati... 
[08:34:12] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/609835 (owner: 10CRusnov) [08:43:49] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10akosiaris) p:05Triage→03Medium a:03jijiki [08:54:49] (03CR) 10Jcrespo: "Some options below" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/608640 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [08:56:18] (03PS1) 10Kormat: es2021: Re-enable notifications. [puppet] - 10https://gerrit.wikimedia.org/r/609977 (https://phabricator.wikimedia.org/T257284) [08:56:28] (03PS4) 10Jbond: java: update java.security to support specifying different EDG's [puppet] - 10https://gerrit.wikimedia.org/r/609783 [08:56:56] (03PS4) 10Jbond: profile::idp: enable java::Security [puppet] - 10https://gerrit.wikimedia.org/r/609784 [08:57:20] (03PS2) 10Elukey: sre.hadoop.change-distro: improve procedure and logging [cookbooks] - 10https://gerrit.wikimedia.org/r/609975 (https://phabricator.wikimedia.org/T244499) [08:58:11] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T256201 (10jcrespo) >>! In T256201#6284655, @Addshore wrote: > Indeed this should be both the nda group and the wmde group. > (from memory we need to sign an and... 
[08:58:57] (03PS5) 10Jbond: profile::idp: enable java::Security [puppet] - 10https://gerrit.wikimedia.org/r/609784 [08:59:16] (03PS8) 10Giuseppe Lavagetto: restbase: use the services proxy for everything but parsoid-php [puppet] - 10https://gerrit.wikimedia.org/r/609155 (https://phabricator.wikimedia.org/T255133) [08:59:19] (03PS1) 10Giuseppe Lavagetto: wmflib::service::get_url: fix lookup for listeners [puppet] - 10https://gerrit.wikimedia.org/r/609978 [08:59:21] (03CR) 10Elukey: [C: 03+2] sre.hadoop.change-distro: improve procedure and logging [cookbooks] - 10https://gerrit.wikimedia.org/r/609975 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey) [09:02:13] (03PS1) 10Ayounsi: esams: set prepending [homer/public] - 10https://gerrit.wikimedia.org/r/609980 [09:02:19] (03PS9) 10Giuseppe Lavagetto: restbase: use the services proxy for everything but parsoid-php [puppet] - 10https://gerrit.wikimedia.org/r/609155 (https://phabricator.wikimedia.org/T255133) [09:05:06] (03CR) 10Ayounsi: [C: 03+2] esams: set prepending [homer/public] - 10https://gerrit.wikimedia.org/r/609980 (owner: 10Ayounsi) [09:05:29] (03Merged) 10jenkins-bot: esams: set prepending [homer/public] - 10https://gerrit.wikimedia.org/r/609980 (owner: 10Ayounsi) [09:05:46] (03CR) 10JMeybohm: [C: 04-1] ""common_templates/0.1/_tls_helpers.tpl" should no longer be used (and not be modified)." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: 10MSantos) [09:06:59] (03CR) 10Privacybatm: "> Patch Set 4:" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/608640 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [09:07:11] (03PS5) 10Jbond: java: update java.security to support specifying different EDG's [puppet] - 10https://gerrit.wikimedia.org/r/609783 [09:07:37] (03CR) 10Urbanecm: [C: 04-2] "> Patch Set 5: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599492 (https://phabricator.wikimedia.org/T253802) (owner: 10Huji) [09:07:52] (03PS6) 10Jbond: profile::idp: enable java::Security [puppet] - 10https://gerrit.wikimedia.org/r/609784 [09:08:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/609419 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [09:10:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1023 after reimage T255755', diff saved to https://phabricator.wikimedia.org/P11770 and previous config saved to /var/cache/conftool/dbconfig/20200707-091015-marostegui.json [09:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:21] T255755: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 [09:14:01] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (10Peachey88) [09:14:37] (03CR) 10Jcrespo: "> For the time being, as mktemp automatically create a file at temp folder and return the path, shall I move ahead with mktemp then?" 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/608640 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [09:15:50] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/23722/restbase-dev1006.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/609155 (https://phabricator.wikimedia.org/T255133) (owner: 10Giuseppe Lavagetto) [09:16:07] (03CR) 10Giuseppe Lavagetto: [C: 03+2] wmflib::service::get_url: fix lookup for listeners [puppet] - 10https://gerrit.wikimedia.org/r/609978 (owner: 10Giuseppe Lavagetto) [09:17:42] (03PS6) 10Jbond: java: update java.security to support specifying different EDG's [puppet] - 10https://gerrit.wikimedia.org/r/609783 [09:18:11] (03CR) 10jerkins-bot: [V: 04-1] java: update java.security to support specifying different EDG's [puppet] - 10https://gerrit.wikimedia.org/r/609783 (owner: 10Jbond) [09:18:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Would it be possible to drop the mysql service to avoid future confusions?" [puppet] - 10https://gerrit.wikimedia.org/r/609834 (https://phabricator.wikimedia.org/T257231) (owner: 10Andrew Bogott) [09:18:46] (03CR) 10Privacybatm: "> Patch Set 4:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/608640 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [09:20:05] 10Operations, 10WMF-Annual-Report, 10Patch-For-Review: Update annual.wikimedia.org redirect to point to 2019 Annual Report - https://phabricator.wikimedia.org/T257257 (10Peachey88) [09:20:46] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (10Marostegui) @wiki_willy this host is under warranty, can we get a new disk for it? ` [35898752.940170] megaraid_sas 0000:18:00.0: 726 (647382021s/0x0001/CRIT) - VD 00/0 is now DEGRADED [35898999.592143] m... 
[09:21:20] 10Operations, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10aborrero) @MoritzMuehlenhoff I'm now thinking this is going to happen with every single debian release (archival of the backports repo). Perhaps a more sustainable way to approach this is to mir... [09:22:06] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (10Marostegui) p:05Triage→03High This is s6 primary database master [09:22:24] (03CR) 10Jcrespo: "I have a gut sense that this fixes something but also breaks something, but let me test it in case I am wrong." [software/transferpy] - 10https://gerrit.wikimedia.org/r/609778 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [09:23:19] (03CR) 10Jforrester: [C: 03+2] Branch commit for wmf/1.35.0-wmf.40 [core] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/609892 (https://phabricator.wikimedia.org/T256668) (owner: 10TrainBranchBot) [09:23:33] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (10Marostegui) Controller's log in case it is needed to get the RMA: ` seqNum: 0x000002d1 Time: Mon Jul 6 20:20:21 2020 Code: 0x0000010c Class: 1 Locale: 0x02 Event Description: PD 00(e0x20/s0) Path 500056... 
[09:23:38] (03PS7) 10Jbond: java: update java.security to support specifying different EDG's [puppet] - 10https://gerrit.wikimedia.org/r/609783 [09:23:55] (03PS7) 10Jbond: profile::idp: enable java::Security [puppet] - 10https://gerrit.wikimedia.org/r/609784 [09:23:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool es1023 after reimage T255755', diff saved to https://phabricator.wikimedia.org/P11771 and previous config saved to /var/cache/conftool/dbconfig/20200707-092357-marostegui.json [09:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:02] T255755: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 [09:24:10] jouncebot: next [09:24:10] In 0 hour(s) and 5 minute(s): Reverting some reverts & config change (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200707T0930) [09:24:29] !log 1.35.0-wmf.40 was branched at 88ecd6df00a46e432c06c1cf40d5098128abc4d8 for T256668 [09:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:33] T256668: 1.35.0-wmf.40 deployment blockers - https://phabricator.wikimedia.org/T256668 [09:25:34] (03CR) 10Ammarpad: Rename WPBSkinBlacklist to WPBSkinDisabled (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609490 (https://phabricator.wikimedia.org/T254675) (owner: 10Peter.ovchyn) [09:26:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove weight from es1024 as it is the current master T255755', diff saved to https://phabricator.wikimedia.org/P11772 and previous config saved to /var/cache/conftool/dbconfig/20200707-092635-marostegui.json [09:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:09] (03PS8) 10Jbond: java: update java.security to support specifying different EDG's [puppet] - 10https://gerrit.wikimedia.org/r/609783 [09:28:30] (03PS4) 10Alexandros Kosiaris: proton: Switch restbase production to TLS [puppet] - 
10https://gerrit.wikimedia.org/r/607536 (https://phabricator.wikimedia.org/T225680) [09:28:44] (03CR) 10jerkins-bot: [V: 04-1] proton: Switch restbase production to TLS [puppet] - 10https://gerrit.wikimedia.org/r/607536 (https://phabricator.wikimedia.org/T225680) (owner: 10Alexandros Kosiaris) [09:28:53] (03CR) 10Elukey: [C: 03+1] systemd/slice: Install systemd 241 from component/systemd241 [puppet] - 10https://gerrit.wikimedia.org/r/609419 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [09:29:19] (03CR) 10Jcrespo: "I thought it was going to not fail with 3 colons, but it caught it." [software/transferpy] - 10https://gerrit.wikimedia.org/r/609778 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [09:29:37] (03CR) 10Marostegui: [C: 03+1] es2021: Re-enable notifications. [puppet] - 10https://gerrit.wikimedia.org/r/609977 (https://phabricator.wikimedia.org/T257284) (owner: 10Kormat) [09:30:05] addshore: That opportune time is upon us again. Time for a Reverting some reverts & config change deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200707T0930). [09:30:05] addshore: A patch you scheduled for Reverting some reverts & config change is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [09:30:40] (03CR) 10Kormat: [C: 03+2] es2021: Re-enable notifications. 
[puppet] - 10https://gerrit.wikimedia.org/r/609977 (https://phabricator.wikimedia.org/T257284) (owner: 10Kormat) [09:30:41] (03CR) 10Addshore: [C: 03+2] Wikibase: stop using wmgUseEntitySourceBasedFederation (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609645 (https://phabricator.wikimedia.org/T241975) (owner: 10Addshore) [09:31:30] (03Merged) 10jenkins-bot: Wikibase: stop using wmgUseEntitySourceBasedFederation (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609645 (https://phabricator.wikimedia.org/T241975) (owner: 10Addshore) [09:32:44] (03CR) 10Privacybatm: "> Patch Set 2:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/609778 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [09:33:38] (03CR) 10Jbond: "updated" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609783 (owner: 10Jbond) [09:33:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3315', diff saved to https://phabricator.wikimedia.org/P11773 and previous config saved to /var/cache/conftool/dbconfig/20200707-093345-marostegui.json [09:33:45] <_joe_> !log depooling restbase1025 while we fix the troubled relationship between envoy and proton [09:33:46] (03PS8) 10Jbond: profile::idp: enable java::Security [puppet] - 10https://gerrit.wikimedia.org/r/609784 [09:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:55] !log addshore@deploy1001 Synchronized wmf-config/Wikibase.php: [[gerrit:609645]] T257266 T241975 Wikibase: stop using wmgUseEntitySourceBasedFederation (take2) (duration: 00m 59s) [09:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:01] T241975: entitysources: Remove old MultiRepository & PerRepository Service containers and config - https://phabricator.wikimedia.org/T241975 [09:34:01] T257266: Missing Wikidata sitelinks on Commons categories - 
https://phabricator.wikimedia.org/T257266 [09:34:04] (03CR) 10Addshore: [C: 03+2] Wikibase: Remove config option wmgUseEntitySourceBasedFederation (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609986 (https://phabricator.wikimedia.org/T241975) (owner: 10Addshore) [09:34:47] !log bounce logstash on logstash1023 [09:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:58] (03Merged) 10jenkins-bot: Wikibase: Remove config option wmgUseEntitySourceBasedFederation (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609986 (https://phabricator.wikimedia.org/T241975) (owner: 10Addshore) [09:36:13] <_joe_> !log applying the new configuration using the service proxy to restbase2009 too [09:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:24] <_joe_> !log errata: restbase2010, not 2009 [09:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:15] !log addshore@deploy1001 Synchronized wmf-config: [[gerrit:609986]] T257266 T241975 Wikibase: Remove config option wmgUseEntitySourceBasedFederation (take2) (duration: 00m 57s) [09:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:29] (03PS3) 10Addshore: Enable sitelinks to testcommons from test wikidata sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609971 (https://phabricator.wikimedia.org/T257266) [09:37:47] (03CR) 10Addshore: [C: 03+2] Enable sitelinks to testcommons from test wikidata sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609971 (https://phabricator.wikimedia.org/T257266) (owner: 10Addshore) [09:38:33] (03Merged) 10jenkins-bot: Enable sitelinks to testcommons from test wikidata sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609971 (https://phabricator.wikimedia.org/T257266) (owner: 10Addshore) [09:40:17] !log kormat@cumin1001 dbctl commit (dc=all): 'Repool es2021 after reimaging T257284', diff saved to 
https://phabricator.wikimedia.org/P11774 and previous config saved to /var/cache/conftool/dbconfig/20200707-094017-kormat.json [09:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:22] T257284: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284 [09:40:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "thanks!" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/609855 (https://phabricator.wikimedia.org/T234617) (owner: 10BryanDavis) [09:40:34] 10Operations, 10Wikimedia-Logstash: No logs ingested in logstash7 since 2020-07-06 19:23 - https://phabricator.wikimedia.org/T257294 (10fgiunchedi) [09:41:01] (03CR) 10Hnowlan: [C: 03+1] proton: Switch restbase production to TLS [puppet] - 10https://gerrit.wikimedia.org/r/607536 (https://phabricator.wikimedia.org/T225680) (owner: 10Alexandros Kosiaris) [09:41:18] (03PS9) 10Jbond: profile::idp: enable java::Security [puppet] - 10https://gerrit.wikimedia.org/r/609784 [09:41:27] 10Operations, 10Wikimedia-Mailing-lists, 10Cloud-VPS (Project-requests), 10cloud-services-team (Kanban): Request creation of mailman VPS project - https://phabricator.wikimedia.org/T257270 (10aborrero) p:05Triage→03Medium Will talk about this in our next team meeting. 
[09:41:44] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:42:50] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:609971]] T257266 Enable sitelinks to testcommons from test wikidata sites (duration: 00m 56s) [09:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:55] T257266: Missing Wikidata sitelinks on Commons categories - https://phabricator.wikimedia.org/T257266 [09:43:34] (03Merged) 10jenkins-bot: Branch commit for wmf/1.35.0-wmf.40 [core] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/609892 (https://phabricator.wikimedia.org/T256668) (owner: 10TrainBranchBot) [09:50:40] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:51:07] (03PS9) 10Jbond: java: update java.security to support specifying different EDG's [puppet] - 10https://gerrit.wikimedia.org/r/609783 [09:51:09] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Seems sensible to me, my comment can be acted upon or not." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607026 (https://phabricator.wikimedia.org/T255511) (owner: 10Elukey) [09:51:23] (03PS10) 10Jbond: profile::idp: enable java::Security [puppet] - 10https://gerrit.wikimedia.org/r/609784 [09:53:42] (03CR) 10Elukey: profile::mediawiki::mcrouter_wancache: send probe after 60s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607026 (https://phabricator.wikimedia.org/T255511) (owner: 10Elukey) [09:54:26] (03PS10) 10Jbond: java: update java.security to support specifying different EDG's [puppet] - 10https://gerrit.wikimedia.org/r/609783 [09:54:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1096:3315', diff saved to https://phabricator.wikimedia.org/P11775 and previous config saved to /var/cache/conftool/dbconfig/20200707-095428-marostegui.json [09:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:34] (03PS11) 10Jbond: profile::idp: enable java::Security [puppet] - 10https://gerrit.wikimedia.org/r/609784 [09:54:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1144:3315', diff saved to https://phabricator.wikimedia.org/P11776 and previous config saved to /var/cache/conftool/dbconfig/20200707-095443-marostegui.json [09:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:30] (03PS1) 10JMeybohm: chartmuseum: fix typo in config template [puppet] - 10https://gerrit.wikimedia.org/r/609984 (https://phabricator.wikimedia.org/T253843) [09:55:42] (03CR) 10jerkins-bot: [V: 04-1] java: update java.security to support specifying different EDG's [puppet] - 10https://gerrit.wikimedia.org/r/609783 (owner: 10Jbond) [09:55:44] (03PS1) 10Addshore: Make testcommonswiki a testwikidata client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609985 (https://phabricator.wikimedia.org/T257266) [09:56:20] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/609784 (owner: 10Jbond) [09:56:22] (03CR) 
10JMeybohm: [C: 03+2] chartmuseum: fix typo in config template [puppet] - 10https://gerrit.wikimedia.org/r/609984 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [09:57:19] (03CR) 10Muehlenhoff: java: update java.security to support specifying different EDG's (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609783 (owner: 10Jbond) [09:58:49] (03PS2) 10Addshore: Make testcommonswiki a testwikidata client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609985 (https://phabricator.wikimedia.org/T257266) [09:59:46] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:59:59] (03PS1) 10Elukey: Remove archiva1001 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/610006 (https://phabricator.wikimedia.org/T252767) [10:01:00] * addshore is deploying one more thing [10:01:00] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:01:01] (03CR) 10Elukey: [C: 03+2] Remove archiva1001 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/610006 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [10:02:53] (03PS11) 10Jbond: java: update java.security to support specifying different EDG's [puppet] - 10https://gerrit.wikimedia.org/r/609783 [10:02:57] (03CR) 10Jbond: java: update java.security to support specifying different EDG's (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609783 (owner: 10Jbond) [10:03:00] (03PS1) 10Giuseppe Lavagetto: restbase: stop using envoy for proton [puppet] - 10https://gerrit.wikimedia.org/r/610007 [10:03:02] (03CR) 10Addshore: [C: 03+2] Make testcommonswiki a testwikidata client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609985 (https://phabricator.wikimedia.org/T257266) (owner: 10Addshore) 
[10:03:06] (03PS12) 10Jbond: profile::idp: enable java::Security [puppet] - 10https://gerrit.wikimedia.org/r/609784 [10:03:06] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [10:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:18] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [10:03:18] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [10:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1144:3315', diff saved to https://phabricator.wikimedia.org/P11777 and previous config saved to /var/cache/conftool/dbconfig/20200707-100328-marostegui.json [10:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:59] (03Merged) 10jenkins-bot: Make testcommonswiki a testwikidata client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609985 (https://phabricator.wikimedia.org/T257266) (owner: 10Addshore) [10:05:14] (03CR) 10Giuseppe Lavagetto: [C: 03+2] restbase: stop using envoy for proton [puppet] - 10https://gerrit.wikimedia.org/r/610007 (owner: 10Giuseppe Lavagetto) [10:05:59] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [10:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:06] !log decommission archiva1001 [10:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:14] 10Operations, 10CAS-SSO, 10Patch-For-Review: icinga Blocked by X-Frame-Options Policy - https://phabricator.wikimedia.org/T251513 (10jbond) >>! In T251513#6282176, @MoritzMuehlenhoff wrote: > which looks a little odd, but it's easy enough to move back to the start page via the URL bar. Yes this is a bit odd... 
[10:08:02] 10Operations, 10netops, 10Sustainability (Incident Prevention): Investigate Junos vmhost snapshot - https://phabricator.wikimedia.org/T257153 (10ayounsi) [10:08:09] (03PS1) 10MarcoAurelio: WIP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609990 [10:08:13] !log addshore@deploy1001 sync-file aborted: [[gerrit:609985]] Make testcommonswiki a testwikidata client T257266 PT1/2 (duration: 00m 36s) [10:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:17] T257266: Missing Wikidata sitelinks on Commons categories - https://phabricator.wikimedia.org/T257266 [10:08:23] (03PS1) 10Filippo Giunchedi: logstash: fix kafka input ssl configuration for eventgate validation errors [puppet] - 10https://gerrit.wikimedia.org/r/610008 (https://phabricator.wikimedia.org/T257294) [10:09:03] * addshore watches a slight increase in log msgs from commons wiki [10:09:34] noooo missed review 610000 :( [10:09:43] dammit [10:09:57] (03CR) 10Jbond: "> I am afraid I don't know of a way to check Horizon in Hiera, which is kind of why i avoid using it and put everything in the repo." 
[puppet] - 10https://gerrit.wikimedia.org/r/602732 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [10:10:34] !log addshore@deploy1001 Synchronized dblists/wikidataclient-test.dblist: [[gerrit:609985]] Make testcommonswiki a testwikidata client T257266 PT1/2 (duration: 00m 56s) [10:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1113:3315', diff saved to https://phabricator.wikimedia.org/P11778 and previous config saved to /var/cache/conftool/dbconfig/20200707-101043-marostegui.json [10:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:50] (03PS2) 10Filippo Giunchedi: logstash: fix kafka input ssl configuration for eventgate validation errors [puppet] - 10https://gerrit.wikimedia.org/r/610008 (https://phabricator.wikimedia.org/T257294) [10:11:05] !log addshore@deploy1001 sync-file aborted: [[gerrit:609985]] Make testcommonswiki a testwikidata client T257266 PT1/2 (duration: 00m 00s) [10:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:10] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:12:04] !log addshore@deploy1001 Synchronized wmf-config/config/testcommonswiki.yaml: [[gerrit:609985]] Make testcommonswiki a testwikidata client T257266 PT2/2 (duration: 00m 55s) [10:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:14] (03PS1) 10Elukey: Remove leftovers of archiva1001 (decommed) [puppet] - 10https://gerrit.wikimedia.org/r/610009 [10:12:42] (03CR) 10Elukey: [C: 03+2] Remove leftovers of 
archiva1001 (decommed) [puppet] - 10https://gerrit.wikimedia.org/r/610009 (owner: 10Elukey) [10:14:15] (03CR) 10Filippo Giunchedi: [C: 03+2] logstash: fix kafka input ssl configuration for eventgate validation errors [puppet] - 10https://gerrit.wikimedia.org/r/610008 (https://phabricator.wikimedia.org/T257294) (owner: 10Filippo Giunchedi) [10:14:43] apergos: FYI i just saw "/home/ariel/scoretesting/GetSomeLYFiles.php: PHP Warning: Use of undefined constant wantedhashdir - assumed 'wantedhashdir'" in logstash lots :P [10:15:20] yeah sorry, and i fixed it [10:15:22] (03PS2) 10MarcoAurelio: [arwiki] Grant 'patrolmarks' to the all [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609990 (https://phabricator.wikimedia.org/T257106) [10:15:22] addshore [10:15:31] :D [10:15:36] that's me trying to do some followup and being a crap php programmer [10:15:44] (03PS1) 10Marostegui: mariadb: Promote db1080 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/610010 (https://phabricator.wikimedia.org/T256717) [10:16:05] (03CR) 10Marostegui: [C: 04-2] "Wait for failover day" [puppet] - 10https://gerrit.wikimedia.org/r/610010 (https://phabricator.wikimedia.org/T256717) (owner: 10Marostegui) [10:17:01] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/609783 (owner: 10Jbond) [10:17:23] (03PS3) 10MarcoAurelio: [arwiki] Grant 'patrolmarks' to all [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609990 (https://phabricator.wikimedia.org/T257106) [10:18:08] (03CR) 10jerkins-bot: [V: 04-1] [arwiki] Grant 'patrolmarks' to all [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609990 (https://phabricator.wikimedia.org/T257106) (owner: 10MarcoAurelio) [10:18:39] addshore: Prod clear or are you still fiddling? I'd like to get the train out at some point. :-) [10:18:47] * addshore is done! [10:18:53] Cool. 
[10:19:05] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash7-eqiad instance=kafkamon1001 job=burrow partition={0,1,10,11,2,3,4,5,6,7,8,9} site=eqiad topic={deprecated,logback-error,logback-info,logback-warn,rsyslog-err,rsyslog-info,rsyslog-notice,rsyslog-warning,udp_localhost-err,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wi [10:19:05] ka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [10:19:07] i had to get that last one out though.... as I just noticed testcommons was sharing cache keys with prod sites .... [10:19:14] * James_F steals the conch. [10:19:16] *facepalm* ... [10:19:21] jouncebot: now [10:19:21] No deployments scheduled for the next 0 hour(s) and 40 minute(s) [10:19:24] the too many message sin kafka is mee, expected [10:19:36] 10Operations: Improve sre.hosts.decommission - https://phabricator.wikimedia.org/T257297 (10elukey) [10:19:38] only affects the logstash7 cluster, not production [10:19:50] addshore: ^ in case you were worried [10:19:59] (03CR) 10MarcoAurelio: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609990 (https://phabricator.wikimedia.org/T257106) (owner: 10MarcoAurelio) [10:20:29] ty [10:20:32] (03CR) 10Kormat: [C: 03+1] mariadb: Promote db1080 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/610010 (https://phabricator.wikimedia.org/T256717) (owner: 10Marostegui) [10:20:42] (03PS12) 10Jbond: java: update java.security to support specifying different EDG's [puppet] - 10https://gerrit.wikimedia.org/r/609783 [10:21:05] (03CR) 10Jbond: java: update java.security to support specifying different EDG's (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/609783 (owner: 10Jbond) [10:21:12] (03PS13) 10Jbond: profile::idp: enable java::Security 
[puppet] - 10https://gerrit.wikimedia.org/r/609784 [10:22:13] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:22:30] (03PS1) 10Elukey: Decommission archiva1001 [dns] - 10https://gerrit.wikimedia.org/r/610011 (https://phabricator.wikimedia.org/T252767) [10:23:24] (03CR) 10Elukey: [C: 03+2] Decommission archiva1001 [dns] - 10https://gerrit.wikimedia.org/r/610011 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [10:23:52] (03PS1) 10Giuseppe Lavagetto: restbase: also add eventgate-main to the envoy listeners [puppet] - 10https://gerrit.wikimedia.org/r/610012 [10:23:53] James_F: also, where in the world are you? is it early or late? [10:24:04] (03CR) 10Jbond: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/602732 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [10:25:19] (03CR) 10Giuseppe Lavagetto: [C: 03+2] restbase: also add eventgate-main to the envoy listeners [puppet] - 10https://gerrit.wikimedia.org/r/610012 (owner: 10Giuseppe Lavagetto) [10:26:06] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/609784 (owner: 10Jbond) [10:26:30] 10Operations, 10DBA, 10Patch-For-Review: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717 (10Marostegui) Failover procedure: OLD MASTER: db1097 NEW MASTER: db1080 [x] Check configuration differences between new and old master `$ pt-config-diff h=db1097.eqiad.wmn... [10:26:38] !log prune PHP 7.0 packages from mw2135-mw2147 [10:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:48] addshore: London. 
[10:27:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1113:3315', diff saved to https://phabricator.wikimedia.org/P11779 and previous config saved to /var/cache/conftool/dbconfig/20200707-102757-marostegui.json [10:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:23] (03Abandoned) 10Jbond: wikidough: testing merge [puppet] - 10https://gerrit.wikimedia.org/r/607543 (owner: 10Jbond) [10:29:03] (03Abandoned) 10Jbond: CI: add some shell scripts to test the new shellcheck CI check [puppet] - 10https://gerrit.wikimedia.org/r/602694 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [10:30:29] (03CR) 10Jbond: [C: 03+1] pcc: Also recommend jenkinsapi Debian package [puppet] - 10https://gerrit.wikimedia.org/r/598704 (owner: 10Muehlenhoff) [10:30:47] (03PS1) 10MarcoAurelio: [hiwikibooks] Translate sitename for hi.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609991 (https://phabricator.wikimedia.org/T256587) [10:32:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1110', diff saved to https://phabricator.wikimedia.org/P11780 and previous config saved to /var/cache/conftool/dbconfig/20200707-103255-marostegui.json [10:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:00] addshore: So… neither? [10:33:02] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] systemd/slice: Install systemd 241 from component/systemd241 [puppet] - 10https://gerrit.wikimedia.org/r/609419 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [10:33:16] James_F: gotcha :) [10:33:31] * James_F grins. [10:33:45] I always find `scap clean` a little worrying to run. [10:33:46] London, where time is none [10:33:51] <_joe_> it's almost lunchtime in london [10:33:59] <_joe_> legoktm: talk about people around late [10:34:13] Yes, dumb script, please do just delete 4GB of code off all the production servers. Let's hope it's not running anywhere. You know, that's sane. 
*sighs* [10:34:14] * addshore will be in london in a month ish [10:34:52] _joe_: maybe I'm just around early today :p [10:34:55] _joe_: Only if you're running a nursery. Lunch isn't until 13:00 in civilised places. ;-) [10:34:59] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:35:39] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:36:37] 10Operations, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10MoritzMuehlenhoff) [10:36:43] (03PS1) 10Arturo Borrero Gonzalez: toolforge: shinken: refresh URLs with new toolforge.org domain [puppet] - 10https://gerrit.wikimedia.org/r/610013 (https://phabricator.wikimedia.org/T234617) [10:36:54] (03PS1) 10Giuseppe Lavagetto: restbase-dev: add the service proxy [puppet] - 10https://gerrit.wikimedia.org/r/610014 [10:38:21] (03CR) 10Majavah: toolforge: shinken: refresh URLs with new toolforge.org domain (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/610013 (https://phabricator.wikimedia.org/T234617) (owner: 10Arturo Borrero Gonzalez) [10:38:35] (03CR) 10Legoktm: [C: 03+1] "LGTM. A future enhancement could be to check that tools.wmflabs.org redirects to toolforge.org as expected." 
[puppet] - 10https://gerrit.wikimedia.org/r/610013 (https://phabricator.wikimedia.org/T234617) (owner: 10Arturo Borrero Gonzalez) [10:39:04] <_joe_> apergos: your script is causing some errors (see the alert earlier) [10:40:11] my script is doing nothing now, I fixed it and it's idle (see add shore's earlier remark) [10:40:15] but thanks for the ping [10:40:32] i'm now reading mw swift code and feeling sorry for myself [10:40:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "> Patch Set 1: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/610013 (https://phabricator.wikimedia.org/T234617) (owner: 10Arturo Borrero Gonzalez) [10:40:55] _joe_: re: the mw exceptions alert from before it could also be a false positive from logstash7 catching up on the backlog [10:41:00] sorry about that [10:41:24] I have disabled the statsd output temporarily too, so shouldn't happen again for now [10:41:25] (03PS1) 10Muehlenhoff: Stop installing git-lfs from stretch-backports [puppet] - 10https://gerrit.wikimedia.org/r/610015 (https://phabricator.wikimedia.org/T256877) [10:43:14] 10Operations, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10MoritzMuehlenhoff) [10:44:34] !log jforrester@deploy1001 Pruned MediaWiki: 1.35.0-wmf.38 (duration: 17m 23s) [10:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:49] <_joe_> sigh [10:45:05] what happened there? 
[10:45:09] (03PS1) 10Jforrester: testwikis wikis to 1.35.0-wmf.40 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610016 [10:45:11] (03CR) 10Jforrester: [C: 03+2] testwikis wikis to 1.35.0-wmf.40 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610016 (owner: 10Jforrester) [10:45:12] <_joe_> bouncer went down [10:45:17] ah ha [10:45:49] (03CR) 10Giuseppe Lavagetto: [C: 03+2] restbase-dev: add the service proxy [puppet] - 10https://gerrit.wikimedia.org/r/610014 (owner: 10Giuseppe Lavagetto) [10:45:53] (03Merged) 10jenkins-bot: testwikis wikis to 1.35.0-wmf.40 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610016 (owner: 10Jforrester) [10:46:03] !log jforrester@deploy1001 Started scap: testwikis wikis to 1.35.0-wmf.40 [10:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:27] James_F standing by in case of deprecation alerts from Revision [10:46:27] (03PS1) 10Muehlenhoff: Remove obsolete apt::pin for librdkafka1 [puppet] - 10https://gerrit.wikimedia.org/r/610017 [10:46:38] DannyS712: Cool. Thank you. [10:46:43] (03PS1) 10Jbond: profile::librenms: update to use lookup instead of hiera call [puppet] - 10https://gerrit.wikimedia.org/r/610018 (https://phabricator.wikimedia.org/T256958) [10:47:10] _joe_: I was saying the mw exceptions alert might be an artifact from logstash7 catching up, I've disabled the statsd output for now so shouldn't happen again [10:47:17] jouncebot: next [10:47:17] In 0 hour(s) and 12 minute(s): European mid-day backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200707T1100) [10:47:33] <_joe_> godog: not sure though, those errors were from a user script [10:48:00] _joe_: ah ok, nevermind [10:48:25] hauskatze: Ah, hmm, I'll crash the backport window a bit, sorry. [10:48:42] * James_F paints go-faster stripes on the side of scap. 
[10:48:53] (03PS2) 10MarcoAurelio: [hiwikibooks] Translate sitename for hi.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609991 (https://phabricator.wikimedia.org/T256587) [10:49:12] * DannyS712 investigates making scap more aerodynamic for improved performance [10:49:28] James_F: no probs, gives me a bit more time to finish :) [10:49:33] <_joe_> restbase-dev1004 is already known btw [10:51:15] (03PS2) 10Jbond: profile::librenms: update to use lookup instead of hiera call [puppet] - 10https://gerrit.wikimedia.org/r/610018 (https://phabricator.wikimedia.org/T256958) [10:53:05] (03CR) 10Peter.ovchyn: Rename WPBSkinBlacklist to WPBSkinDisabled (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609490 (https://phabricator.wikimedia.org/T254675) (owner: 10Peter.ovchyn) [10:53:41] (03CR) 10Ema: [C: 03+1] Remove obsolete apt::pin for librdkafka1 [puppet] - 10https://gerrit.wikimedia.org/r/610017 (owner: 10Muehlenhoff) [10:53:43] (03PS3) 10Peter.ovchyn: Remove WPBSkinBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609490 (https://phabricator.wikimedia.org/T254675) [10:54:35] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resou [10:54:35] mpany page content HTML for test page returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned 
the unexpected st [10:54:35] ng: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 respond https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:55:13] (03PS4) 10Peter.ovchyn: Remove WPBSkinBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609490 (https://phabricator.wikimedia.org/T254675) [10:57:01] (03PS3) 10Jbond: profile::librenms: update to use lookup instead of hiera call [puppet] - 10https://gerrit.wikimedia.org/r/610018 (https://phabricator.wikimedia.org/T256958) [10:57:02] !log prune PHP 7.0 packages from mw2190-mw2214 [10:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:28] <_joe_> restbase-dev is nme, fixing [10:58:35] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [mostread] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:59:43] (03PS1) 10Giuseppe Lavagetto: restbase-dev: use the listeners for restbase, not the default ones [puppet] - 10https://gerrit.wikimedia.org/r/610021 [10:59:49] jouncebot: refresh [10:59:50] I refreshed my knowledge about deployments. [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European mid-day backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200707T1100). 
[11:00:04] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] restbase-dev: use the listeners for restbase, not the default ones [puppet] - 10https://gerrit.wikimedia.org/r/610021 (owner: 10Giuseppe Lavagetto) [11:00:04] hauskatze: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:43] missing space between window and (Max 6 patches) :-) [11:02:15] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:02:37] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:02:49] (Still scapping.) [11:03:39] <_joe_> a critical on mw exceptions [11:03:48] (03CR) 10Jbond: profile::librenms: update to use lookup instead of hiera call (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/610018 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [11:03:52] Hmm. [11:04:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1110', diff saved to https://phabricator.wikimedia.org/P11781 and previous config saved to /var/cache/conftool/dbconfig/20200707-110412-marostegui.json [11:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:31] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:04:36] <_joe_> eventbus, apparently [11:04:39] <_joe_> disregard james [11:04:50] <_joe_> it's eventgate-main farting [11:04:55] Yeah. [11:05:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1130', diff saved to https://phabricator.wikimedia.org/P11782 and previous config saved to /var/cache/conftool/dbconfig/20200707-110506-marostegui.json [11:05:07] Shock, etc. [11:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:27] <_joe_> sorry for the scare :) [11:07:03] * James_F grins. [11:07:33] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:16:01] (03PS1) 10Arturo Borrero Gonzalez: toolforge: ssh banners: refresh URLs [puppet] - 10https://gerrit.wikimedia.org/r/610024 (https://phabricator.wikimedia.org/T234617) [11:16:23] PROBLEM - puppet last run on restbase1017 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:16:31] PROBLEM - puppet last run on restbase2012 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:16:39] PROBLEM - puppet last run on restbase1018 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:17:17] are all queueing-related soft alerts related to that known issue? 
[11:17:25] !log prune PHP 7.0 packages from mwdebug1001/2001/2002 [11:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:45] PROBLEM - puppet last run on restbase1021 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:17:50] PROBLEM - puppet last run on restbase2014 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:18:02] PROBLEM - puppet last run on restbase1023 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:18:02] PROBLEM - puppet last run on restbase2023 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:18:12] PROBLEM - puppet last run on restbase1019 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:18:20] PROBLEM - puppet last run on restbase2015 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:18:34] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 320.6 ge 210 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [11:18:48] PROBLEM - puppet last run on restbase1027 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:19:06] PROBLEM - puppet last run on restbase1022 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:21:56] PROBLEM - puppet last run on restbase2017 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:22:52] RECOVERY - puppet last 
run on restbase2014 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:23:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: ssh banners: refresh URLs [puppet] - 10https://gerrit.wikimedia.org/r/610024 (https://phabricator.wikimedia.org/T234617) (owner: 10Arturo Borrero Gonzalez) [11:24:28] (03PS1) 10Majavah: Change toolforge error pages to use toolforge logo instead of toollabs logo [puppet] - 10https://gerrit.wikimedia.org/r/610026 [11:26:25] (03PS5) 10Privacybatm: Transferer.py: Calculate source checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/608640 (https://phabricator.wikimedia.org/T254979) [11:26:46] !log test bumping logstash7 batch size to 256 [11:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:37] James_F: still scapping? I don't mind scheduling my patches for another bacon window, they ain't urgent [11:28:14] RECOVERY - puppet last run on restbase1023 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:28:17] Yeah, sorry. [11:28:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1130', diff saved to https://phabricator.wikimedia.org/P11783 and previous config saved to /var/cache/conftool/dbconfig/20200707-112830-marostegui.json [11:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:38] James_F: no probs, really [11:28:44] Cool. 
[11:29:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1082', diff saved to https://phabricator.wikimedia.org/P11784 and previous config saved to /var/cache/conftool/dbconfig/20200707-112926-marostegui.json [11:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:51] !log Deploy schema change on db1082, this will create lag on s5 labs [11:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:52] PROBLEM - puppet last run on restbase1026 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:31:08] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:31:09] Noted in the calendar [11:31:52] RECOVERY - puppet last run on restbase1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:32:10] PROBLEM - puppet last run on restbase1024 is CRITICAL: CRITICAL: Puppet last ran 21 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:32:24] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:33:03] !log installing PHP 7.0 security updates [11:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:34] PROBLEM - puppet last run on restbase2020 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:33:50] RECOVERY - puppet last run on restbase1019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:34:05] (03PS1) 10Ema: varnish: simplify rate limiting for cache_upload [puppet] - 10https://gerrit.wikimedia.org/r/610027 [11:34:08] (03PS1) 10Ema: varnish: Facebook temporary experiment is permanent [puppet] - 10https://gerrit.wikimedia.org/r/610028 (https://phabricator.wikimedia.org/T192688) [11:34:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, but also adding Bryan as reviewer before merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/610026 (owner: 10Majavah) [11:35:10] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/610018 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [11:37:34] RECOVERY - puppet last run on restbase1018 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:37:40] RECOVERY - puppet last run on restbase1024 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:38:56] !log jforrester@deploy1001 scap failed: LockFailedError Failed to acquire lock "/var/lock/scap.operations_mediawiki-config.lock"; owner is "jforrester"; reason is "testwikis wikis to 1.35.0-wmf.40" (duration: 00m 00s) [11:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:02] RECOVERY - puppet last run on restbase2020 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:39:19] !log jforrester@deploy1001 Started scap: Full scap and testwikis to 1.35.0-wmf.40 for T256668 [11:39:23] (03PS1) 10Arturo Borrero Gonzalez: toolforge: urlproxy: drop support for the legacy routing scheme [puppet] - 10https://gerrit.wikimedia.org/r/610029 (https://phabricator.wikimedia.org/T234617) [11:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:24] T256668: 1.35.0-wmf.40 deployment blockers - https://phabricator.wikimedia.org/T256668 [11:39:33] (Scap halted; re-starting.) 
[11:39:34] RECOVERY - puppet last run on restbase2015 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:40:32] RECOVERY - puppet last run on restbase1022 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:41:58] RECOVERY - puppet last run on restbase1026 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:43:06] RECOVERY - puppet last run on restbase2012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:43:30] RECOVERY - puppet last run on restbase2017 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:44:20] RECOVERY - puppet last run on restbase1021 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:44:46] RECOVERY - puppet last run on restbase2023 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:45:38] (03PS1) 10Jbond: (WIP) librenms: add support for apereo cas [puppet] - 10https://gerrit.wikimedia.org/r/610030 [11:46:00] RECOVERY - puppet last run on restbase1027 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:51:00] (03PS1) 10Ema: varnish: apply 'public_clouds_shutdown' to all requests [puppet] - 10https://gerrit.wikimedia.org/r/610031 [11:58:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1082', diff saved to 
https://phabricator.wikimedia.org/P11786 and previous config saved to /var/cache/conftool/dbconfig/20200707-115838-marostegui.json [11:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:02] (03PS1) 10Jbond: graphite: add graphite host as a global [puppet] - 10https://gerrit.wikimedia.org/r/610035 [11:59:04] (03PS1) 10Jbond: profile::cassandra::single_instance: update to graphite_hosts global [puppet] - 10https://gerrit.wikimedia.org/r/610036 [11:59:58] (03CR) 10Ema: "I propose that we go one step further and just unconditionally block all public cloud requests when the switch is flipped: https://gerrit." [puppet] - 10https://gerrit.wikimedia.org/r/609477 (owner: 10Jbond) [12:00:05] (03PS2) 10Jbond: profile::cassandra::single_instance: update to graphite_hosts global [puppet] - 10https://gerrit.wikimedia.org/r/610036 [12:00:12] (03CR) 10Ema: [C: 04-1] vcl: public_clouds_shutdown: ratelimit API reqs as well (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609480 (owner: 10CDanis) [12:00:33] (03CR) 10jerkins-bot: [V: 04-1] profile::cassandra::single_instance: update to graphite_hosts global [puppet] - 10https://gerrit.wikimedia.org/r/610036 (owner: 10Jbond) [12:01:18] !log Deploy schema change on labswiki (wikitech) master - T253276 [12:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:23] T253276: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 [12:04:32] (03PS1) 10Kormat: mariadb: Promote es2021 to es5 master in codfw [puppet] - 10https://gerrit.wikimedia.org/r/610038 (https://phabricator.wikimedia.org/T257284) [12:06:02] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Remove obsolete apt::pin for librdkafka1 [puppet] - 10https://gerrit.wikimedia.org/r/610017 (owner: 10Muehlenhoff) [12:07:10] (03CR) 10Marostegui: [C: 03+1] "This looks good, but requires coordination, as we need to do the topologies changes first, then the dbctl change and then 
we can merge and" [puppet] - 10https://gerrit.wikimedia.org/r/610038 (https://phabricator.wikimedia.org/T257284) (owner: 10Kormat) [12:09:39] (03PS1) 10Jbond: profile::statistics::explorer::misc_jobs: add graphite_host global [puppet] - 10https://gerrit.wikimedia.org/r/610039 [12:10:40] 10Operations, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10MoritzMuehlenhoff) [12:12:28] !log jforrester@deploy1001 Finished scap: Full scap and testwikis to 1.35.0-wmf.40 for T256668 (duration: 33m 09s) [12:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:33] T256668: 1.35.0-wmf.40 deployment blockers - https://phabricator.wikimedia.org/T256668 [12:12:41] Finally. [12:13:46] James_F any Revision issues? I'll go make some test edits [12:14:28] (03PS1) 10Muehlenhoff: systemd/slice: Update comment [puppet] - 10https://gerrit.wikimedia.org/r/610040 [12:14:43] DannyS712: So far spotted an issue with the sidebar. [12:14:48] Otherwise, quiet. 
[12:15:18] 10Operations, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10MoritzMuehlenhoff) [12:15:26] 10Operations, 10Traffic: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 (10ema) [12:15:45] 10Operations: Upload git 2.20 package from stretch-backports to component/git - https://phabricator.wikimedia.org/T257308 (10hashar) [12:15:54] hmm, the centralnotice at the top says its still on .39, though Special:Version says .40 [12:16:09] 10Operations: Upload git 2.20 package from stretch-backports to component/git - https://phabricator.wikimedia.org/T257308 (10hashar) [12:16:12] 10Operations, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10hashar) [12:17:01] I hit the ratelimit with my vandalism test edits :( [12:17:20] hopefully I made enough to trigger any issues though [12:17:57] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] systemd/slice: Update comment [puppet] - 10https://gerrit.wikimedia.org/r/610040 (owner: 10Muehlenhoff) [12:20:35] 10Operations, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10hashar) [12:20:47] Hmm. [12:20:59] .40 e/W/l/i/SettingsArray:60 Attempt to get non-existing setting "repoScriptPath" [12:22:03] James_F T257296 [12:22:03] T257296: Beta cluster wikidata fails to load scripts with "OutOfBoundsException from line 60 of /srv/mediawiki/php-master/extensions/Wikibase/lib/includes/SettingsArray.php: Attempt to get non-existing setting "repoScriptPath" - https://phabricator.wikimedia.org/T257296 [12:23:01] Thanks. [12:24:08] (03PS1) 10Ema: ATS: handle backend checks at healthcheck.wm.org/ats-be too [puppet] - 10https://gerrit.wikimedia.org/r/610041 (https://phabricator.wikimedia.org/T255015) [12:27:11] addshore: ^^ FYI. :-( Train blocker, I think. 
[12:27:56] (03PS1) 10Ema: varnish: send backend probes to healthcheck.wm.org/ats-be [puppet] - 10https://gerrit.wikimedia.org/r/610042 (https://phabricator.wikimedia.org/T255015) [12:29:14] (03PS1) 10Muehlenhoff: Drop Puppet code which tries to install graphite-web from stretch-bpo [puppet] - 10https://gerrit.wikimedia.org/r/610043 (https://phabricator.wikimedia.org/T256877) [12:30:04] !log kormat@cumin1001 dbctl commit (dc=all): 'Set es2021 to weight 50 T257284', diff saved to https://phabricator.wikimedia.org/P11787 and previous config saved to /var/cache/conftool/dbconfig/20200707-123003-kormat.json [12:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:09] T257284: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284 [12:33:04] (03PS1) 10Jbond: statsite::instance: fix style violation in define [puppet] - 10https://gerrit.wikimedia.org/r/610045 [12:34:01] (03PS2) 10Jbond: statsite::instance: fix style violation in define [puppet] - 10https://gerrit.wikimedia.org/r/610045 [12:35:07] (03PS1) 10Ema: varnish: handle checks at healthcheck.wm.org/varnish-fe too [puppet] - 10https://gerrit.wikimedia.org/r/610046 (https://phabricator.wikimedia.org/T255015) [12:35:50] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/610039 (owner: 10Jbond) [12:35:56] James_F that huge request url means that my notifications view on phabricator just stretched to three times as wide [12:36:19] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/610036 (owner: 10Jbond) [12:36:58] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/610045 (owner: 10Jbond) [12:37:46] (03PS4) 10Jbond: profile::librenms: update to use lookup instead of hiera call [puppet] - 10https://gerrit.wikimedia.org/r/610018 (https://phabricator.wikimedia.org/T256958) [12:39:04] 10Operations, 10LDAP-Access-Requests: Add Conny Kawohl to the ldap/wmde and ldap/nda 
group - https://phabricator.wikimedia.org/T257038 (10Tobi_WMDE_SW) >>! In T257038#6282017, @jcrespo wrote: > Access-wise, if someone (normally your manager or a WMF contact) can verify your WMDE contractual relationship? I bel... [12:39:42] (03CR) 10ZPapierski: add a README about the content of the commons structured data dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609823 (https://phabricator.wikimedia.org/T221917) (owner: 10ArielGlenn) [12:40:07] (03PS5) 10Jbond: profile::librenms: update to use lookup instead of hiera call [puppet] - 10https://gerrit.wikimedia.org/r/610018 (https://phabricator.wikimedia.org/T256958) [12:41:42] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [12:41:54] 10Operations, 10ChangeProp, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 7 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10hnowlan) 05Open→03Resolved [12:41:59] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10Epic, and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10hnowlan) [12:42:40] (03PS1) 10Ema: LVS: update Varnish healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/610047 (https://phabricator.wikimedia.org/T255015) [12:42:42] (03PS1) 10Ema: icinga: update Varnish healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/610048 (https://phabricator.wikimedia.org/T255015) [12:43:45] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/609477 (owner: 10Jbond) [12:43:55] (03CR) 10Kormat: [C: 03+2] mariadb: Promote es2021 to es5 master in codfw [puppet] - 
10https://gerrit.wikimedia.org/r/610038 (https://phabricator.wikimedia.org/T257284) (owner: 10Kormat) [12:44:38] !log starting (codfw) es5 failover from es2020 to es2021 T257284 [12:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:43] T257284: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284 [12:46:23] 10Operations, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10MoritzMuehlenhoff) [12:46:25] (03PS1) 10Muehlenhoff: Don't include "backports" and "thirdparty" components into Stretch images [puppet] - 10https://gerrit.wikimedia.org/r/610049 [12:46:27] (03PS1) 10Muehlenhoff: Stop including backports in Stretch images [puppet] - 10https://gerrit.wikimedia.org/r/610050 (https://phabricator.wikimedia.org/T256877) [12:47:36] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/609477 (owner: 10Jbond) [12:50:16] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/23738/" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/610018 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [12:50:30] (03PS1) 10Ema: varnish: stop responding to varnishcheck.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/610051 (https://phabricator.wikimedia.org/T255015) [12:50:32] (03PS1) 10Ema: ATS: stop responding to varnishcheck/status [puppet] - 10https://gerrit.wikimedia.org/r/610052 (https://phabricator.wikimedia.org/T255015) [12:51:31] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10akosiaris) [12:57:18] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [12:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:40] (03PS2) 10Jbond: (WIP) librenms: add support for apereo cas [puppet] - 
10https://gerrit.wikimedia.org/r/610030 [12:59:50] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:10] (03CR) 10Vgutierrez: [C: 03+1] ATS: handle backend checks at healthcheck.wm.org/ats-be too [puppet] - 10https://gerrit.wikimedia.org/r/610041 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [13:06:28] (03PS3) 10Elukey: profile::mediawiki::mcrouter_wancache: send probe after 60s [puppet] - 10https://gerrit.wikimedia.org/r/607026 (https://phabricator.wikimedia.org/T255511) [13:06:42] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Remove unused profile::logstash::collector::input_kafka_ssl_truststore_password [labs/private] - 10https://gerrit.wikimedia.org/r/609844 (owner: 10Ottomata) [13:07:30] (03PS6) 10Jbond: profile::librenms: update to use lookup instead of hiera call [puppet] - 10https://gerrit.wikimedia.org/r/610018 (https://phabricator.wikimedia.org/T256958) [13:08:06] (03CR) 10Elukey: profile::mediawiki::mcrouter_wancache: send probe after 60s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607026 (https://phabricator.wikimedia.org/T255511) (owner: 10Elukey) [13:08:07] (03PS3) 10Jbond: (WIP) librenms: add support for apereo cas [puppet] - 10https://gerrit.wikimedia.org/r/610030 [13:12:30] (03CR) 10Vgutierrez: [C: 03+1] ATS: handle backend checks at healthcheck.wm.org/ats-be too (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/610041 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [13:12:41] (03PS1) 10Ottomata: Refine SearchSatisfaction using new eventlogging_legacy job [puppet] - 10https://gerrit.wikimedia.org/r/610055 (https://phabricator.wikimedia.org/T249261) [13:12:46] (03PS2) 10Ema: ATS: handle backend checks at healthcheck.wm.org/ats-be too [puppet] - 10https://gerrit.wikimedia.org/r/610041 (https://phabricator.wikimedia.org/T255015) [13:12:48] (03PS2) 10Ema: varnish: send backend probes to 
healthcheck.wm.org/ats-be [puppet] - 10https://gerrit.wikimedia.org/r/610042 (https://phabricator.wikimedia.org/T255015) [13:12:50] (03PS2) 10Ema: varnish: handle checks at healthcheck.wm.org/varnish-fe too [puppet] - 10https://gerrit.wikimedia.org/r/610046 (https://phabricator.wikimedia.org/T255015) [13:12:52] (03PS2) 10Ema: LVS: update Varnish healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/610047 (https://phabricator.wikimedia.org/T255015) [13:12:54] (03PS2) 10Ema: icinga: update Varnish healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/610048 (https://phabricator.wikimedia.org/T255015) [13:12:56] (03PS2) 10Ema: varnish: stop responding to varnishcheck.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/610051 (https://phabricator.wikimedia.org/T255015) [13:12:58] (03PS2) 10Ema: ATS: stop responding to varnishcheck/status [puppet] - 10https://gerrit.wikimedia.org/r/610052 (https://phabricator.wikimedia.org/T255015) [13:13:34] (03PS1) 10Awight: Provision WMDE TeWü survey for prototype 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610056 (https://phabricator.wikimedia.org/T257306) [13:13:44] nice wall of CRs ema :D [13:14:25] (03CR) 10Ayounsi: [C: 03+1] "I'm no expert but it lgtm and renaming is coherent." [puppet] - 10https://gerrit.wikimedia.org/r/610018 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [13:14:37] vgutierrez: I haven't broken the site by my own direct action in a while, it's exciting! 
[13:15:23] messing with healthchecks seems the way to go [13:15:24] !log kormat@cumin1001 dbctl commit (dc=all): 'Promote es2021 to es4 master T257284', diff saved to https://phabricator.wikimedia.org/P11789 and previous config saved to /var/cache/conftool/dbconfig/20200707-131524-kormat.json [13:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:29] T257284: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284 [13:15:33] I'm jealous I didn't think about that [13:17:09] (03PS7) 10Jbond: profile::librenms: update to use lookup instead of hiera call [puppet] - 10https://gerrit.wikimedia.org/r/610018 (https://phabricator.wikimedia.org/T256958) [13:17:23] 10Operations, 10Wikimedia-Logstash: No logs ingested in logstash7 since 2020-07-06 19:23 - https://phabricator.wikimedia.org/T257294 (10fgiunchedi) This is fixed now, though no alerts fired when no logs were ingested so I'll take over the task to fix that too [13:18:18] (03PS1) 10Ottomata: eventgate-* - bump to version 2020-07-07-130941-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/610057 (https://phabricator.wikimedia.org/T239459) [13:18:22] (03CR) 10Ottomata: [C: 03+2] Refine SearchSatisfaction using new eventlogging_legacy job [puppet] - 10https://gerrit.wikimedia.org/r/610055 (https://phabricator.wikimedia.org/T249261) (owner: 10Ottomata) [13:18:35] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)210 ge (W)150 ge 108.5 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [13:19:05] 10Operations, 10Wikimedia-Logstash: Alert on no (or "few") logs indexed (was: No logs ingested in logstash7 since 2020-07-06 19:23) - https://phabricator.wikimedia.org/T257294 (10fgiunchedi) [13:19:44] (03PS8) 10Jbond: profile::librenms: update to use lookup instead of hiera call [puppet] - 
10https://gerrit.wikimedia.org/r/610018 (https://phabricator.wikimedia.org/T256958) [13:20:07] 10Operations, 10Dumps-Generation, 10SDC General, 10Wikidata: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10ArielGlenn) Updated.{F31919691} [13:20:34] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 (10fgiunchedi) Did a few more tests on this while working on {T257294} although bumping pipeline workers (20) and increasing... [13:20:43] (03CR) 10Vgutierrez: [C: 03+1] varnish: send backend probes to healthcheck.wm.org/ats-be [puppet] - 10https://gerrit.wikimedia.org/r/610042 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [13:21:24] (03PS4) 10Jbond: (WIP) librenms: add support for apereo cas [puppet] - 10https://gerrit.wikimedia.org/r/610030 [13:21:57] ottomata: re: eventgate validation errors, I noticed there's a typo in the topic name FYI [13:22:03] ottomata: modules/profile/manifests/logstash/collector.pp: 'topic' => 'codfw.eventgate-loging-external.error.validation' [13:22:14] (03PS1) 10Alexandros Kosiaris: proton: Add upload-lb IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/610058 [13:22:27] ACKNOWLEDGEMENT - MariaDB Replica Lag: es4 on es2020 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 791.21 seconds Kormat Known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [13:22:51] (03CR) 10Vgutierrez: [C: 03+1] varnish: handle checks at healthcheck.wm.org/varnish-fe too [puppet] - 10https://gerrit.wikimedia.org/r/610046 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [13:22:56] (03PS2) 10Alexandros Kosiaris: proton: Add upload-lb IPs to calico configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/610058 [13:23:30] OH! 
[13:23:31] godog: thank you [13:23:33] fixing now [13:24:08] !log cr1-eqiad> request vmhost snapshot routing-engine both - T257153 [13:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:14] T257153: Investigate Junos vmhost snapshot - https://phabricator.wikimedia.org/T257153 [13:24:58] ottomata: np! thank you [13:25:58] (03PS1) 10Ottomata: Fix topic name typo for eventgate validation error logstash collector [puppet] - 10https://gerrit.wikimedia.org/r/610059 (https://phabricator.wikimedia.org/T116719) [13:26:51] (03PS4) 10Hnowlan: api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [13:27:04] (03PS2) 10Ottomata: Fix topic name typo for eventgate validation error logstash collector [puppet] - 10https://gerrit.wikimedia.org/r/610059 (https://phabricator.wikimedia.org/T116719) [13:27:11] (03CR) 10Filippo Giunchedi: [C: 03+2] logstash: decom check_procs [puppet] - 10https://gerrit.wikimedia.org/r/609397 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi) [13:27:20] (03PS2) 10Filippo Giunchedi: logstash: decom check_procs [puppet] - 10https://gerrit.wikimedia.org/r/609397 (https://phabricator.wikimedia.org/T234854) [13:27:53] (03CR) 10jerkins-bot: [V: 04-1] api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [13:28:28] (03CR) 10Ottomata: [C: 03+2] Fix topic name typo for eventgate validation error logstash collector [puppet] - 10https://gerrit.wikimedia.org/r/610059 (https://phabricator.wikimedia.org/T116719) (owner: 10Ottomata) [13:28:36] (03CR) 10Filippo Giunchedi: [C: 03+1] mtail: remove component and upgrade mtail to 3.0.0-rc35-3~wmf2 across the fleet [puppet] - 10https://gerrit.wikimedia.org/r/608721 (https://phabricator.wikimedia.org/T255776) (owner: 10Cwhite) [13:29:07] !log cr2-eqiad> request vmhost snapshot 
routing-engine both - T257153 [13:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:15] (03CR) 10Ottomata: [C: 03+2] eventgate-* - bump to version 2020-07-07-130941-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/610057 (https://phabricator.wikimedia.org/T239459) (owner: 10Ottomata) [13:29:18] (03PS3) 10Ema: ATS: handle backend checks at healthcheck.wm.org/ats-be too [puppet] - 10https://gerrit.wikimedia.org/r/610041 (https://phabricator.wikimedia.org/T255015) [13:29:20] (03PS3) 10Ema: varnish: send backend probes to healthcheck.wm.org/ats-be [puppet] - 10https://gerrit.wikimedia.org/r/610042 (https://phabricator.wikimedia.org/T255015) [13:29:22] (03PS3) 10Ema: varnish: handle checks at healthcheck.wm.org/varnish-fe too [puppet] - 10https://gerrit.wikimedia.org/r/610046 (https://phabricator.wikimedia.org/T255015) [13:29:24] (03PS3) 10Ema: LVS: update Varnish healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/610047 (https://phabricator.wikimedia.org/T255015) [13:29:26] (03PS3) 10Ema: icinga: update Varnish healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/610048 (https://phabricator.wikimedia.org/T255015) [13:29:28] (03PS3) 10Ema: varnish: stop responding to varnishcheck.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/610051 (https://phabricator.wikimedia.org/T255015) [13:29:30] (03PS3) 10Ema: ATS: stop responding to varnishcheck/status [puppet] - 10https://gerrit.wikimedia.org/r/610052 (https://phabricator.wikimedia.org/T255015) [13:30:09] (03PS5) 10Hnowlan: api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [13:30:37] (03PS3) 10Filippo Giunchedi: logstash: decom check_procs [puppet] - 10https://gerrit.wikimedia.org/r/609397 (https://phabricator.wikimedia.org/T234854) [13:30:53] (03CR) 10Hnowlan: api-gateway: Basic envoy chart WIP (036 comments) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [13:31:29] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [13:31:29] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [13:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:11] (03CR) 10Filippo Giunchedi: "Thanks for the changes! I think this is correct now, although my understanding is that the expired should run as a singleton across each c" [puppet] - 10https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584) (owner: 10Dave Pifke) [13:32:13] (03CR) 10jerkins-bot: [V: 04-1] api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [13:32:45] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:33:42] (03CR) 10Kosta Harlan: [C: 03+1] Remove old incorrect GrowthExperiments survey config from beta kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608892 (https://phabricator.wikimedia.org/T256828) (owner: 10Gergő Tisza) [13:34:46] (03CR) 10Kosta Harlan: [C: 04-1] "Should it be 240 or 270? The comment says one thing and the command argument another." 
[puppet] - 10https://gerrit.wikimedia.org/r/609226 (https://phabricator.wikimedia.org/T252575) (owner: 10Gergő Tisza) [13:34:52] (03PS4) 10Filippo Giunchedi: logstash: decom check_procs [puppet] - 10https://gerrit.wikimedia.org/r/609397 (https://phabricator.wikimedia.org/T234854) [13:35:06] 10Operations, 10netops, 10Sustainability (Incident Prevention): Investigate Junos vmhost snapshot - https://phabricator.wikimedia.org/T257153 (10ayounsi) 05Open→03Resolved All done! [13:35:09] (03PS1) 10Elukey: role::druid::test_analytics::worker: fix zookeper version for Buster [puppet] - 10https://gerrit.wikimedia.org/r/610060 [13:35:22] (03CR) 10Kosta Harlan: [C: 03+1] "Oh right, 240 plus 30. Sorry! Trying to review from a phone :/" [puppet] - 10https://gerrit.wikimedia.org/r/609226 (https://phabricator.wikimedia.org/T252575) (owner: 10Gergő Tisza) [13:35:53] 10Operations, 10Wikimedia-Logstash, 10observability: Alert on no (or "few") logs indexed (was: No logs ingested in logstash7 since 2020-07-06 19:23) - https://phabricator.wikimedia.org/T257294 (10fgiunchedi) [13:36:33] (03PS1) 10Kormat: es2020: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/610061 (https://phabricator.wikimedia.org/T257284) [13:37:37] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:37:42] (03CR) 10Elukey: [C: 03+2] role::druid::test_analytics::worker: fix zookeper version for Buster [puppet] - 10https://gerrit.wikimedia.org/r/610060 (owner: 10Elukey) [13:38:14] <_joe_> !log rolling restart of restbase to pick up using envoy [13:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:14] (03PS4) 10Elukey: profile::mediawiki::mcrouter_wancache: send probe after 60s [puppet] - 10https://gerrit.wikimedia.org/r/607026 (https://phabricator.wikimedia.org/T255511) [13:39:36] (03PS5) 10Jbond: librenms: add support for apereo cas [puppet] - 10https://gerrit.wikimedia.org/r/610030 (https://phabricator.wikimedia.org/T256958) [13:41:13] (03PS2) 10ArielGlenn: add a README about the content of the commons structured data dumps [puppet] - 10https://gerrit.wikimedia.org/r/609823 (https://phabricator.wikimedia.org/T221917) [13:42:06] (03PS9) 10Jbond: profile::librenms: update to use lookup instead of hiera call [puppet] - 10https://gerrit.wikimedia.org/r/610018 (https://phabricator.wikimedia.org/T256958) [13:42:39] (03CR) 10ZPapierski: [C: 03+1] add a README about the content of the commons structured data dumps [puppet] - 10https://gerrit.wikimedia.org/r/609823 (https://phabricator.wikimedia.org/T221917) (owner: 10ArielGlenn) [13:43:01] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:43:13] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:43:13] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:44:19] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:45:12] (03PS5) 10Jcrespo: scap configuration for integration/docroot [puppet] - 10https://gerrit.wikimedia.org/r/607056 (https://phabricator.wikimedia.org/T256005) (owner: 10Hashar) [13:46:01] (03PS3) 10ArielGlenn: add a README about the content of the commons structured data dumps [puppet] - 10https://gerrit.wikimedia.org/r/609823 (https://phabricator.wikimedia.org/T221917) [13:46:19] (03CR) 10Marostegui: [C: 03+1] es2020: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/610061 (https://phabricator.wikimedia.org/T257284) (owner: 10Kormat) [13:46:23] (03PS2) 10Jbond: graphite: add graphite host as a global [puppet] - 10https://gerrit.wikimedia.org/r/610035 [13:46:25] (03PS10) 10Jbond: profile::librenms: update to use lookup instead of hiera call [puppet] - 10https://gerrit.wikimedia.org/r/610018 (https://phabricator.wikimedia.org/T256958) [13:46:27] (03PS6) 10Jbond: librenms: add support for apereo cas [puppet] - 10https://gerrit.wikimedia.org/r/610030 (https://phabricator.wikimedia.org/T256958) [13:46:31] RECOVERY - restbase endpoints health on restbase2012 
is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:46:38] (03CR) 10Kormat: [C: 03+2] es2020: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/610061 (https://phabricator.wikimedia.org/T257284) (owner: 10Kormat) [13:47:01] (03CR) 10Jcrespo: [C: 03+2] scap configuration for integration/docroot [puppet] - 10https://gerrit.wikimedia.org/r/607056 (https://phabricator.wikimedia.org/T256005) (owner: 10Hashar) [13:48:13] (03CR) 10ArielGlenn: [C: 03+2] add a README about the content of the commons structured data dumps [puppet] - 10https://gerrit.wikimedia.org/r/609823 (https://phabricator.wikimedia.org/T221917) (owner: 10ArielGlenn) [13:48:16] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::cassandra::single_instance: update to graphite_hosts global [puppet] - 10https://gerrit.wikimedia.org/r/610036 (owner: 10Jbond) [13:49:44] (03CR) 10Filippo Giunchedi: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/610045 (owner: 10Jbond) [13:50:03] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:50:07] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:50:59] (03PS3) 10Jbond: statsite::instance: fix style violation in define [puppet] - 10https://gerrit.wikimedia.org/r/610045 [13:51:12] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/610045 (owner: 10Jbond) [13:52:00] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/610039 (owner: 10Jbond) [13:52:43] (03CR) 10Andrew Bogott: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/609834 (https://phabricator.wikimedia.org/T257231) (owner: 10Andrew Bogott) [13:52:46] (03CR) 10Filippo Giunchedi: [C: 03+1] Drop Puppet code which tries to install graphite-web from 
stretch-bpo [puppet] - 10https://gerrit.wikimedia.org/r/610043 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [13:53:16] (03PS7) 10Jbond: librenms: add support for apereo cas [puppet] - 10https://gerrit.wikimedia.org/r/610030 (https://phabricator.wikimedia.org/T256958) [13:53:19] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:54:36] 10Operations, 10Release-Engineering-Team, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10ssingh) a:05ssingh→03None The only remaining item on this task is the "cloud (labs) groups: deployment-prep", that is better suited for the r... [13:56:29] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/23743/" [puppet] - 10https://gerrit.wikimedia.org/r/610030 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [13:56:37] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:56:43] (03CR) 10Jbond: "ready for review this should be no-op" [puppet] - 10https://gerrit.wikimedia.org/r/610030 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [13:58:25] PROBLEM - Keyholder SSH agent on deploy1001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. 
https://wikitech.wikimedia.org/wiki/Keyholder [13:58:34] (03PS2) 10Jbond: profile::statistics::explorer::misc_jobs: add graphite_host global [puppet] - 10https://gerrit.wikimedia.org/r/610039 [14:00:25] (03PS2) 10Andrew Bogott: Galera: use mariadb service name rather than mysql [puppet] - 10https://gerrit.wikimedia.org/r/609834 (https://phabricator.wikimedia.org/T257231) [14:01:27] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:01:50] (03PS3) 10Jbond: profile::statistics::explorer::misc_jobs: add graphite_host global [puppet] - 10https://gerrit.wikimedia.org/r/610039 [14:02:02] (03PS1) 10Jcrespo: Revert "scap configuration for integration/docroot" [puppet] - 10https://gerrit.wikimedia.org/r/609998 [14:03:19] (03CR) 10jerkins-bot: [V: 04-1] Revert "scap configuration for integration/docroot" [puppet] - 10https://gerrit.wikimedia.org/r/609998 (owner: 10Jcrespo) [14:03:35] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:03:49] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_restbase_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:04:07] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} 
(Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was rec [14:04:07] itech.wikimedia.org/wiki/Services/Monitoring/restbase [14:04:09] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:04:17] (03PS2) 10Jcrespo: Revert "scap configuration for integration/docroot" [puppet] - 10https://gerrit.wikimedia.org/r/609998 [14:04:19] (03PS1) 10Ottomata: standard_packages - add httpie [puppet] - 10https://gerrit.wikimedia.org/r/610065 [14:04:26] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/610039 (owner: 10Jbond) [14:04:32] (03PS4) 10Jbond: profile::statistics::explorer::misc_jobs: add graphite_host global [puppet] - 10https://gerrit.wikimedia.org/r/610039 [14:04:56] (03PS4) 10Jbond: profile::cassandra::single_instance: update to graphite_hosts global [puppet] - 10https://gerrit.wikimedia.org/r/610036 [14:05:19] (03PS4) 10Jbond: statsite::instance: fix style violation in define [puppet] - 10https://gerrit.wikimedia.org/r/610045 [14:05:35] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response [14:05:35] ps://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:05:45] PROBLEM - 
Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [14:05:51] (03PS1) 10Hashar: Fix scap config for integration/docroot [puppet] - 10https://gerrit.wikimedia.org/r/610066 (https://phabricator.wikimedia.org/T256005) [14:05:56] <_joe_> uhm [14:06:41] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:06:47] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:06:57] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:07:09] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:07:20] (03CR) 10Ottomata: "Exists in jessie, stretch and buster." 
[puppet] - 10https://gerrit.wikimedia.org/r/610065 (owner: 10Ottomata) [14:07:55] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:08:12] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:08:24] (03CR) 10Jcrespo: [C: 03+2] Fix scap config for integration/docroot [puppet] - 10https://gerrit.wikimedia.org/r/610066 (https://phabricator.wikimedia.org/T256005) (owner: 10Hashar) [14:08:25] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:08:32] <_joe_> not sure why a rollign restart of restbase would cause all this [14:09:39] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:11:15] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:11:41] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:11:45] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:12:08] (03PS1) 10Ottomata: eventgate-* bump to 2020-07-07-140523-production to get updated schema repos [deployment-charts] - 10https://gerrit.wikimedia.org/r/610067 
(https://phabricator.wikimedia.org/T116719) [14:13:15] (03CR) 10Jbond: "ready" [puppet] - 10https://gerrit.wikimedia.org/r/610035 (owner: 10Jbond) [14:13:17] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:13:25] (03PS1) 10Hashar: Revert "Fix scap config for integration/docroot" [puppet] - 10https://gerrit.wikimedia.org/r/610068 (https://phabricator.wikimedia.org/T256005) [14:13:27] (03CR) 10Elukey: [C: 03+1] "The target host is stat1007, all good afaics: https://puppet-compiler.wmflabs.org/compiler1003/23748/stat1007.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/610039 (owner: 10Jbond) [14:13:42] (03CR) 10Ottomata: [C: 03+2] eventgate-* bump to 2020-07-07-140523-production to get updated schema repos [deployment-charts] - 10https://gerrit.wikimedia.org/r/610067 (https://phabricator.wikimedia.org/T116719) (owner: 10Ottomata) [14:13:52] (03CR) 10CDanis: [C: 03+1] standard_packages - add httpie [puppet] - 10https://gerrit.wikimedia.org/r/610065 (owner: 10Ottomata) [14:16:20] !log hashar@deploy1001 Started deploy [integration/docroot@708d3eb]: (no justification provided) [14:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:29] !log hashar@deploy1001 Finished deploy [integration/docroot@708d3eb]: (no justification provided) (duration: 00m 09s) [14:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:46] jouncebot: next [14:16:46] In 1 hour(s) and 43 minute(s): Puppet request window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200707T1600) [14:17:04] (03CR) 10Elukey: [C: 03+2] profile::mediawiki::mcrouter_wancache: send probe after 60s [puppet] - 10https://gerrit.wikimedia.org/r/607026 (https://phabricator.wikimedia.org/T255511) (owner: 10Elukey) [14:19:00] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are 
healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:20:20] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [14:20:20] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [14:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:57] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:21:55] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:22:17] (03Abandoned) 10Hashar: Revert "Fix scap config for integration/docroot" [puppet] - 10https://gerrit.wikimedia.org/r/610068 (https://phabricator.wikimedia.org/T256005) (owner: 10Hashar) [14:22:55] (03CR) 10Andrew Bogott: [C: 03+2] Galera: use mariadb service name rather than mysql [puppet] - 10https://gerrit.wikimedia.org/r/609834 (https://phabricator.wikimedia.org/T257231) (owner: 10Andrew Bogott) [14:24:05] (03PS1) 10Peter.ovchyn: Add defaults for initial state for sidebar. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/610069 (https://phabricator.wikimedia.org/T254230) [14:24:22] (03CR) 10QChris: [C: 03+1] mariadb: remove ferm firewall hole for gerrit servers [puppet] - 10https://gerrit.wikimedia.org/r/609884 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [14:24:28] (03PS2) 10Ottomata: standard_packages - add httpie [puppet] - 10https://gerrit.wikimedia.org/r/610065 [14:24:47] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [14:24:47] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [14:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:36] (03PS1) 10Andrew Bogott: galera: fully/qualify/path to the init script we're removing [puppet] - 10https://gerrit.wikimedia.org/r/610070 [14:26:31] (03CR) 10Andrew Bogott: [C: 03+2] galera: fully/qualify/path to the init script we're removing [puppet] - 10https://gerrit.wikimedia.org/r/610070 (owner: 10Andrew Bogott) [14:27:57] (03CR) 10Ottomata: [C: 03+2] standard_packages - add httpie [puppet] - 10https://gerrit.wikimedia.org/r/610065 (owner: 10Ottomata) [14:29:21] (03CR) 10CDanis: "Can the ssh binary itself accept this file? For T257219 I was imagining something that was drop-in for use as a known_hosts.d file." [puppet] - 10https://gerrit.wikimedia.org/r/609796 (https://phabricator.wikimedia.org/T257219) (owner: 10Jbond) [14:30:19] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . 
[14:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:25] !log replacing msw-a5,a6,a7 and a8 [14:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:54] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [14:31:54] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [14:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:22] (03CR) 10CDanis: [C: 03+1] "I think this is probably good enough, although I speculate it might be possible to burst higher than 100 rps even as a human if you're doi" [puppet] - 10https://gerrit.wikimedia.org/r/610027 (owner: 10Ema) [14:32:59] PROBLEM - Host conf2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:33:03] PROBLEM - Host db2079.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:33:09] PROBLEM - Host graphite2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:33:17] PROBLEM - Host db2097.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:33:25] PROBLEM - Host db2122.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:33:27] PROBLEM - Host es2017.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:33:30] (03CR) 10QChris: [C: 04-1] "LGTM. CR-1 only because of a question about cleanup." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609875 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [14:33:35] (03PS1) 10Elukey: mcrouter: avoid systemd unit restart when config file change [puppet] - 10https://gerrit.wikimedia.org/r/610071 [14:33:37] RECOVERY - Host graphite2003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.67 ms [14:34:05] PROBLEM - kubelet operational latencies on kubestage1002 is CRITICAL: instance=kubestage1002.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:34:29] (03CR) 10CDanis: [C: 03+1] varnish: Facebook temporary experiment is permanent [puppet] - 10https://gerrit.wikimedia.org/r/610028 (https://phabricator.wikimedia.org/T192688) (owner: 10Ema) [14:35:17] any Revisions hits at https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-deprecated ? [14:35:28] (03CR) 10QChris: [C: 03+1] acme_chief: remove gerrit-test [puppet] - 10https://gerrit.wikimedia.org/r/609878 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [14:35:50] (03CR) 10CDanis: "I'm a little wary to deny all traffic incl API for too long a period -- I'm mostly worried about 'good' bots being unduly affected. 
Do yo" [puppet] - 10https://gerrit.wikimedia.org/r/610031 (owner: 10Ema) [14:36:20] (03CR) 10CDanis: [C: 03+1] ATS: handle backend checks at healthcheck.wm.org/ats-be too [puppet] - 10https://gerrit.wikimedia.org/r/610041 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [14:36:34] (03CR) 10CDanis: [C: 03+1] varnish: send backend probes to healthcheck.wm.org/ats-be [puppet] - 10https://gerrit.wikimedia.org/r/610042 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [14:36:48] (03CR) 10CDanis: [C: 03+1] varnish: handle checks at healthcheck.wm.org/varnish-fe too [puppet] - 10https://gerrit.wikimedia.org/r/610046 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [14:36:49] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.01008 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:37:04] (03CR) 10CDanis: [C: 03+1] LVS: update Varnish healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/610047 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [14:37:30] (03CR) 10RLazarus: [C: 03+1] mcrouter: avoid systemd unit restart when config file change [puppet] - 10https://gerrit.wikimedia.org/r/610071 (owner: 10Elukey) [14:38:21] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [14:38:21] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . 
[14:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:29] RECOVERY - Host conf2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.49 ms [14:38:35] RECOVERY - Host db2079.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.44 ms [14:38:36] (03CR) 10CDanis: [C: 03+1] ATS: handle backend checks at healthcheck.wm.org/ats-be too (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/610041 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [14:38:49] RECOVERY - Host db2097.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.47 ms [14:38:59] RECOVERY - Host db2122.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.76 ms [14:39:03] RECOVERY - Host es2017.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.46 ms [14:39:21] ottomata: httpie : Depends: python-requests but it is not going to be installed :( [14:39:27] this on mc1034 [14:39:49] <_joe_> ottomata: revert please [14:40:02] reverting [14:40:10] <_joe_> also httpie is hardly a package that should be on every host imho, but that's besides my point :) [14:40:13] RECOVERY - kubelet operational latencies on kubestage1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:40:19] RECOVERY - Keyholder SSH agent on deploy1001 is OK: OK: Keyholder is armed with all configured keys. 
https://wikitech.wikimedia.org/wiki/Keyholder [14:40:35] (03PS1) 10Ottomata: Revert "standard_packages - add httpie" [puppet] - 10https://gerrit.wikimedia.org/r/609999 [14:41:01] _joe_: seemed like if we had curl everywhere httpie might be nice too [14:41:19] i wanted it on deployment host, and also the other hosts I use, but I can do those manually instead of in standard [14:41:26] <_joe_> ottomata: I would say having curl everywhere is a counterargument to it :) [14:41:40] 10Operations, 10Traffic: Consolidate misc servers at edge sites - https://phabricator.wikimedia.org/T257323 (10BBlack) p:05Triage→03Medium [14:41:41] <_joe_> or you can add it to the role for the cumin and deployment hosts [14:41:46] aye [14:41:48] <_joe_> via puppet [14:41:55] ha i see you like to make it hard to test things, you like a challenge :) [14:42:11] (03CR) 10Ottomata: [C: 03+2] Revert "standard_packages - add httpie" [puppet] - 10https://gerrit.wikimedia.org/r/609999 (owner: 10Ottomata) [14:42:11] <_joe_> also if you want to test stuff with http [14:42:16] <_joe_> there is httpbb [14:42:19] don't know it! 
[14:42:30] <_joe_> rzl: ^^ [14:42:33] oh cool [14:42:41] <_joe_> do some promotion of our baby :P [14:42:43] I haven't read back but https://wikitech.wikimedia.org/wiki/httpbb 👋 [14:42:48] oh that is awesome [14:42:59] that is mostly what I'm doing, esp when deploying eventgaet [14:43:02] 10Operations, 10Traffic: Consolidate edge bastion server into ganeti - https://phabricator.wikimedia.org/T257324 (10BBlack) [14:43:21] <_joe_> ottomata: the idea is to be able to write detailed tests and persist them into a file [14:43:28] this is great [14:43:28] 10Operations, 10Traffic: Consolidate edge bastion server into ganeti - https://phabricator.wikimedia.org/T257324 (10BBlack) p:05Triage→03Medium [14:43:31] so great [14:43:36] yeah, you can see the appserver httpbb tests linked from that page in the puppet repo [14:43:55] hm, would it appropriate to add some of those test cases into deployment-charts? [14:44:06] (03PS1) 10Ppchelko: Remove restbase2009 from RESTBase cassandra seeds [puppet] - 10https://gerrit.wikimedia.org/r/610072 (https://phabricator.wikimedia.org/T256863) [14:44:07] then I could just have them along with the helmfile services [14:44:14] (03CR) 10Elukey: [C: 03+2] mcrouter: avoid systemd unit restart when config file change [puppet] - 10https://gerrit.wikimedia.org/r/610071 (owner: 10Elukey) [14:44:17] i have a very manual version of this in my home dir [14:44:23] with a bunch of example events [14:44:30] that sounds non-crazy to me but I don't have a great sense of helm personally [14:44:31] and a curl wrapper that posts them [14:44:51] ottomata: ok to puppet-merge your change? 
[14:44:58] that does sound like exactly the kind of workflow httpbb is meant to replace yeah [14:45:10] I guess so yes [14:45:11] :) [14:45:11] (03CR) 10Jbond: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/609796 (https://phabricator.wikimedia.org/T257219) (owner: 10Jbond) [14:45:16] yes plz sorry elukey got distracted reading aboyt httpbb :p [14:45:36] rzl: can it post? [14:45:49] ottomata: yep, "method: POST" [14:45:57] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:46:09] ottomata: you are perfectly excused since you were talking with two nice gentlemen :D [14:46:38] rzl looking for more docs, and post body? [14:46:42] (03CR) 10QChris: [C: 03+1] gerrit: stop rsyncing to gerrit1002 [puppet] - 10https://gerrit.wikimedia.org/r/609883 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [14:46:43] and what about --resolve :p [14:46:43] ottomata: that said, I forget if I actually added request body support or just thought about it :P happy to, if it's missing [14:46:43] :) [14:46:46] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Remove restbase2009 from RESTBase cassandra seeds [puppet] - 10https://gerrit.wikimedia.org/r/610072 (https://phabricator.wikimedia.org/T256863) (owner: 10Ppchelko) [14:47:20] 10Operations, 10Traffic: Consolidate edge dnsbox servers into ganeti - https://phabricator.wikimedia.org/T257326 (10BBlack) [14:47:29] --resolve is the default behavior, it takes one or more host names on the command line and considers them independently of the Host header [14:47:40] (host names or addresses or whatevs) [14:47:42] (03CR) 10QChris: [C: 03+1] site/DHCP/partman: decom gerrit1002 [puppet] - 10https://gerrit.wikimedia.org/r/609879 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [14:47:45] (03PS2) 10JMeybohm: Don't include "backports" and "thirdparty" components into Stretch images [puppet] - 
10https://gerrit.wikimedia.org/r/610049 (https://phabricator.wikimedia.org/T257327) (owner: 10Muehlenhoff) [14:47:45] gr8 [14:47:58] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [14:47:59] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [14:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:02] 10Operations, 10Traffic: Consolidate edge bastion server into ganeti - https://phabricator.wikimedia.org/T257324 (10BBlack) [14:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:06] (03CR) 10CDanis: [C: 03+1] ATS: handle backend checks at healthcheck.wm.org/ats-be too (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/610041 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [14:48:19] 10Operations, 10Traffic: Consolidate edge dnsbox servers into ganeti - https://phabricator.wikimedia.org/T257326 (10BBlack) p:05Triage→03Medium [14:48:26] (03PS2) 10JMeybohm: Stop including backports in Stretch images [puppet] - 10https://gerrit.wikimedia.org/r/610050 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [14:48:49] 10Operations, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10JMeybohm) [14:49:04] ottomata: yeah, looks like there's nothing for request body yet -- if you feel like a bite-sized python change, feel free to add it to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/httpbb/+/refs/heads/master/httpbb/main.py and send me the review [14:49:08] otherwise I can add it this week [14:49:24] it certainly *should* be supported [14:49:30] ok thanks rzl might not have time atm but if/when i get around to making some test cases i'd be happy to suubmit a patch [14:50:43] ottomata: cool 👍 are you going to want form data, raw blob, both, something 
else? [14:51:39] (03CR) 10CDanis: [C: 03+1] varnish: handle checks at healthcheck.wm.org/varnish-fe too (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/610046 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [14:51:42] (03PS1) 10Elukey: mcrouter: avoid unit restarts when /etc/default/mcrouter changes [puppet] - 10https://gerrit.wikimedia.org/r/610077 [14:51:51] rzl: if you have time --^ :( [14:52:13] (03CR) 10CDanis: [C: 03+1] icinga: update Varnish healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/610048 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [14:52:22] (03CR) 10CDanis: [C: 03+1] varnish: stop responding to varnishcheck.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/610051 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [14:52:38] (03CR) 10CDanis: [C: 03+1] ATS: stop responding to varnishcheck/status [puppet] - 10https://gerrit.wikimedia.org/r/610052 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [14:52:41] (03PS12) 10QChris: Gerrit: Rename ssh_host_key to ssh_host_rsa_key [puppet] - 10https://gerrit.wikimedia.org/r/556265 (https://phabricator.wikimedia.org/T240266) (owner: 10Paladox) [14:53:01] hmmm, rzl eithewr raw blob or dict [14:53:03] <_joe_> !log restarted restbase on restbase2022 after removing restbase2009 from the cassandra seeds [14:53:04] (03CR) 10RLazarus: [C: 03+1] mcrouter: avoid unit restarts when /etc/default/mcrouter changes [puppet] - 10https://gerrit.wikimedia.org/r/610077 (owner: 10Elukey) [14:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:06] hop [14:53:09] oh right tests are yaml [14:53:19] (03CR) 10QChris: "> Nit: in `hieradata/cloud/eqiad1/devtools/common.yaml` there is also this line" [puppet] - 10https://gerrit.wikimedia.org/r/556265 (https://phabricator.wikimedia.org/T240266) (owner: 10Paladox) [14:53:28] rzl: actually, ok so reasonable request would just to be to provide the body in the yaml test case [14:53:29] 
(03CR) 10Elukey: [C: 03+2] mcrouter: avoid unit restarts when /etc/default/mcrouter changes [puppet] - 10https://gerrit.wikimedia.org/r/610077 (owner: 10Elukey) [14:53:39] post_body: [14:53:39] ... [14:54:04] what would be even MORE amazing....would be a way to lookup the post body given a URL and a JSON pointer or a json $ref [14:54:11] elukey: done -- in principle it'd be nice if we could have an alert for "this has been changed but mcrouter not restarted for $period_of_time" so that we don't get surprised sometime in the future [14:54:14] schemas have examples already :p [14:54:23] ottomata: for form data I was just thinking a nested mapping [14:54:25] rzl took the words out of my mouth [14:54:39] so like, post_data:\n\tfoo: bar [14:54:40] rzl: yeah it seems a good thing [14:54:41] e.g. [14:54:41] https://schema.wikimedia.org/repositories//primary/jsonschema/test/event/1.0.0.yaml [14:55:04] so if i could use a $ref pointer to get that........ [14:55:06] like [14:55:12] (03CR) 10QChris: [C: 04-1] "Keeping CR-1 to avoid overlooking the still open remark about ssh key formats from PS11" [puppet] - 10https://gerrit.wikimedia.org/r/556265 (https://phabricator.wikimedia.org/T240266) (owner: 10Paladox) [14:55:30] post_data: [14:55:30] $ref: https://schema.wikimedia.org/repositories//primary/jsonschema/test/event/1.0.0.yaml#/examples[0] [14:55:33] that would be grand [14:55:43] but, i consider ^ to be an unreasonable feature request :p [14:55:50] I can see why you want it but I want to think about that some more -- so far, the tests are hermetic and I consider that a feature [14:55:55] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from 
storage) timed out before a response was re [14:55:55] kitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:55:57] yeah indeed [14:57:47] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:58:31] !log hashar@deploy1001 Started deploy [integration/docroot@708d3eb]: Second deployment to ensure everything works fine. Thank you jynus [14:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:36] !log hashar@deploy1001 Finished deploy [integration/docroot@708d3eb]: Second deployment to ensure everything works fine. Thank you jynus (duration: 00m 04s) [14:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:34] (03PS14) 10Jbond: icinga: switch icinga to use apereo cas for authentication [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) [15:00:30] (03PS8) 10JMeybohm: Add discovery records for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609403 (https://phabricator.wikimedia.org/T253843) (owner: 10Alexandros Kosiaris) [15:01:03] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [15:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:26] (03CR) 10CDanis: [C: 03+1] profile::librenms: update to use lookup instead of hiera call [puppet] - 10https://gerrit.wikimedia.org/r/610018 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [15:02:45] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [15:02:45] !log otto@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . 
[15:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:02] (03CR) 10CDanis: librenms: add support for apereo cas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/610030 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [15:03:06] (03CR) 10Jbond: "> Patch Set 13:" [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond) [15:04:11] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (10wiki_willy) a:03Jclark-ctr @Jclark-ctr - can you send in the RMA for this one, when you get in later today? Thanks, Willy [15:04:36] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [15:04:36] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [15:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:43] PROBLEM - Check systemd state on analytics1041 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:54] (03CR) 10Muehlenhoff: icinga: switch icinga to use apereo cas for authentication (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond) [15:05:10] 10Operations: Create ssh keypair for integration/docroot deployment with scap - https://phabricator.wikimedia.org/T256138 (10hashar) @ema the ssh key is described with `ema@ariel` which show up when running `keyholder status`. After the key got armed, I was looking for `deploy_ci_docroot`, or at least something... 
[15:05:44] (03PS8) 10Jbond: librenms: add support for apereo cas [puppet] - 10https://gerrit.wikimedia.org/r/610030 (https://phabricator.wikimedia.org/T256958) [15:06:00] (03CR) 10Jbond: "updated thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/610030 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [15:06:03] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [15:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:30] (03PS15) 10Jbond: icinga: switch icinga to use apereo cas for authentication [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) [15:06:33] RECOVERY - Check systemd state on analytics1041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:06:51] (03CR) 10Jbond: icinga: switch icinga to use apereo cas for authentication (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond) [15:08:13] 10Operations: Create ssh keypair for integration/docroot deployment with scap - https://phabricator.wikimedia.org/T256138 (10jcrespo) @ema please sync with me, I am guessing we could regenerate the key with a better identifier, if that is the issue. Other keys use the path /etc/keyholder.d/apache2modsec and that... [15:09:12] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [15:09:12] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [15:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:57] ottomata: scap showed no issues? [15:11:10] jynus: ? 
[15:11:42] we just deployed a configuration update for scap on deploy1001 [15:11:52] I was asking if you saw any error on deployment? [15:11:52] oh these deploys are ^^ helmfile k8s [15:11:53] not scap [15:11:56] ah, ok [15:12:01] 10Operations, 10Keyholder: After harming a new key in keyholder, the identity file path does not show up - https://phabricator.wikimedia.org/T257329 (10hashar) [15:12:12] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [15:12:12] !log otto@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [15:12:13] 10Operations, 10Keyholder: After harming a new key in keyholder, the identity file path does not show up - https://phabricator.wikimedia.org/T257329 (10hashar) [15:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:44] (03CR) 10CDanis: "Can you run PCC against both netmon1002 and also netmon200x?
Maybe I'm not awake yet but it's not clear to me what happens with the backu" [puppet] - 10https://gerrit.wikimedia.org/r/610030 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [15:12:52] 10Operations, 10Keyholder: After arming a new key in keyholder, the identity file path does not show up - https://phabricator.wikimedia.org/T257329 (10jcrespo) [15:12:53] (03PS1) 10Herron: add thirdparty/elastic78 component [puppet] - 10https://gerrit.wikimedia.org/r/610079 (https://phabricator.wikimedia.org/T234854) [15:12:56] 10Operations, 10CX-cxserver, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, and 3 others: service-runner apps running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (10Ottomata) [15:13:34] (03PS6) 10MSantos: charts for push-notification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) [15:13:40] 10Operations, 10CX-cxserver, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, and 3 others: service-runner apps running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (10Ottomata) All eventgate-* services are updated to service-runner 2.7.7 [15:13:58] !log hnowlan@deploy1001 Started restart [restbase/deploy@05b8bd5]: Restarting restbase after removal of restbase2009 [15:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:31] (03CR) 10jerkins-bot: [V: 04-1] charts for push-notification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: 10MSantos) [15:14:53] (03PS9) 10Jbond: librenms: add support for apereo cas [puppet] - 10https://gerrit.wikimedia.org/r/610030 (https://phabricator.wikimedia.org/T256958) [15:14:58] (03CR) 10CDanis: [C: 03+1] "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/609796 (https://phabricator.wikimedia.org/T257219) (owner: 10Jbond) [15:15:07] (03CR) 10Jbond: 
"chgeck experimental" [puppet] - 10https://gerrit.wikimedia.org/r/610030 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [15:15:14] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/610030 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [15:16:15] (03CR) 10Filippo Giunchedi: [C: 03+1] "yep LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond) [15:17:46] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::librenms: update to use lookup instead of hiera call [puppet] - 10https://gerrit.wikimedia.org/r/610018 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [15:18:01] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:18:03] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.003149 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:18:17] (03PS7) 10MSantos: charts for push-notification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) [15:18:20] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::statistics::explorer::misc_jobs: add graphite_host global [puppet] - 10https://gerrit.wikimedia.org/r/610039 (owner: 10Jbond) [15:18:28] (03PS3) 10Privacybatm: transfer.py: Refactor split_target function [software/transferpy] - 10https://gerrit.wikimedia.org/r/609778 (https://phabricator.wikimedia.org/T248256) [15:18:52] (03CR) 10CDanis: [C: 03+1] "ah sorry I now see what I had missed before, thanks! 
lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/610030 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [15:19:17] (03CR) 10Privacybatm: "Please let me know how the tests are. Thank you!" [software/transferpy] - 10https://gerrit.wikimedia.org/r/609778 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [15:19:27] (03CR) 10jerkins-bot: [V: 04-1] charts for push-notification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: 10MSantos) [15:19:43] (03CR) 10Jbond: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/610030 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [15:19:45] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:20:19] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond) [15:21:03] 10Operations, 10Toolforge: Get rid of Toolforge home page check from shinken - https://phabricator.wikimedia.org/T128615 (10Nintendofan885) [15:23:00] 10Operations, 10Toolforge: Get rid of Toolforge home page check from shinken - https://phabricator.wikimedia.org/T128615 (10Majavah) [15:23:20] (03PS4) 10Privacybatm: transfer.py: Refactor split_target function [software/transferpy] - 10https://gerrit.wikimedia.org/r/609778 (https://phabricator.wikimedia.org/T248256) [15:24:29] (03PS1) 10Hnowlan: restbase: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/610080 (https://phabricator.wikimedia.org/T256863) [15:24:57] PROBLEM - Check systemd state on analytics1041 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:24:57] PROBLEM - Druid historical on analytics1041 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:25:16] (03CR) 10Privacybatm: "Please let me know how the tests are. Thank you!" [software/transferpy] - 10https://gerrit.wikimedia.org/r/609778 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [15:25:21] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:25:29] multiple scripts on toolforge have been broken today [15:25:32] (03CR) 10Ppchelko: [C: 03+1] "ouch" [puppet] - 10https://gerrit.wikimedia.org/r/610080 (https://phabricator.wikimedia.org/T256863) (owner: 10Hnowlan) [15:25:41] ah, wrong chan [15:25:57] 10Operations, 10ops-eqsin, 10netops: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154 (10RobH) RMA shipment info: UPS: 75095182 https://www.upspostsaleslogistics.com/cfw/tracking.screen RMA # R200300857 Delivery is set for tomorrow. 
[15:26:01] (03CR) 10Jbond: "> Patch Set 8: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/609796 (https://phabricator.wikimedia.org/T257219) (owner: 10Jbond) [15:27:06] !log root-tmux on cumin1001 - cumin 'c:profile::mediawiki::mcrouter_wancache' '/usr/local/sbin/restart-mcrouter' -b 2 -s 5 - roll restart of mw-mcrouter to pick up new settings - T255511 [15:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:11] T255511: mcrouter memcached flapping in gutter pool - https://phabricator.wikimedia.org/T255511 [15:27:27] (03CR) 10Hnowlan: [C: 03+2] restbase: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/610080 (https://phabricator.wikimedia.org/T256863) (owner: 10Hnowlan) [15:30:45] (03PS2) 10CRusnov: puppetdb microservice: Change allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/609835 [15:31:08] (03CR) 10CRusnov: [C: 03+2] puppetdb microservice: Change allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/609835 (owner: 10CRusnov) [15:33:29] PROBLEM - Druid middlemanager on analytics1041 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:34:03] (03CR) 10CRusnov: [V: 03+2 C: 03+2] puppetdb microservice: Change allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/609835 (owner: 10CRusnov) [15:34:30] (03Abandoned) 10Jcrespo: Revert "scap configuration for integration/docroot" [puppet] - 10https://gerrit.wikimedia.org/r/609998 (owner: 10Jcrespo) [15:34:57] PROBLEM - Druid coordinator on analytics1041 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:35:11] downtiming 1041 [15:35:56] (03CR) 10Jcrespo: "There is some overlap with prexisting "TestArgumentParsing" (I am guessing you knew that), but I think looks ok. 
Let me try to break the e" [software/transferpy] - 10https://gerrit.wikimedia.org/r/609778 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [15:36:05] RECOVERY - Check systemd state on analytics1041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:36:05] RECOVERY - Druid historical on analytics1041 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:36:33] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:36:49] RECOVERY - Druid coordinator on analytics1041 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:37:11] RECOVERY - Druid middlemanager on analytics1041 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:38:46] !log hnowlan@deploy1001 Started restart [restbase/deploy@05b8bd5]: Restarting restbase after removal of restbase2009 [15:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:33] (03CR) 10Herron: [C: 03+1] icinga: switch icinga to use apereo cas for authentication [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond) [15:40:45] PROBLEM - Host ms-be2046.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:40:50] 10Operations, 10Keyholder: After arming a new key in keyholder, the identity file path does not show up - https://phabricator.wikimedia.org/T257329 (10jcrespo) lets also make sure we document the new key/identity at https://wikitech.wikimedia.org/wiki/Keyholder using pwstore/pw.git/deployment-key-passphrase be... 
[15:43:57] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:45:14] (03CR) 10Privacybatm: [C: 04-1] "> Patch Set 4:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/609778 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [15:45:41] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:46:37] RECOVERY - Host ms-be2046.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.71 ms [15:48:13] (03PS1) 10Lucas Werkmeister (WMDE): Revert "Don’t load $wgWBClientSettings in WikibaseClient.php" [extensions/Wikibase] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/610086 (https://phabricator.wikimedia.org/T257296) [15:49:08] (03CR) 10Filippo Giunchedi: [C: 03+1] add thirdparty/elastic78 component [puppet] - 10https://gerrit.wikimedia.org/r/610079 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [15:49:10] !log running nodetool removenode for restbase2009-a [15:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:08] godog, _joe_: I’d like to deploy the above revert (for train blocker T257296), but CI probably won’t finish within 10 minutes – is it okay if I steal some of your Puppet window for that? [15:50:09] T257296: Beta cluster wikidata fails to load scripts with "OutOfBoundsException from line 60 of /srv/mediawiki/php-master/extensions/Wikibase/lib/includes/SettingsArray.php: Attempt to get non-existing setting "repoScriptPath" - https://phabricator.wikimedia.org/T257296 [15:50:39] <_joe_> Lucas_WMDE: go on [15:50:43] thanks [15:50:43] woop woop! 
[15:50:44] Lucas_WMDE: +1 [15:50:56] <_joe_> jouncebot: next [15:50:56] In 0 hour(s) and 9 minute(s): Puppet request window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200707T1600) [15:50:59] there are no puppet swat patches FWIW [15:51:08] <_joe_> yeah [15:51:09] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Revert "Don’t load $wgWBClientSettings in WikibaseClient.php" [extensions/Wikibase] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/610086 (https://phabricator.wikimedia.org/T257296) (owner: 10Lucas Werkmeister (WMDE)) [15:51:11] (03CR) 10Herron: [C: 03+2] add thirdparty/elastic78 component [puppet] - 10https://gerrit.wikimedia.org/r/610079 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [15:51:15] /swat/request/ ;) [15:51:18] oh we dropped SWAT [15:51:19] <_joe_> it's not called swat anymore, *finally* [15:51:20] yeah, that [15:51:24] TIL [15:51:25] oh right [15:51:27] <_joe_> it's not like I didn't ask for it 6 years ago [15:51:31] :D [15:51:41] <_joe_> but back then not liking police-related terms wasn't a la page [15:51:50] <_joe_> happy to see the US coming around to it :) [15:52:49] <_joe_> godog: I vote we rename the puppet window "1312" [15:53:55] mmhh that went over my head _joe_, like 1331 but not ? [15:54:28] 1337 that is [15:55:20] but battery is almost out so I shall go [15:57:05] PROBLEM - Host cloudcontrol2001-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:00:04] godog and _joe_: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200707T1600). 
[16:02:59] RECOVERY - Host cloudcontrol2001-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.72 ms [16:03:25] (03PS5) 10Privacybatm: transfer.py: Refactor split_target function [software/transferpy] - 10https://gerrit.wikimedia.org/r/609778 (https://phabricator.wikimedia.org/T248256) [16:04:00] (03CR) 10Privacybatm: "I think now it is okay!" [software/transferpy] - 10https://gerrit.wikimedia.org/r/609778 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [16:04:12] (03PS1) 10Elukey: Fix Java package settings across Druid clusters [puppet] - 10https://gerrit.wikimedia.org/r/610084 [16:05:14] (03CR) 10Elukey: [C: 03+2] Fix Java package settings across Druid clusters [puppet] - 10https://gerrit.wikimedia.org/r/610084 (owner: 10Elukey) [16:05:49] herron: o/ [16:06:02] is it ok to puppet merge? [16:06:02] hey elukey [16:06:10] ah! yup please go ahead [16:06:58] done! [16:07:17] 10Operations, 10Cloud-Services, 10Toolforge, 10observability: Add other Tools administrators to the Icinga notification group - https://phabricator.wikimedia.org/T128715 (10Bstorm) The users are defined in the secrets management layer of puppet while the group list is in the more public puppet. This may be... [16:07:52] 10Operations, 10Cloud-Services, 10Toolforge, 10observability: Add other Tools administrators to the Icinga notification group - https://phabricator.wikimedia.org/T128715 (10Bstorm) [16:09:39] 10Operations, 10Toolforge, 10observability: Make icinga-wm report Tools homepage check at #wikimedia-cloud, too - https://phabricator.wikimedia.org/T128716 (10Bstorm) 05Open→03Declined I don't think it's necessary and am closing it. Please re-open if folks want it. We don't want to fill -cloud with spam! 
[16:11:36] (03CR) 10jerkins-bot: [V: 04-1] Revert "Don’t load $wgWBClientSettings in WikibaseClient.php" [extensions/Wikibase] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/610086 (https://phabricator.wikimedia.org/T257296) (owner: 10Lucas Werkmeister (WMDE)) [16:14:54] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "> - quibble-vendor-mysql-php72-noselenium-docker https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php72-noselenium-docker/211" [extensions/Wikibase] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/610086 (https://phabricator.wikimedia.org/T257296) (owner: 10Lucas Werkmeister (WMDE)) [16:18:47] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:22:27] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:23:17] 10Operations, 10Phatality, 10observability: Deploying "Phatality" plugin for Kibana invokes oom-killer on logstash::collector nodes - https://phabricator.wikimedia.org/T237706 (10mmodell) @fgiunchedi I haven't deployed phatality because of the severity of this. I'm almost 100% certain that this will recur ev... [16:23:57] 10Operations, 10Phatality, 10observability: Deploying "Phatality" plugin for Kibana invokes oom-killer on logstash::collector nodes - https://phabricator.wikimedia.org/T237706 (10mmodell) There are actually a couple of minor updates to phatality that have been somewhat blocked by this issue and I just haven'... 
[16:27:49] (03PS1) 10Lucas Werkmeister (WMDE): Empty change to test CI [extensions/Wikibase] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/610112 [16:33:41] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production infrastructure services for jgiannelos - https://phabricator.wikimedia.org/T257187 (10MSantos) >>! In T257187#6282174, @thcipriani wrote: >>>! In T257187#6280782, @jcrespo wrote: >> @thcipriani I believe you will be the... [16:36:03] (03CR) 10jerkins-bot: [V: 04-1] Revert "Don’t load $wgWBClientSettings in WikibaseClient.php" [extensions/Wikibase] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/610086 (https://phabricator.wikimedia.org/T257296) (owner: 10Lucas Werkmeister (WMDE)) [16:36:26] (03CR) 10Lucas Werkmeister (WMDE): [V: 03+2 C: 03+2] Revert "Don’t load $wgWBClientSettings in WikibaseClient.php" [extensions/Wikibase] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/610086 (https://phabricator.wikimedia.org/T257296) (owner: 10Lucas Werkmeister (WMDE)) [16:37:45] ok I’m deploying that backport now [16:37:52] trying it first on mwdebug1001 [16:38:48] ok, seems to fix the issue on mwdebug1001 [16:38:50] syncing [16:39:26] (03CR) 10Andrew Bogott: [C: 03+1] haveged: install haveged on VM'si debian < buster by default [puppet] - 10https://gerrit.wikimedia.org/r/609772 (owner: 10Jbond) [16:40:19] (03CR) 10Andrew Bogott: "This is just fine; this file is a sample file so specific syntax isn't super important." 
[puppet] - 10https://gerrit.wikimedia.org/r/608270 (owner: 10Jbond) [16:40:43] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.35.0-wmf.40/extensions/Wikibase: Backport: [[gerrit:610086|Revert "Don’t load $wgWBClientSettings in WikibaseClient.php" (T257296)]] (duration: 01m 10s) [16:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:49] T257296: Beta cluster wikidata fails to load scripts with "OutOfBoundsException from line 60 of /srv/mediawiki/php-master/extensions/Wikibase/lib/includes/SettingsArray.php: Attempt to get non-existing setting "repoScriptPath" - https://phabricator.wikimedia.org/T257296 [16:42:27] (03PS2) 10Andrew Bogott: openstack wmcs-prod-example.sh: fix shellcheck [puppet] - 10https://gerrit.wikimedia.org/r/608270 (owner: 10Jbond) [16:43:08] (03CR) 10Andrew Bogott: [C: 03+2] openstack wmcs-prod-example.sh: fix shellcheck [puppet] - 10https://gerrit.wikimedia.org/r/608270 (owner: 10Jbond) [16:47:31] (03CR) 10jerkins-bot: [V: 04-1] Empty change to test CI [extensions/Wikibase] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/610112 (owner: 10Lucas Werkmeister (WMDE)) [16:50:40] (03Abandoned) 10Lucas Werkmeister (WMDE): Empty change to test CI [extensions/Wikibase] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/610112 (owner: 10Lucas Werkmeister (WMDE)) [16:50:52] (03CR) 10Lucas Werkmeister (WMDE): "> Patch Set 1:" [extensions/Wikibase] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/610086 (https://phabricator.wikimedia.org/T257296) (owner: 10Lucas Werkmeister (WMDE)) [16:51:06] (03CR) 10Jdlrobson: [C: 03+1] "I will add this to https://wikitech.wikimedia.org/wiki/Deployments next week. I can do that along with the other change for Popupw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609490 (https://phabricator.wikimedia.org/T254675) (owner: 10Peter.ovchyn) [17:00:04] halfak and accraze: I seem to be stuck in Groundhog week. Sigh. 
Time for (yet another) Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200707T1700). [17:05:53] (03PS1) 10Krinkle: Improve logging for "main slot of revision not found in database" [core] (wmf/1.35.0-wmf.35) - 10https://gerrit.wikimedia.org/r/610087 (https://phabricator.wikimedia.org/T256127) [17:06:11] !log removed restbase2009-b from cassandra pool, removing restbase2009-c [17:06:12] (03PS2) 10Krinkle: Improve logging for "main slot of revision not found in database" [core] (wmf/1.35.0-wmf.35) - 10https://gerrit.wikimedia.org/r/610087 (https://phabricator.wikimedia.org/T256127) [17:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:22] (03Abandoned) 10Krinkle: Improve logging for "main slot of revision not found in database" [core] (wmf/1.35.0-wmf.35) - 10https://gerrit.wikimedia.org/r/610087 (https://phabricator.wikimedia.org/T256127) (owner: 10Krinkle) [17:08:31] (03PS1) 10Krinkle: Improve logging for "main slot of revision not found in database" [core] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/610088 (https://phabricator.wikimedia.org/T256127) [17:12:31] 10Operations, 10serviceops, 10wikitech.wikimedia.org: Install php-ldap on all MW appservers - https://phabricator.wikimedia.org/T237889 (10Andrew) *bump* [17:13:09] (03CR) 10BPirkle: [C: 03+1] "+1 is the highest gerrit will allow me to give. Approved for self-merge, if that's helpful." 
[core] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/610088 (https://phabricator.wikimedia.org/T256127) (owner: 10Krinkle) [17:14:14] bpirkle: wmf branches and wmf-config must always match production, person merging = same person deploying it [17:14:19] if you have deploy access, you should have +2 here [17:14:46] but ofc even then you might want to +1 if you don't plan to deploy right now [17:15:49] Krinkle: thanks for info [17:20:37] (03CR) 10Hashar: "recheck the CI job got updated to volume mount the files under /etc/logstash : https://gerrit.wikimedia.org/r/c/integration/config/+/6027" [puppet] - 10https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [17:25:20] (03CR) 10Muehlenhoff: [C: 03+2] Drop Puppet code which tries to install graphite-web from stretch-bpo [puppet] - 10https://gerrit.wikimedia.org/r/610043 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [17:27:42] 10Operations, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10MoritzMuehlenhoff) [17:28:24] (03PS11) 10Cwhite: profile: add loki output support to the logstash pipeline [puppet] - 10https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) [17:31:10] (03CR) 10Hashar: [C: 03+1] "Thank you Cole!" 
[puppet] - 10https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [17:35:18] (03CR) 10Krinkle: [C: 03+2] Improve logging for "main slot of revision not found in database" [core] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/610088 (https://phabricator.wikimedia.org/T256127) (owner: 10Krinkle) [17:35:20] (03PS1) 10Muehlenhoff: Remove obsolete apt::pin for librdkafka1 on eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/610120 (https://phabricator.wikimedia.org/T256877) [17:39:17] (03PS1) 10Muehlenhoff: Remove stretch-backports from bootstrapvz config [puppet] - 10https://gerrit.wikimedia.org/r/610121 (https://phabricator.wikimedia.org/T256881) [17:43:35] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T256201 (10KFrancis) @guergana.tzatchkova - I have the information I need. I sent the NDA for you to sign. Please be on the lookout for it. Thanks! [17:47:27] (03Abandoned) 10Muehlenhoff: Remove role::prometheus::k8s in favour of including the profile [puppet] - 10https://gerrit.wikimedia.org/r/575490 (owner: 10Muehlenhoff) [17:47:59] (03PS4) 10Muehlenhoff: Remove IDP defintions for logstash vhosts [puppet] - 10https://gerrit.wikimedia.org/r/607509 (https://phabricator.wikimedia.org/T246998) [17:49:35] Krinkle: scap says you have the lock on deploy1001, is that valid? 
[17:51:26] (03CR) 10Elukey: [C: 03+1] Remove obsolete apt::pin for librdkafka1 on eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/610120 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [17:54:06] (03PS13) 10Dzahn: Gerrit: Rename ssh_host_key to ssh_host_rsa_key [puppet] - 10https://gerrit.wikimedia.org/r/556265 (https://phabricator.wikimedia.org/T240266) (owner: 10Paladox) [17:54:46] !log finished removing restbase2009 from cassandra pool [17:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:11] hnowlan: yes, I'm waiting for Jenkins [17:57:22] (03Merged) 10jenkins-bot: Improve logging for "main slot of revision not found in database" [core] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/610088 (https://phabricator.wikimedia.org/T256127) (owner: 10Krinkle) [17:57:24] Krinkle: ah cool [17:57:27] which has now finished [17:57:33] took 25min :P [17:57:53] hnowlan: I can wait though, unlocked [17:58:06] Krinkle: I'm actually done for the day, don't mind me :) [17:58:11] ok [17:58:49] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Special:HideBanners is not really cacheable - https://phabricator.wikimedia.org/T256447 (10AndyRussG) Thanks @tstarling! As it turns out, current and upcoming browser restrictions on third-party cookies are... 
[17:59:13] !log imported (logstash|kibana|elasticsearch)-oss-7.8.0 into buster-wikimedia thirdparty/elastic78 [17:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200707T1800) [18:03:52] (03CR) 10Dzahn: [C: 03+1] Gerrit: Rename ssh_host_key to ssh_host_rsa_key (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/556265 (https://phabricator.wikimedia.org/T240266) (owner: 10Paladox) [18:08:48] !log krinkle@deploy1001 Synchronized php-1.35.0-wmf.40/includes/Revision/RevisionStore.php: I8f986daeab4 (duration: 01m 05s) [18:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:29] (03CR) 10Dzahn: [C: 03+1] Gerrit: Rename ssh_host_key to ssh_host_rsa_key (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/556265 (https://phabricator.wikimedia.org/T240266) (owner: 10Paladox) [18:10:25] !log krinkle@deploy1001 Synchronized w/: remove untracked test cookie file (duration: 01m 04s) [18:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:07] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10AndyRussG) Adding Traffic and Operations tags to ask for input about how Varnish caching for this new redirectin... 
[18:22:26] (03PS1) 10Krinkle: Enable wgForceHTTPS and wgCookieSameSite='None' (Beta Cluster) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610127 (https://phabricator.wikimedia.org/T256095) [18:22:47] bpirkle: ^ [18:22:54] (03CR) 10QChris: [C: 03+1] Gerrit: Rename ssh_host_key to ssh_host_rsa_key (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/556265 (https://phabricator.wikimedia.org/T240266) (owner: 10Paladox) [18:23:58] !log andrew@deploy1001 Started deploy [horizon/deploy@a39e86c]: update proxy UI to support editing existing proxies [18:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:23] !log andrew@deploy1001 Finished deploy [horizon/deploy@a39e86c]: update proxy UI to support editing existing proxies (duration: 03m 26s) [18:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:57] (03CR) 10Paladox: [C: 03+1] gerrit: stop rsyncing to gerrit1002 [puppet] - 10https://gerrit.wikimedia.org/r/609883 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [18:30:15] (03CR) 10Paladox: [C: 03+1] mariadb: remove ferm firewall hole for gerrit servers [puppet] - 10https://gerrit.wikimedia.org/r/609884 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [18:30:34] (03PS2) 10Dzahn: gerrit: stop rsyncing to gerrit1002 [puppet] - 10https://gerrit.wikimedia.org/r/609883 (https://phabricator.wikimedia.org/T239151) [18:30:59] 10Operations, 10Analytics, 10Traffic: Add backend field to webrequest Hive table - https://phabricator.wikimedia.org/T257354 (10Ottomata) [18:31:08] 10Operations, 10Analytics, 10Traffic: Add backend field to webrequest Hive table - https://phabricator.wikimedia.org/T257354 (10Ottomata) [18:31:13] (03CR) 10Dzahn: [C: 03+2] gerrit: stop rsyncing to gerrit1002 [puppet] - 10https://gerrit.wikimedia.org/r/609883 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [18:32:19] Notice: 
/Stage[main]/Ferm/File[/etc/ferm/conf.d/10_gerrit-migration-rsync]/ensure: removed [18:32:44] noop on prod gerrit. firewall closed on test gerrit. [18:34:56] (03CR) 10Dzahn: "> Patch Set 1: Code-Review-1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609875 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [18:35:29] !log andrew@deploy1001 Started deploy [horizon/deploy@eaa056e]: fix for proxy editing --bug 610130 [18:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:37] (03CR) 10BryanDavis: [C: 03+1] Change toolforge error pages to use toolforge logo instead of toollabs logo [puppet] - 10https://gerrit.wikimedia.org/r/610026 (owner: 10Majavah) [18:35:39] (03CR) 10QChris: [C: 03+1] zuul: remove gerrit-test connection and setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609875 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [18:37:56] (03PS2) 10Dzahn: zuul: remove gerrit-test connection and setup [puppet] - 10https://gerrit.wikimedia.org/r/609875 (https://phabricator.wikimedia.org/T239151) [18:38:47] !log andrew@deploy1001 Finished deploy [horizon/deploy@eaa056e]: fix for proxy editing --bug 610130 (duration: 03m 18s) [18:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:56] (03PS3) 10Dzahn: zuul: remove gerrit-test connection and setup [puppet] - 10https://gerrit.wikimedia.org/r/609875 (https://phabricator.wikimedia.org/T239151) [18:40:03] (03CR) 10QChris: [C: 03+1] zuul: remove gerrit-test connection and setup [puppet] - 10https://gerrit.wikimedia.org/r/609875 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [18:40:18] (03CR) 10Dzahn: [C: 03+2] zuul: remove gerrit-test connection and setup [puppet] - 10https://gerrit.wikimedia.org/r/609875 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [18:40:39] 10Operations, 10Wikimedia-Mailing-lists: "Uncaught bounce notification" from Yahoo and AOL - 
https://phabricator.wikimedia.org/T257241 (10Aklapper) There are bounce processing options mentioned on https://meta.wikimedia.org/wiki/Mailing_lists/Administration in case you want to fiddle. :) [18:40:43] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production infrastructure services for jgiannelos - https://phabricator.wikimedia.org/T257187 (10dr0ptp4kt) I believe what's appropriate here is the set of distinct permissions formed from the union of permissions granted to @Mholl... [18:43:07] Notice: /Stage[main]/Profile::Zuul::Merger/Sshkey[gerrit-test]/ensure: removed [18:43:19] Service[zuul-merger]/Service[zuul-merger]: Triggered 'refresh' [18:45:19] WARNING zuul.GerritEventConnector: Received unrecognized event type 'ref-replicated' from Gerrit. [18:45:33] (03PS1) 10Herron: logstash: set v7 cluster to version 7.8 [puppet] - 10https://gerrit.wikimedia.org/r/610135 (https://phabricator.wikimedia.org/T234854) [18:46:51] These ref-replicated should be ok. [18:47:10] (03PS3) 10Dzahn: acme_chief: remove gerrit-test [puppet] - 10https://gerrit.wikimedia.org/r/609878 (https://phabricator.wikimedia.org/T239151) [18:48:15] (03CR) 10Dzahn: [C: 03+2] acme_chief: remove gerrit-test [puppet] - 10https://gerrit.wikimedia.org/r/609878 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [18:48:27] (03CR) 10Cwhite: [C: 03+2] mtail: remove component and upgrade mtail to 3.0.0-rc35-3~wmf2 across the fleet [puppet] - 10https://gerrit.wikimedia.org/r/608721 (https://phabricator.wikimedia.org/T255776) (owner: 10Cwhite) [18:50:35] acmechief was refreshed in prod. 
gerrit1002/gerrit-test removed from it [18:51:02] also ran on acmechief-test [18:52:37] running puppet on gerrit servers which also removed acmecerts and refreshes apache2 [18:52:53] everything still looking fine [18:54:04] well, puppet broke on gerrit1002 but that's the one we are removing [18:54:44] (03PS1) 10Dzahn: gerrit: remove absented host key file for gerrit-test [puppet] - 10https://gerrit.wikimedia.org/r/610139 (https://phabricator.wikimedia.org/T239151) [18:55:28] https://gerrit.wikimedia.org and https://gerrit-replica.wikimedia.org still show good certs. [18:55:43] 10Operations, 10Desktop Improvements, 10Traffic, 10Performance-Team (Radar): Turn off cache for up to one week on test wikis for desktop improvements deployment - https://phabricator.wikimedia.org/T256750 (10Krinkle) >>! **Task description** > […] on test wikis […] > > * Which of these wikis […] hewiki, eu... [18:55:57] 10Operations, 10Desktop Improvements, 10Traffic, 10Performance-Team (Radar): Turn off cache for up to one week on test wikis for desktop improvements deployment - https://phabricator.wikimedia.org/T256750 (10Krinkle) a:05Krinkle→03None [18:56:02] (03CR) 10Dzahn: "puppet ran on all gerrit servers meanwhile" [puppet] - 10https://gerrit.wikimedia.org/r/610139 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [18:56:28] 10Operations, 10Desktop Improvements, 10Traffic, 10Performance-Team (Radar): Turn off CDN cache for up to one week on several wikis for desktop improvements deployment - https://phabricator.wikimedia.org/T256750 (10Krinkle) [18:57:36] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10MBeat33) Thanks to everyone working on this. I'd like to suggest this task be considered High priority, as we ha... [18:58:05] (03CR) 10Dzahn: "eh, of course I should have said "contint servers" here." 
[puppet] - 10https://gerrit.wikimedia.org/r/610139 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [18:59:44] qchris: interesting detail. the zuul/.ssh/known_hosts on contint. it has been cleaned up, gerrit-test is gone. so on contint1001 there is one host key left, the one for gerrit.wikimedia.org .. but on contint2001 there are 10 different keys in there [19:00:00] :-D [19:00:04] twentyafterfour and James_F: Dear deployers, time to do the Mediawiki train - American+European Version deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200707T1900). [19:01:14] (03CR) 10Herron: "PCC https://puppet-compiler.wmflabs.org/compiler1001/23752/" [puppet] - 10https://gerrit.wikimedia.org/r/610135 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [19:01:44] no "10 different keys" is wrong. it's the same key repeatedly [19:01:50] (03CR) 10QChris: [C: 04-1] "This change comes with a topic of `gerrit-cleanup`, so I had a look." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609891 (https://phabricator.wikimedia.org/T257257) (owner: 10Dzahn) [19:02:09] (03CR) 10Dzahn: [C: 03+2] gerrit: remove absented host key file for gerrit-test [puppet] - 10https://gerrit.wikimedia.org/r/610139 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [19:02:12] Nice. [19:03:12] i'll make a backup, move it, and run puppet [19:04:34] !log contint2001 - move /var/lib/zuul/.ssh/known_hosts to root and run puppet to recreate it [19:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:23] yep, that did it.
it's like on contint1001 now, just one key line [19:06:01] \o/ [19:06:39] (03CR) 10Dzahn: "> Patch Set 1: Code-Review-1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609891 (https://phabricator.wikimedia.org/T257257) (owner: 10Dzahn) [19:08:12] (03PS2) 10Dzahn: site/DHCP/partman: decom gerrit1002 [puppet] - 10https://gerrit.wikimedia.org/r/609879 (https://phabricator.wikimedia.org/T239151) [19:08:35] (03CR) 10QChris: "> Yea, the topic is an accident. Fixing it." [puppet] - 10https://gerrit.wikimedia.org/r/609891 (https://phabricator.wikimedia.org/T257257) (owner: 10Dzahn) [19:09:57] (03PS1) 1020after4: group0 wikis to 1.35.0-wmf.40 refs T256668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610141 [19:09:59] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.35.0-wmf.40 refs T256668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610141 (owner: 1020after4) [19:10:06] (03CR) 10Dzahn: httpbb: update test case for annual.wikimedia.org to 2019 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609891 (https://phabricator.wikimedia.org/T257257) (owner: 10Dzahn) [19:11:38] (03PS1) 10Dzahn: site: remove gerrit role from gerrit1002 [puppet] - 10https://gerrit.wikimedia.org/r/610142 (https://phabricator.wikimedia.org/T239151) [19:13:47] 10Operations, 10Core Platform Team, 10Traffic, 10serviceops, and 3 others: Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (10Krinkle) To traffic and SRE folks: Where are we now on this with regards to load and priority? Was the reduction "enough" that the other ideas... [19:13:57] mutante: gerrit1002 can get cleaned up as far as I can tell. I pinged t-hcipriani and he also said that he does not need anything backed up from that VM. [19:14:36] qchris: thanks for confirming. will remove the role. checking icinga. 
still in downtime so we won't have alerts when things stop [19:14:48] (03Merged) 10jenkins-bot: group0 wikis to 1.35.0-wmf.40 refs T256668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610141 (owner: 1020after4) [19:14:56] (03CR) 10Dzahn: [C: 03+2] site: remove gerrit role from gerrit1002 [puppet] - 10https://gerrit.wikimedia.org/r/610142 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [19:15:03] (03PS2) 10Dzahn: site: remove gerrit role from gerrit1002 [puppet] - 10https://gerrit.wikimedia.org/r/610142 (https://phabricator.wikimedia.org/T239151) [19:18:35] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.40 refs T256668 [19:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:41] T256668: 1.35.0-wmf.40 deployment blockers - https://phabricator.wikimedia.org/T256668 [19:19:14] (03PS3) 10Dzahn: site/DHCP/partman: decom gerrit1002 [puppet] - 10https://gerrit.wikimedia.org/r/609879 (https://phabricator.wikimedia.org/T239151) [19:20:30] 10Operations, 10Desktop Improvements, 10Traffic, 10Performance-Team (Radar): Turn off CDN cache for up to one week on several wikis for desktop improvements deployment - https://phabricator.wikimedia.org/T256750 (10Jdlrobson) @krinkle for background this ticket comes from a meeting with @BBlack who sugges... 
[19:24:13] (03CR) 10QChris: [C: 03+1] remove gerrit-test.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/609886 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [19:24:31] (03CR) 10QChris: [C: 03+1] remove gerrit1002.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/609887 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [19:25:19] (03CR) 10QChris: [C: 03+1] site/DHCP/partman: decom gerrit1002 [puppet] - 10https://gerrit.wikimedia.org/r/609879 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [19:26:28] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [19:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:56] !log destroying VM gerrit1002 - decom cookbook [19:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:07] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [19:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:15] 10Operations, 10Gerrit, 10vm-requests, 10Patch-For-Review: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `gerrit1002.wikimedia.org` - gerrit1002.wikimedia.org (**PASS**) - Downtim... 
[19:32:45] (03CR) 10Dzahn: [C: 03+2] site/DHCP/partman: decom gerrit1002 [puppet] - 10https://gerrit.wikimedia.org/r/609879 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [19:32:52] (03CR) 10Dzahn: [C: 03+2] "decom cookbook ran" [puppet] - 10https://gerrit.wikimedia.org/r/609879 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [19:33:59] (03CR) 10Dzahn: "removed from icinga" [puppet] - 10https://gerrit.wikimedia.org/r/609879 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [19:34:35] (03PS2) 10Dzahn: remove gerrit-test.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/609886 (https://phabricator.wikimedia.org/T239151) [19:35:25] (03CR) 10Dzahn: [C: 03+2] "The VM that used to host this has been decom'ed." [dns] - 10https://gerrit.wikimedia.org/r/609886 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [19:36:58] (03PS2) 10Dzahn: remove gerrit1002.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/609887 (https://phabricator.wikimedia.org/T239151) [19:39:33] (03CR) 10Dzahn: [C: 03+2] "host is down" [dns] - 10https://gerrit.wikimedia.org/r/609887 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [19:41:20] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (10Jclark-ctr) Confirm Confirmed: Service Request 1029100504 was successfully submitted. [19:41:58] (03CR) 10Dzahn: "@jcrespo Assigning to let you know this is ready. Wasn't sure if you'd prefer to merge it because ferm on db servers." [puppet] - 10https://gerrit.wikimedia.org/r/609884 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [19:43:47] 10Operations, 10Gerrit, 10vm-requests, 10Patch-For-Review: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10Dzahn) 05Open→03Resolved The VM has been removed from all places in the repos, puppet and DNS. The decom cookbook destroyed it and removed it from monitoring, pup... 
[19:46:43] (03CR) 10BPirkle: [C: 03+1] "Looks good, approved for self merge and deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610127 (https://phabricator.wikimedia.org/T256095) (owner: 10Krinkle) [19:47:17] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10Analytics-Radar: Renamed notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T256397 (10Jclark-ctr) @elukey notebook1004 to an-scheduler1001 i do not see this in netbox. either names. [19:51:02] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10Analytics-Radar: Renamed notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T256397 (10Dzahn) @Jclark-ctr They are [[ https://netbox.wikimedia.org/dcim/devices/210/ | device 210 ]] and [[ https://netbox.wikimedia.org/dcim/devices/702/ | devi... [19:52:24] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10Analytics-Radar: Renamed notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T256397 (10Jclark-ctr) @Dzahn sorry that was my mistake. Thanks! [19:52:26] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/610135 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [19:54:36] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production infrastructure services for jgiannelos - https://phabricator.wikimedia.org/T257187 (10Mholloway) TL;DR The list that @Jgiannelos provided in T257187#6280766 looks good to me. I did some digging, and @bearND and I were a... 
[19:54:45] (03PS2) 10Krinkle: Enable wgForceHTTPS and wgCookieSameSite='None' (Beta Cluster) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610127 (https://phabricator.wikimedia.org/T256095) [19:54:47] (03PS1) 10Krinkle: Enable wgForceHTTPS and wgCookieSameSite='None' (group0) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610147 (https://phabricator.wikimedia.org/T256095) [19:54:49] (03PS1) 10Krinkle: Enable wgForceHTTPS and wgCookieSameSite='None' (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610148 (https://phabricator.wikimedia.org/T256095) [19:54:51] (03PS1) 10Krinkle: Enable wgForceHTTPS and wgCookieSameSite='None' (group2; all wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610149 (https://phabricator.wikimedia.org/T256095) [19:56:00] (03CR) 10Krinkle: Enable wgForceHTTPS and wgCookieSameSite='None' (group2; all wikis) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610149 (https://phabricator.wikimedia.org/T256095) (owner: 10Krinkle) [20:00:13] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10Analytics-Radar: Renamed notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T256397 (10Jclark-ctr) 05Open→03Resolved Relabeled host and resolved ticket [20:06:37] 10Operations, 10Analytics-Radar, 10Traffic, 10Privacy: Connect Hadoop records of the same request coming via different channels - https://phabricator.wikimedia.org/T113817 (10Ottomata) 05Open→03Resolved a:03Ottomata Going to resolve this instead of declining. For EventBus generated events, including...
[20:14:01] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: connect to address 10.64.0.121 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:14:11] PROBLEM - Check size of conntrack table on kubernetes1001 is CRITICAL: connect to address 10.64.0.121 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:15:35] PROBLEM - Check systemd state on kubernetes1001 is CRITICAL: connect to address 10.64.0.121 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:15:55] PROBLEM - DPKG on kubernetes1001 is CRITICAL: connect to address 10.64.0.121 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:17:39] !log ppchelko@deploy1001 Started deploy [restbase/deploy@05b8bd5]: Remove restbase2009 [20:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:47] PROBLEM - Check size of conntrack table on kubernetes1003 is CRITICAL: connect to address 10.64.32.23 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:19:43] PROBLEM - configured eth on kubernetes1001 is CRITICAL: connect to address 10.64.0.121 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [20:20:17] PROBLEM - puppet last run on kubernetes1003 is CRITICAL: connect to address 10.64.32.23 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:20:25] PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes1003 is CRITICAL: connect to address 10.64.32.23 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [20:20:33] PROBLEM - Check systemd state on kubernetes1003 is CRITICAL: connect to address 10.64.32.23 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:21:05] PROBLEM - dhclient process on 
kubernetes1003 is CRITICAL: connect to address 10.64.32.23 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [20:21:53] PROBLEM - MD RAID on kubernetes1003 is CRITICAL: connect to address 10.64.32.23 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:22:17] PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes1001 is CRITICAL: connect to address 10.64.0.121 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [20:23:10] !log kubernetes1001 - starting nagios-nrpe-server [20:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:25] RECOVERY - Check size of conntrack table on kubernetes1001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:24:07] !log kubernetes1003 - starting nagios-nrpe-server [20:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:19] RECOVERY - Check size of conntrack table on kubernetes1003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:24:45] RECOVERY - Check systemd state on kubernetes1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:27:39] (03PS1) 10C. 
Scott Ananian: Explicitly set visualeditor-enable to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610156 (https://phabricator.wikimedia.org/T248343) [20:29:01] PROBLEM - Disk space on kubernetes1003 is CRITICAL: connect to address 10.64.32.23 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes1003&var-datasource=eqiad+prometheus/ops [20:29:38] 10Operations, 10Analytics-Radar, 10Traffic, 10Privacy: Connect Hadoop records of the same request coming via different channels - https://phabricator.wikimedia.org/T113817 (10Tgr) The task was about connecting webrequest data and MediaWiki API logs (or more generally, MediaWiki logs), though, and webreques... [20:29:51] PROBLEM - Check size of conntrack table on kubernetes1003 is CRITICAL: connect to address 10.64.32.23 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:31:33] RECOVERY - Check systemd state on kubernetes1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:31:41] RECOVERY - Check size of conntrack table on kubernetes1003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:31:54] 10Operations, 10Analytics-Radar, 10Traffic, 10Privacy: Connect Hadoop records of the same request coming via different channels - https://phabricator.wikimedia.org/T113817 (10Ottomata) 05Resolved→03Open Hm, you are right, but that is not clear from the task description. I'll edit it and leave open. I... 
[20:31:57] RECOVERY - puppet last run on kubernetes1003 is OK: OK: Puppet is currently enabled, last run 23 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:32:07] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@05b8bd5]: Remove restbase2009 (duration: 14m 28s) [20:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:12] !log ppchelko@deploy1001 Started deploy [restbase/deploy@05b8bd5]: Remove restbase2009, take 2 [20:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:21] 10Operations, 10Analytics-Radar, 10Traffic, 10Privacy: Add request_id to webrequest logs as well as other event records ingested into Hadoop - https://phabricator.wikimedia.org/T113817 (10Ottomata) [20:32:43] RECOVERY - MD RAID on kubernetes1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:32:44] 10Operations, 10Analytics-Radar, 10Traffic, 10Privacy: Add request_id to webrequest logs as well as other event records ingested into Hadoop - https://phabricator.wikimedia.org/T113817 (10Ottomata) [20:33:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): decom cloudvirt1015 - https://phabricator.wikimedia.org/T257366 (10Andrew) [20:35:19] (03PS1) 10Ottomata: Add wgEventServiceDefault to refactor EventBus event stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610160 (https://phabricator.wikimedia.org/T229863) [20:38:04] 10Operations, 10Icinga, 10observability: move icinga contacts file to public repo - https://phabricator.wikimedia.org/T164238 (10Dzahn) This _might_ be moot now if we handle all the contact info in VictorOps. But what about people who are not SRE and have icinga contacts? 
[20:41:27] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@05b8bd5]: Remove restbase2009, take 2 (duration: 09m 15s)
[20:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:43] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Traffic, and 3 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (DStrine)
[20:43:07] RECOVERY - puppet last run on kubernetes1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:45:51] (CR) Ppchelko: [C: +1] Add wgEventServiceDefault to refactor EventBus event stream config [mediawiki-config] - https://gerrit.wikimedia.org/r/610160 (https://phabricator.wikimedia.org/T229863) (owner: Ottomata)
[20:46:45] RECOVERY - DPKG on kubernetes1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[20:49:51] RECOVERY - Disk space on kubernetes1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes1003&var-datasource=eqiad+prometheus/ops
[20:50:31] RECOVERY - configured eth on kubernetes1001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[20:51:13] RECOVERY - Check the NTP synchronisation status of timesyncd on kubernetes1003 is OK: OK: synced at Tue 2020-07-07 20:51:12 UTC. https://wikitech.wikimedia.org/wiki/NTP
[20:51:55] RECOVERY - dhclient process on kubernetes1003 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[20:53:07] RECOVERY - Check the NTP synchronisation status of timesyncd on kubernetes1001 is OK: OK: synced at Tue 2020-07-07 20:53:05 UTC.
https://wikitech.wikimedia.org/wiki/NTP
[21:07:31] !log andrew@deploy1001 Started deploy [horizon/deploy@abcd051]: further fixes for proxy editing --bug 610130
[21:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:10:57] !log andrew@deploy1001 Finished deploy [horizon/deploy@abcd051]: further fixes for proxy editing --bug 610130 (duration: 03m 26s)
[21:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:41] !log andrew@deploy1001 Started deploy [horizon/deploy@fce8183]: further fixes for proxy editing --bug 610130
[21:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:26:03] Operations, ops-eqiad, DC-Ops, netops, cloud-services-team (Hardware): (Need By: 2020-06-12) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (Jclark-ctr) updated netbox with cable id`s
[21:29:16] !log andrew@deploy1001 Finished deploy [horizon/deploy@fce8183]: further fixes for proxy editing --bug 610130 (duration: 03m 35s)
[21:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:06] Operations, ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (Jclark-ctr) @akosiaris I will be on site tomorrow also if host is available to do 1 day earlier
[21:33:53] (PS1) Bstorm: paws: add project to our prometheus alert-manager system [puppet] - https://gerrit.wikimedia.org/r/610175 (https://phabricator.wikimedia.org/T256361)
[22:09:46] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (Edtadros) Hi @ssingh, I signed it. Thanks in advance!
[22:24:56] Operations, Analytics-Radar, Traffic, Privacy: Add request_id to webrequest logs as well as other event records ingested into Hadoop - https://phabricator.wikimedia.org/T113817 (Tgr) >>! In T113817#6287140, @Ottomata wrote: > Hm, you are right, but that is not clear from the task description. I...
[22:35:23] (CR) Dzahn: [C: +2] annualreport: update redirect from 2018 to 2019 report [puppet] - https://gerrit.wikimedia.org/r/609888 (https://phabricator.wikimedia.org/T257257) (owner: Dzahn)
[22:38:45] Operations, WMF-Annual-Report, Patch-For-Review: Update annual.wikimedia.org redirect to point to 2019 Annual Report - https://phabricator.wikimedia.org/T257257 (Dzahn) a:Dzahn
[22:40:50] Operations, WMF-Annual-Report, Patch-For-Review: Update annual.wikimedia.org redirect to point to 2019 Annual Report - https://phabricator.wikimedia.org/T257257 (Dzahn) @spatton This is done. You can see the actual code change at https://gerrit.wikimedia.org/r/c/operations/puppet/+/609888/2/module...
[22:41:32] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to production infrastructure services for jgiannelos - https://phabricator.wikimedia.org/T257187 (thcipriani) >>! In T257187#6286933, @Mholloway wrote: > I did some digging, and @bearND and I were added to the `deployment` group in...
[22:41:48] !log new Wikimedia Annual Report 2019 now available on annual.wikimedia.org
[22:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:44:48] (CR) Dzahn: [C: +2] httpbb: update test case for annual.wikimedia.org to 2019 [puppet] - https://gerrit.wikimedia.org/r/609891 (https://phabricator.wikimedia.org/T257257) (owner: Dzahn)
[22:47:31] (CR) Dzahn: "[cumin1001:~] $ httpbb /srv/deployment/httpbb-tests/test_miscweb.yaml --hosts=miscweb1002.eqiad.wmnet" [puppet] - https://gerrit.wikimedia.org/r/609891 (https://phabricator.wikimedia.org/T257257) (owner: Dzahn)
[22:51:07] it's the cloud again
[22:51:23] stuff is too centralized
[22:53:08] (CR) Bstorm: [C: +1] "So, this seems good and it works ok in testing. However, one side-effect is that Toolsbeta k8s proxy is effectively broken. It tries to go" [software/tools-webservice] - https://gerrit.wikimedia.org/r/609855 (https://phabricator.wikimedia.org/T234617) (owner: BryanDavis)
[22:54:03] (CR) Bstorm: [C: +1] "This should work because it deletes by labels, so I think it'll tidy up after itself." [software/tools-webservice] - https://gerrit.wikimedia.org/r/609856 (https://phabricator.wikimedia.org/T234617) (owner: BryanDavis)
[22:56:03] (CR) Bstorm: [C: +1] "Kill it with 🔥" [software/tools-webservice] - https://gerrit.wikimedia.org/r/609857 (https://phabricator.wikimedia.org/T257229) (owner: BryanDavis)
[23:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200707T2300).
[23:00:16] (CR) BryanDavis: [C: +2] Remove --canonical argument to webservice [software/tools-webservice] - https://gerrit.wikimedia.org/r/609855 (https://phabricator.wikimedia.org/T234617) (owner: BryanDavis)
[23:00:27] (CR) BryanDavis: [C: +2] kubernetes: remove legacy ingress generation [software/tools-webservice] - https://gerrit.wikimedia.org/r/609856 (https://phabricator.wikimedia.org/T234617) (owner: BryanDavis)
[23:00:45] (CR) BryanDavis: [C: +2] Remove $HOME/.webservicerc support [software/tools-webservice] - https://gerrit.wikimedia.org/r/609857 (https://phabricator.wikimedia.org/T257229) (owner: BryanDavis)
[23:00:58] (Merged) jenkins-bot: Remove --canonical argument to webservice [software/tools-webservice] - https://gerrit.wikimedia.org/r/609855 (https://phabricator.wikimedia.org/T234617) (owner: BryanDavis)
[23:01:05] (Merged) jenkins-bot: kubernetes: remove legacy ingress generation [software/tools-webservice] - https://gerrit.wikimedia.org/r/609856 (https://phabricator.wikimedia.org/T234617) (owner: BryanDavis)
[23:01:12] (Merged) jenkins-bot: Remove $HOME/.webservicerc support [software/tools-webservice] - https://gerrit.wikimedia.org/r/609857 (https://phabricator.wikimedia.org/T257229) (owner: BryanDavis)
[23:09:06] (PS1) BryanDavis: d/changelog: prepare for 0.73 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/610181
[23:09:53] (CR) BryanDavis: [C: +2] d/changelog: prepare for 0.73 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/610181 (owner: BryanDavis)
[23:10:49] (Merged) jenkins-bot: d/changelog: prepare for 0.73 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/610181 (owner: BryanDavis)
[23:45:37] (PS1) Bstorm: tools-prometheus: set up prometheus to get paws metrics [puppet] - https://gerrit.wikimedia.org/r/610189 (https://phabricator.wikimedia.org/T256361)
[23:46:53] (CR) jerkins-bot: [V: -1]
tools-prometheus: set up prometheus to get paws metrics [puppet] - https://gerrit.wikimedia.org/r/610189 (https://phabricator.wikimedia.org/T256361) (owner: Bstorm)
[23:47:36] Operations, WMF-Annual-Report: Update annual.wikimedia.org redirect to point to 2019 Annual Report - https://phabricator.wikimedia.org/T257257 (Dzahn) Open→Resolved
[23:47:38] Operations, WMF-Annual-Report, serviceops: Redirects for 2018 Annual Report - https://phabricator.wikimedia.org/T226066 (Dzahn)
[23:51:20] (PS2) Bstorm: tools-prometheus: set up prometheus to get paws metrics [puppet] - https://gerrit.wikimedia.org/r/610189 (https://phabricator.wikimedia.org/T256361)