[00:00:04] twentyafterfour: (Dis)respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200903T0000). Please do the needful.
[00:00:39] (CR) jerkins-bot: [V: -1] prometheus: replace remaining hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/623666 (owner: Dzahn)
[00:14:51] Operations, SRE-Access-Requests: Requesting access to deployment for holger - https://phabricator.wikimedia.org/T261754 (Dzahn) Open→Resolved @holger.knust You have been upgraded from "restricted" to "deployment". I ran puppet on deploy1001.eqiad.wmnet and saw it make the change. On all other ho...
[00:15:23] Operations, SRE-Access-Requests: Requesting access to deployment for holger - https://phabricator.wikimedia.org/T261754 (Dzahn)
[00:21:31] (PS5) Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/623666
[00:38:30] (PS6) Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/623666
[00:39:31] (CR) jerkins-bot: [V: -1] prometheus: replace remaining hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/623666 (owner: Dzahn)
[00:41:38] (PS7) Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/623666
[00:44:37] (PS8) Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/623666
[01:16:36] (PS4) Dzahn: cache::base: replace hiera() with lookup(), add data types [puppet] - https://gerrit.wikimedia.org/r/623662
[01:22:25] (CR) Dzahn: "result of compiling on "C:profile::cache::base": https://puppet-compiler.wmflabs.org/compiler1001/24906/" [puppet] - https://gerrit.wikimedia.org/r/623662 (owner: Dzahn)
[03:25:05] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:26:59] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:12:29] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 85 probes of 565 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:18:21] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 51 probes of 565 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:53:00] Operations, Traffic, Wikimedia-General-or-Unknown, User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (Ammarpad)
[05:41:37] Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (Marostegui)
[05:54:20] (PS2) Jcrespo: Remove wmfbackups from wmfmariadbpy repo [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/623752
[05:56:35] (CR) Jcrespo: "> Patch Set 1:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/623752 (owner: Jcrespo)
[06:04:33] (CR) Jcrespo: "My guess it is a leftover of when we used to deploy the grants locally.
I removed that functionality because it was a security concern (pl" [puppet] - https://gerrit.wikimedia.org/r/623623 (https://phabricator.wikimedia.org/T260843) (owner: Bstorm)
[06:24:17] !log Disconnect eqiad -> codfw replication
[06:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:49] (PS5) Jcrespo: Remove wmfmariadbpy code from wmfbackups repo [software/wmfbackups] - https://gerrit.wikimedia.org/r/623751
[06:35:21] (PS4) KartikMistry: Add --notify-age-in-days option to notify users before draft purge [puppet] - https://gerrit.wikimedia.org/r/622528 (https://phabricator.wikimedia.org/T261189)
[06:36:39] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:40:21] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:43:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2120 T261869', diff saved to https://phabricator.wikimedia.org/P12441 and previous config saved to /var/cache/conftool/dbconfig/20200903-064334-marostegui.json
[06:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:41] T261869: db2120 & db2121 crashed - https://phabricator.wikimedia.org/T261869
[06:44:48] (CR) Volans: [C: -1] "Signature doesn't match, see inline."
(3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239) (owner: Ryan Kemper)
[06:46:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2120 T261869', diff saved to https://phabricator.wikimedia.org/P12442 and previous config saved to /var/cache/conftool/dbconfig/20200903-064623-marostegui.json
[06:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:47] Operations: Expired puppet certificates - https://phabricator.wikimedia.org/T260110 (Volans) Open→Resolved a:Volans @Dzahn ack
[06:46:49] Operations: Expired puppet certificates - https://phabricator.wikimedia.org/T260110 (Volans)
[06:48:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2120 T261869', diff saved to https://phabricator.wikimedia.org/P12443 and previous config saved to /var/cache/conftool/dbconfig/20200903-064804-marostegui.json
[06:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db2120 T261869', diff saved to https://phabricator.wikimedia.org/P12444 and previous config saved to /var/cache/conftool/dbconfig/20200903-065105-marostegui.json
[06:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:12] T261869: db2120 & db2121 crashed - https://phabricator.wikimedia.org/T261869
[06:52:43] (CR) Giuseppe Lavagetto: [C: +2] Convert mobileapps to the new layout [deployment-charts] - https://gerrit.wikimedia.org/r/623739 (owner: Giuseppe Lavagetto)
[06:54:05] (Merged) jenkins-bot: Convert mobileapps to the new layout [deployment-charts] - https://gerrit.wikimedia.org/r/623739 (owner: Giuseppe Lavagetto)
[06:54:41] (CR) Hashar: "> noop on deploy1001 and compiler showed noop on all these other hosts using scap classes" [puppet] - https://gerrit.wikimedia.org/r/623078 (owner: Dzahn)
[06:54:56] going
to restart Jenkins CI
[06:56:20] <_joe_> !log deployment of mobileapps to pick up changes to envoy config, new helmfile layout
[06:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:28] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[06:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:33] !log contint2001: restarting CI Jenkins
[06:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2121 T261869', diff saved to https://phabricator.wikimedia.org/P12445 and previous config saved to /var/cache/conftool/dbconfig/20200903-070104-marostegui.json
[07:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:12] T261869: db2120 & db2121 crashed - https://phabricator.wikimedia.org/T261869
[07:02:26] !log Stop db2100:3317 and db2121 in sync to reload metawiki.content T261869
[07:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:14] Operations, Performance-Team, serviceops, Datacenter-Switchover: Unexplained increase in save times, possibly associated with DC switchover - https://phabricator.wikimedia.org/T261763 (jcrespo) Save times keep being back to previous levels, aproximatelly. For historical purposes, was something mo...
[07:14:04] Operations, Analytics: eventgate-main latencies very high since the failover to codfw - https://phabricator.wikimedia.org/T261846 (elukey) Reporting the SAL entries that we mistakenly logged to another task: Mentioned in SAL (#wikimedia-operations) [2020-09-02T14:28:58Z] execute kafka topics --...
[07:16:37] Operations: docker-registry.wikimedia.org/golang:1.11 should no more depends on stretch-backports - https://phabricator.wikimedia.org/T261920 (hashar)
[07:17:01] Operations: docker-registry.wikimedia.org/golang:1.11 should no more depends on stretch-backports - https://phabricator.wikimedia.org/T261920 (hashar)
[07:17:03] Operations, Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (hashar)
[07:17:35] Operations, Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (hashar)
[07:18:07] Operations, Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (hashar) The CI containers maintained at integration/config.git no more rely on `stretch-backports`. In 1122fbdf11e1f681283db19a6b57415eab5875f3 I went to delete `/etc/apt/sources.list.d/backport...
[07:18:39] !log Stop slave on s8 eqiad master (lag will appear on s8 eqiad) - T237120
[07:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:45] T237120: Schema change on production for increase the size of wbt_text_in_lang.wbxl_language - https://phabricator.wikimedia.org/T237120
[07:18:58] (PS1) Giuseppe Lavagetto: mobileapps: fix staging values [deployment-charts] - https://gerrit.wikimedia.org/r/623917
[07:19:27] Operations: docker-registry.wikimedia.org/golang:1.11 should no more depends on stretch-backports - https://phabricator.wikimedia.org/T261920 (hashar) I don't quite know where golang:1.11 is used. The only use case I found was to build pebble for the CI tox-acme-chief image. That got moved to golang:1.13 vi...
[07:19:28] (CR) Giuseppe Lavagetto: [C: +2] mobileapps: fix staging values [deployment-charts] - https://gerrit.wikimedia.org/r/623917 (owner: Giuseppe Lavagetto)
[07:19:46] !log Deploy schema change on s8 eqiad master T237120
[07:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:28] (CR) jerkins-bot: [V: -1] mobileapps: fix staging values [deployment-charts] - https://gerrit.wikimedia.org/r/623917 (owner: Giuseppe Lavagetto)
[07:24:21] !log contint2001: restarting CI Jenkins for plugins upgrade
[07:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:16] Operations, Performance-Team, serviceops, Datacenter-Switchover: Unexplained increase in save times, possibly associated with DC switchover - https://phabricator.wikimedia.org/T261763 (elukey) Everything started by this graph: https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?viewPanel=41&orgI...
[07:26:48] (CR) Giuseppe Lavagetto: [C: +2] "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/623917 (owner: Giuseppe Lavagetto)
[07:27:12] (CR) Hashar: "recheck CI config has been deployed https://gerrit.wikimedia.org/r/c/integration/config/+/623915" [software/wmfbackups] - https://gerrit.wikimedia.org/r/623751 (owner: Jcrespo)
[07:27:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2121 T261869', diff saved to https://phabricator.wikimedia.org/P12446 and previous config saved to /var/cache/conftool/dbconfig/20200903-072716-marostegui.json
[07:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:23] T261869: db2120 & db2121 crashed - https://phabricator.wikimedia.org/T261869
[07:27:45] (CR) Hashar: "recheck CI config has been deployed https://gerrit.wikimedia.org/r/c/integration/config/+/623915" [software/wmfbackups] - https://gerrit.wikimedia.org/r/623532 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[07:27:48] (CR) jerkins-bot: [V: -1] Remove wmfmariadbpy code from wmfbackups repo [software/wmfbackups] - https://gerrit.wikimedia.org/r/623751 (owner: Jcrespo)
[07:28:49] (Merged) jenkins-bot: mobileapps: fix staging values [deployment-charts] - https://gerrit.wikimedia.org/r/623917 (owner: Giuseppe Lavagetto)
[07:28:53] (CR) jerkins-bot: [V: -1] mariadb-backups: Fix missing check on no ERROR msgs on logs [software/wmfbackups] - https://gerrit.wikimedia.org/r/623532 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[07:28:57] (CR) Hashar: "recheck CI config has been deployed https://gerrit.wikimedia.org/r/c/integration/config/+/623915" [software/wmfbackups] - https://gerrit.wikimedia.org/r/623754 (https://phabricator.wikimedia.org/T165358) (owner: Jcrespo)
[07:29:12] (CR) Jcrespo: "Known issue, dependency reference of wmfmariadbpy works locally but not on CI." [software/wmfbackups] - https://gerrit.wikimedia.org/r/623751 (owner: Jcrespo)
[07:29:52] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[07:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2121 T261869', diff saved to https://phabricator.wikimedia.org/P12447 and previous config saved to /var/cache/conftool/dbconfig/20200903-073116-marostegui.json
[07:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:23] (PS2) Giuseppe Lavagetto: mobileapps: use the reserved port for TLS [deployment-charts] - https://gerrit.wikimedia.org/r/623740
[07:34:25] (PS2) Giuseppe Lavagetto: Convert cxserver to the new layout [deployment-charts] - https://gerrit.wikimedia.org/r/623749 (https://phabricator.wikimedia.org/T258572)
[07:35:10] (PS1) Giuseppe Lavagetto: mobileapps: fix nontls release name typo [deployment-charts] - https://gerrit.wikimedia.org/r/623920
[07:35:27] (CR) Giuseppe Lavagetto: [V: +2 C: +2] mobileapps: fix nontls release name typo [deployment-charts] - https://gerrit.wikimedia.org/r/623920 (owner: Giuseppe Lavagetto)
[07:37:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2121 T261869', diff saved to https://phabricator.wikimedia.org/P12448 and previous config saved to /var/cache/conftool/dbconfig/20200903-073718-marostegui.json
[07:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:25] T261869: db2120 & db2121 crashed - https://phabricator.wikimedia.org/T261869
[07:38:33] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[07:38:33] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
[07:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:26] (PS6) Jcrespo: Remove wmfmariadbpy code from wmfbackups repo [software/wmfbackups] - https://gerrit.wikimedia.org/r/623751
[07:42:33] (CR) Jcrespo: [C: +2] Remove wmfmariadbpy code from wmfbackups repo [software/wmfbackups] - https://gerrit.wikimedia.org/r/623751 (owner: Jcrespo)
[07:42:58] (PS7) Jcrespo: Remove wmfmariadbpy code from wmfbackups repo [software/wmfbackups] - https://gerrit.wikimedia.org/r/623751
[07:44:20] (PS2) Jcrespo: mariadb-backups: Fix missing check on no ERROR msgs on logs [software/wmfbackups] - https://gerrit.wikimedia.org/r/623532 (https://phabricator.wikimedia.org/T138562)
[07:44:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db2121 T261869', diff saved to https://phabricator.wikimedia.org/P12449 and previous config saved to /var/cache/conftool/dbconfig/20200903-074426-marostegui.json
[07:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:32] T261869: db2120 & db2121 crashed - https://phabricator.wikimedia.org/T261869
[07:44:59] (CR) Jcrespo: [C: +2] mariadb-backups: Fix missing check on no ERROR msgs on logs [software/wmfbackups] - https://gerrit.wikimedia.org/r/623532 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[07:45:11] !log Upgrade and reboot db1094
[07:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:27] !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[07:45:27] !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
[07:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:31] (Merged) jenkins-bot: mariadb-backups: Fix missing check on no ERROR msgs on logs [software/wmfbackups] - https://gerrit.wikimedia.org/r/623532 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[07:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2086:3317 T261917', diff saved to https://phabricator.wikimedia.org/P12450 and previous config saved to /var/cache/conftool/dbconfig/20200903-074827-marostegui.json
[07:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:34] T261917: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917
[07:48:49] (PS1) Jon Harald Søby: Add extra namespaces for jawikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/623921 (https://phabricator.wikimedia.org/T260320)
[07:49:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2086:3318', diff saved to https://phabricator.wikimedia.org/P12451 and previous config saved to /var/cache/conftool/dbconfig/20200903-074922-marostegui.json
[07:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:51] Operations, ops-eqiad, Analytics-Radar, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (ayounsi) a:ayounsi→Cmjohnson >>! In T259071#6430943, @Cmjohnson wrote: > @ayounsi can you add the analytics vlan to cloud...
[07:52:24] (CR) Urbanecm: [C: +1] "LGTM, I'll deploy that soon."
[mediawiki-config] - https://gerrit.wikimedia.org/r/623921 (https://phabricator.wikimedia.org/T260320) (owner: Jon Harald Søby)
[07:54:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2086:3318', diff saved to https://phabricator.wikimedia.org/P12452 and previous config saved to /var/cache/conftool/dbconfig/20200903-075443-marostegui.json
[07:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:04] Operations, ops-eqiad, netops: (Need by: 2019-09-30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (ayounsi) Resolved→Open console0 and me0 (mgmt) still show as not connected.
[08:00:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2086:3318', diff saved to https://phabricator.wikimedia.org/P12453 and previous config saved to /var/cache/conftool/dbconfig/20200903-080024-marostegui.json
[08:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2086:3317', diff saved to https://phabricator.wikimedia.org/P12454 and previous config saved to /var/cache/conftool/dbconfig/20200903-080634-marostegui.json
[08:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:46] !log Upgrade and reboot db1127
[08:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2086:3318', diff saved to https://phabricator.wikimedia.org/P12455 and previous config saved to /var/cache/conftool/dbconfig/20200903-080714-marostegui.json
[08:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2086:3317', diff saved to https://phabricator.wikimedia.org/P12456 and previous config saved to
/var/cache/conftool/dbconfig/20200903-081337-marostegui.json
[08:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:36] (PS1) JMeybohm: configmaster: add helm-charts to disc_desired_state.py [puppet] - https://gerrit.wikimedia.org/r/623958 (https://phabricator.wikimedia.org/T253843)
[08:15:00] (PS2) Jcrespo: backup_mariadb: Use path to find backup_mariadb.py [software/wmfbackups] - https://gerrit.wikimedia.org/r/623754 (https://phabricator.wikimedia.org/T165358)
[08:15:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1101:3318, db1101:3317', diff saved to https://phabricator.wikimedia.org/P12457 and previous config saved to /var/cache/conftool/dbconfig/20200903-081503-marostegui.json
[08:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db2086:3318', diff saved to https://phabricator.wikimedia.org/P12458 and previous config saved to /var/cache/conftool/dbconfig/20200903-081543-marostegui.json
[08:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:18] !log Upgrade db1101 (s7 and s8)
[08:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:08] (CR) JMeybohm: [C: -1] "> Are there other states besides lvs_setup, monitoring_setup and production?"
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/623464 (owner: Dzahn)
[08:20:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2086:3317', diff saved to https://phabricator.wikimedia.org/P12459 and previous config saved to /var/cache/conftool/dbconfig/20200903-082034-marostegui.json
[08:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:02] (Abandoned) Hashar: DO NOT MERGE apache tweak for doc.wm.o [puppet] - https://gerrit.wikimedia.org/r/620928 (owner: Hashar)
[08:26:33] (PS11) JMeybohm: sre.discovery: Refactor [cookbooks] - https://gerrit.wikimedia.org/r/621721 (https://phabricator.wikimedia.org/T260663)
[08:26:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db2086:3317', diff saved to https://phabricator.wikimedia.org/P12460 and previous config saved to /var/cache/conftool/dbconfig/20200903-082655-marostegui.json
[08:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:54] (PS3) DCausse: [cirrusdumps] use temp dir and add better error handling [puppet] - https://gerrit.wikimedia.org/r/623783 (https://phabricator.wikimedia.org/T260986)
[08:28:45] !log rebooting mwmaint1002 for kernel update
[08:28:47] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[08:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2122 T261917', diff saved to https://phabricator.wikimedia.org/P12461 and previous config saved to /var/cache/conftool/dbconfig/20200903-082956-marostegui.json
[08:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:05] T261917: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917
[08:30:06] Critical Alert for device cr1-eqiad.wikimedia.org - Traffic bill
over quota got acknowledged
[08:30:56] (CR) Kormat: [C: +1] "> Patch Set 2:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/623752 (owner: Jcrespo)
[08:31:36] (PS4) JMeybohm: helmfile: refactor eventgate-analytics [deployment-charts] - https://gerrit.wikimedia.org/r/621286 (https://phabricator.wikimedia.org/T258572)
[08:31:56] (CR) DCausse: "tested on deployment-prep and it worked well:" [puppet] - https://gerrit.wikimedia.org/r/623783 (https://phabricator.wikimedia.org/T260986) (owner: DCausse)
[08:32:26] (CR) JMeybohm: helmfile: refactor eventgate-analytics (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/621286 (https://phabricator.wikimedia.org/T258572) (owner: JMeybohm)
[08:33:55] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[08:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:38] (CR) Filippo Giunchedi: [C: +2] swift: extend ferm rules to cover more ports [puppet] - https://gerrit.wikimedia.org/r/623769 (https://phabricator.wikimedia.org/T261633) (owner: Filippo Giunchedi)
[08:34:58] (CR) Filippo Giunchedi: [C: +2] Add ms-be2057 to swift firewall [puppet] - https://gerrit.wikimedia.org/r/623779 (https://phabricator.wikimedia.org/T261633) (owner: Filippo Giunchedi)
[08:35:00] Operations, ops-eqiad, DC-Ops: (Need By:2020-08-17) label/setup/install pki1001 - https://phabricator.wikimedia.org/T259826 (jbond) @Cmjohnson. Sorry I had assumed it was similar enough to the pki2001 server. could we use the following recipe instead then git/puppet/modules/install_server/files/aut...
[08:35:06] (PS2) Filippo Giunchedi: Add ms-be2057 to swift firewall [puppet] - https://gerrit.wikimedia.org/r/623779 (https://phabricator.wikimedia.org/T261633)
[08:41:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2122', diff saved to https://phabricator.wikimedia.org/P12462 and previous config saved to /var/cache/conftool/dbconfig/20200903-084147-marostegui.json
[08:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1090:3317, db1090:3312', diff saved to https://phabricator.wikimedia.org/P12463 and previous config saved to /var/cache/conftool/dbconfig/20200903-084358-marostegui.json
[08:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:39] (CR) Jbond: "Looks good but see comment inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/623666 (owner: Dzahn)
[08:48:08] Operations, Maps, Product-Infrastructure-Team-Backlog, Traffic, Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (Elitre)
[08:48:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1090:3312', diff saved to https://phabricator.wikimedia.org/P12464 and previous config saved to /var/cache/conftool/dbconfig/20200903-084836-marostegui.json
[08:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2122', diff saved to https://phabricator.wikimedia.org/P12465 and previous config saved to /var/cache/conftool/dbconfig/20200903-084910-marostegui.json
[08:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:32] Operations, Maps, Product-Infrastructure-Team-Backlog, Traffic, Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (Elitre)
[08:54:08] (PS1) Filippo Giunchedi: statsd_exporter: stop tracking local statsd connections [puppet] - https://gerrit.wikimedia.org/r/623966 (https://phabricator.wikimedia.org/T261633)
[08:57:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2122', diff saved to https://phabricator.wikimedia.org/P12466 and previous config saved to /var/cache/conftool/dbconfig/20200903-085708-marostegui.json
[08:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:29] Operations, Performance-Team, serviceops, Datacenter-Switchover: Unexplained increase in save times, possibly associated with DC switchover - https://phabricator.wikimedia.org/T261763 (Joe) We have moved restbase-async to eqiad again, as the load was still too high. We might have to consider expa...
[08:58:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1090:3317', diff saved to https://phabricator.wikimedia.org/P12467 and previous config saved to /var/cache/conftool/dbconfig/20200903-085838-marostegui.json
[08:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:01] (CR) Jbond: "minor nit, see inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/623662 (owner: Dzahn)
[09:00:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1098:3317, db1098:3316', diff saved to https://phabricator.wikimedia.org/P12468 and previous config saved to /var/cache/conftool/dbconfig/20200903-090007-marostegui.json
[09:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:51] !log force ae2.1118 VRRP master on cr1-eqiad - T261866
[09:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:56] T261866: Route cloud-hosts1-b-eqiad vlan through cloudsw - https://phabricator.wikimedia.org/T261866
[09:04:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1098:3316', diff saved to
https://phabricator.wikimedia.org/P12469 and previous config saved to /var/cache/conftool/dbconfig/20200903-090419-marostegui.json
[09:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:19] !log move vlan 1118 from ae2.1118 to xe-3/0/4.1118 cr2-eqiad - T261866
[09:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db2122', diff saved to https://phabricator.wikimedia.org/P12470 and previous config saved to /var/cache/conftool/dbconfig/20200903-090901-marostegui.json
[09:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:04] XioNoX: uh, bunch of cloud hosts are alerting as down
[09:11:27] legoktm: ok, rolling back
[09:11:35] they're not showing up here?
[09:11:47] they're in #wikimedia-cloud-feed
[09:11:50] well they're coming back now
[09:11:54] legoktm: rolled back, should be good now
[09:12:18] https://paste.debian.net/1162409/ for reference
[09:12:24] ack
[09:13:01] thx!
[09:13:30] !log rolled back: move vlan 1118 from ae2.1118 to xe-3/0/4.1118 cr2-eqiad - T261866
[09:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:38] T261866: Route cloud-hosts1-b-eqiad vlan through cloudsw - https://phabricator.wikimedia.org/T261866
[09:14:40] (CR) Hnowlan: [C: +2] api-gateway: Add mappings for ratelimit service [deployment-charts] - https://gerrit.wikimedia.org/r/623624 (https://phabricator.wikimedia.org/T254910) (owner: Hnowlan)
[09:16:36] (Merged) jenkins-bot: api-gateway: Add mappings for ratelimit service [deployment-charts] - https://gerrit.wikimedia.org/r/623624 (https://phabricator.wikimedia.org/T254910) (owner: Hnowlan)
[09:18:27] PROBLEM - VRRP status on cr1-eqiad is CRITICAL: VRRP CRITICAL - 2 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[09:18:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1098:3317', diff saved to https://phabricator.wikimedia.org/P12471 and previous config saved to /var/cache/conftool/dbconfig/20200903-091834-marostegui.json
[09:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2087:3316 db2087:3317 T261917', diff saved to https://phabricator.wikimedia.org/P12472 and previous config saved to /var/cache/conftool/dbconfig/20200903-092028-marostegui.json
[09:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:36] T261917: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917
[09:24:23] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.01385 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[09:25:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2087:3316', diff
saved to https://phabricator.wikimedia.org/P12473 and previous config saved to /var/cache/conftool/dbconfig/20200903-092549-marostegui.json [09:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:00] weird I see 3 nodes failures only [09:26:20] 0.01385 ge 0.01 [09:28:12] 10Operations, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10JMeybohm) [09:28:29] RECOVERY - VRRP status on cr1-eqiad is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [09:29:18] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [09:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:43] (03PS1) 10Jbond: cfssl: convert mapped Hash back to a hash [puppet] - 10https://gerrit.wikimedia.org/r/623974 [09:30:37] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog2001.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [09:31:21] (03PS2) 10Jbond: cfssl: convert mapped Hash back to a hash [puppet] - 10https://gerrit.wikimedia.org/r/623974 [09:31:26] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [09:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:00] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:59] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [09:33:12] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventstreams to use TLS only - https://phabricator.wikimedia.org/T255874 (10JMeybohm) [09:33:33] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventstreams to use TLS only - https://phabricator.wikimedia.org/T255874 (10JMeybohm) 05Open→03Resolved a:03JMeybohm All done here. [09:33:36] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10JMeybohm) [09:34:47] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventgate-analytics-external to use TLS only - https://phabricator.wikimedia.org/T255871 (10JMeybohm) [09:34:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2087:3316', diff saved to https://phabricator.wikimedia.org/P12474 and previous config saved to /var/cache/conftool/dbconfig/20200903-093454-marostegui.json [09:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:06] (03CR) 10Jcrespo: "> Patch Set 2: Code-Review+1" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/623752 (owner: 10Jcrespo) [09:35:23] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventgate-logging-external to use TLS only - https://phabricator.wikimedia.org/T255872 (10JMeybohm) [09:35:53] (03PS1) 10Arturo Borrero Gonzalez: nftables::file: break puppet dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/623978 [09:36:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2087:3317', diff saved to https://phabricator.wikimedia.org/P12475 and previous config saved to /var/cache/conftool/dbconfig/20200903-093629-marostegui.json [09:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:18] !log move vlan 1118 
IPv6 from ae2.1118 to xe-3/0/4.1118 cr2-eqiad - T261866 [09:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:24] T261866: Route cloud-hosts1-b-eqiad vlan through cloudsw - https://phabricator.wikimedia.org/T261866 [09:39:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] nftables::file: break puppet dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/623978 (owner: 10Arturo Borrero Gonzalez) [09:40:41] arturo: ok, this time it's working [09:40:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2087:3316', diff saved to https://phabricator.wikimedia.org/P12476 and previous config saved to /var/cache/conftool/dbconfig/20200903-094043-marostegui.json [09:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:48] 👍 [09:41:44] (03PS3) 10Jbond: cfssl: convert mapped Hash back to a hash [puppet] - 10https://gerrit.wikimedia.org/r/623974 [09:42:34] 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Marostegui) [09:43:20] (03CR) 10Jbond: [C: 03+2] cfssl: convert mapped Hash back to a hash [puppet] - 10https://gerrit.wikimedia.org/r/623974 (owner: 10Jbond) [09:43:51] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: fix hiera variable name [puppet] - 10https://gerrit.wikimedia.org/r/623980 [09:44:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2087:3317', diff saved to https://phabricator.wikimedia.org/P12477 and previous config saved to /var/cache/conftool/dbconfig/20200903-094435-marostegui.json [09:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: fix hiera variable name [puppet] - 10https://gerrit.wikimedia.org/r/623980 (owner: 10Arturo Borrero Gonzalez) [09:46:31] !log move vlan 1118 IPv4 from ae2.1118 to xe-3/0/4.1118 cr2-eqiad - T261866 [09:46:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:37] T261866: Route cloud-hosts1-b-eqiad vlan through cloudsw - https://phabricator.wikimedia.org/T261866 [09:46:52] arturo: looks like v4 is good too now [09:47:11] XioNoX: ACK [09:47:16] (03PS1) 10Jcrespo: mariadb: Add content to the list of tables to check with db-compare [software] - 10https://gerrit.wikimedia.org/r/623981 (https://phabricator.wikimedia.org/T261917) [09:47:36] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): WDQS Categories reload is failing on thankyouwiki - https://phabricator.wikimedia.org/T261097 (10Zbyszko) a:03Zbyszko [09:47:45] (03CR) 10Marostegui: [C: 03+1] mariadb: Add content to the list of tables to check with db-compare [software] - 10https://gerrit.wikimedia.org/r/623981 (https://phabricator.wikimedia.org/T261917) (owner: 10Jcrespo) [09:47:52] XioNoX: cumin can reach basically all servers [09:48:01] !log move VRRP master from cr1-eqiad:ae2.1118 to cr2-eqiad:xe-3/0/4.1118 - T261866 [09:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2087:3317', diff saved to https://phabricator.wikimedia.org/P12478 and previous config saved to /var/cache/conftool/dbconfig/20200903-094857-marostegui.json [09:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db2087:3316', diff saved to https://phabricator.wikimedia.org/P12479 and previous config saved to /var/cache/conftool/dbconfig/20200903-095015-marostegui.json [09:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:21] (03PS1) 10Arturo Borrero Gonzalez: cumin: aliases: include A:cloudceph in the general A:cloud-eqiad1 alias [puppet] - 10https://gerrit.wikimedia.org/r/623982 [09:50:31] PROBLEM - VRRP status on cr1-eqiad is CRITICAL: VRRP CRITICAL - 2 
misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [09:51:25] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.003149 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:51:40] (03CR) 10Volans: [C: 03+1] "LGTM syntactically" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/623982 (owner: 10Arturo Borrero Gonzalez) [09:55:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db2087:3317', diff saved to https://phabricator.wikimedia.org/P12480 and previous config saved to /var/cache/conftool/dbconfig/20200903-095510-marostegui.json [09:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:43] !log move vlan 1118 from ae2.1118 to xe-3/0/4.1118 cr2-eqiad - T261866 [09:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:47] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventgate-analytics-external to use TLS only - https://phabricator.wikimedia.org/T255871 (10JMeybohm) @Ottomata We would like the HTTP services to be decommissioned from Kubernetes after a service is switched to TLS only in LVS but I think t... 
[09:56:48] T261866: Route cloud-hosts1-b-eqiad vlan through cloudsw - https://phabricator.wikimedia.org/T261866 [09:56:50] er [09:57:05] !log rectification: move vlan 1118 from ae2.1118 to xe-3/0/4.1118 on cr1-eqiad - T261866 [09:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:12] (03CR) 10Jcrespo: [C: 03+2] mariadb: Add content to the list of tables to check with db-compare [software] - 10https://gerrit.wikimedia.org/r/623981 (https://phabricator.wikimedia.org/T261917) (owner: 10Jcrespo) [09:57:44] (03Merged) 10jenkins-bot: mariadb: Add content to the list of tables to check with db-compare [software] - 10https://gerrit.wikimedia.org/r/623981 (https://phabricator.wikimedia.org/T261917) (owner: 10Jcrespo) [09:57:59] RECOVERY - VRRP status on cr1-eqiad is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [09:58:14] (03PS1) 10Jbond: cfssl: move auth keys to correct section [puppet] - 10https://gerrit.wikimedia.org/r/623985 [09:58:50] (03CR) 10Jbond: [C: 03+2] cfssl: move auth keys to correct section [puppet] - 10https://gerrit.wikimedia.org/r/623985 (owner: 10Jbond) [09:58:53] 10Operations, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10JMeybohm) [09:59:14] arturo: ok, everything is done, only cleanups are left to do. Thanks for helping! [09:59:38] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/623982 (owner: 10Arturo Borrero Gonzalez) [10:00:03] XioNoX: OK. thanks you! [10:00:04] mvolz: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200903T1000). 
[10:02:36] (03PS1) 10Ayounsi: Rename ae2.1118 to xe-3/0/4.1118 [homer/public] - 10https://gerrit.wikimedia.org/r/623988 (https://phabricator.wikimedia.org/T261866) [10:03:27] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:05:43] (03CR) 10JMeybohm: [C: 03+1] "Just a nit, LTGM" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/623791 (owner: 10Hnowlan) [10:05:46] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: use newer nftables and kernel [puppet] - 10https://gerrit.wikimedia.org/r/623991 (https://phabricator.wikimedia.org/T261724) [10:06:48] (03CR) 10jerkins-bot: [V: 04-1] cloudgw: use newer nftables and kernel [puppet] - 10https://gerrit.wikimedia.org/r/623991 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [10:07:47] !log re-apply vlan 1118 firewall filter and update OSPF/bootp on cr1/2-eqiad - T261866 [10:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:53] T261866: Route cloud-hosts1-b-eqiad vlan through cloudsw - https://phabricator.wikimedia.org/T261866 [10:08:18] (03PS3) 10Muehlenhoff: Modernise Apache config [puppet] - 10https://gerrit.wikimedia.org/r/615459 [10:09:00] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: use newer nftables and kernel [puppet] - 10https://gerrit.wikimedia.org/r/623991 (https://phabricator.wikimedia.org/T261724) [10:09:37] (03PS1) 10Ayounsi: Only use VRRP bandwidth-threshold on ae links [homer/public] - 10https://gerrit.wikimedia.org/r/623995 (https://phabricator.wikimedia.org/T261866) [10:11:59] 10Operations, 10netops, 10Patch-For-Review: Route cloud-hosts1-b-eqiad vlan through cloudsw - https://phabricator.wikimedia.org/T261866 (10ayounsi) [10:12:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: use newer nftables and kernel [puppet] - 10https://gerrit.wikimedia.org/r/623991 (https://phabricator.wikimedia.org/T261724) 
(owner: 10Arturo Borrero Gonzalez) [10:12:43] (03CR) 10Ayounsi: [C: 03+2] "Tested and works as expected." [homer/public] - 10https://gerrit.wikimedia.org/r/623995 (https://phabricator.wikimedia.org/T261866) (owner: 10Ayounsi) [10:12:56] (03CR) 10Ayounsi: [C: 03+2] "Pushed in prod." [homer/public] - 10https://gerrit.wikimedia.org/r/623988 (https://phabricator.wikimedia.org/T261866) (owner: 10Ayounsi) [10:13:27] (03Merged) 10jenkins-bot: Rename ae2.1118 to xe-3/0/4.1118 [homer/public] - 10https://gerrit.wikimedia.org/r/623988 (https://phabricator.wikimedia.org/T261866) (owner: 10Ayounsi) [10:13:34] (03Merged) 10jenkins-bot: Only use VRRP bandwidth-threshold on ae links [homer/public] - 10https://gerrit.wikimedia.org/r/623995 (https://phabricator.wikimedia.org/T261866) (owner: 10Ayounsi) [10:14:29] 10Operations, 10Epic, 10Maps (Kartotherian), 10Patch-For-Review: Move Kartotherian and Tilerator to Kubernetes - https://phabricator.wikimedia.org/T216826 (10hashar) hi, is that still worked on? Asking cause CI still has to maintain a Jessie based image / NodeJS 6. [10:14:50] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10Jenkins, and 2 others: Review process to fetch Jenkins Debian package from upstream - https://phabricator.wikimedia.org/T260282 (10MoritzMuehlenhoff) [10:17:27] (03PS1) 10Jbond: cfssl: add note about warning in logs [puppet] - 10https://gerrit.wikimedia.org/r/624001 [10:18:42] 10Operations, 10netops, 10Patch-For-Review: Route cloud-hosts1-b-eqiad vlan through cloudsw - https://phabricator.wikimedia.org/T261866 (10ayounsi) 05Open→03Resolved There has been 1 issue: the cr2-eqiad facing interface on cloudsw1-d5 was miss-configured (configured as `L3` instead of `L2` trunk interfa... 
[10:21:36] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/615459 (owner: 10Muehlenhoff) [10:21:39] (03PS1) 10Arturo Borrero Gonzalez: nftables: basefirewall: allow common input traffic [puppet] - 10https://gerrit.wikimedia.org/r/624005 (https://phabricator.wikimedia.org/T261724) [10:22:53] (03PS1) 10Hnowlan: api-gateway: Fix syntax in metrics gathering [deployment-charts] - 10https://gerrit.wikimedia.org/r/624006 (https://phabricator.wikimedia.org/T254910) [10:24:34] (03PS3) 10Giuseppe Lavagetto: Convert cxserver to the new layout [deployment-charts] - 10https://gerrit.wikimedia.org/r/623749 (https://phabricator.wikimedia.org/T258572) [10:24:36] (03PS1) 10Giuseppe Lavagetto: Convert echotore to the new layout [deployment-charts] - 10https://gerrit.wikimedia.org/r/624007 (https://phabricator.wikimedia.org/T258572) [10:24:38] (03PS1) 10Giuseppe Lavagetto: Convert proton to the new layout [deployment-charts] - 10https://gerrit.wikimedia.org/r/624008 (https://phabricator.wikimedia.org/T258572) [10:24:40] (03PS1) 10Giuseppe Lavagetto: Convert wikifeeds to the new layout [deployment-charts] - 10https://gerrit.wikimedia.org/r/624009 [10:24:42] (03PS1) 10Giuseppe Lavagetto: Convert zotero to the new layout [deployment-charts] - 10https://gerrit.wikimedia.org/r/624010 (https://phabricator.wikimedia.org/T258572) [10:25:37] (03CR) 10Effie Mouzeli: [C: 03+1] Add apache config for api.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/623833 (https://phabricator.wikimedia.org/T246945) (owner: 10Hnowlan) [10:26:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] nftables: basefirewall: allow common input traffic [puppet] - 10https://gerrit.wikimedia.org/r/624005 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [10:27:25] (03CR) 10Hnowlan: [C: 03+2] api-gateway: Fix syntax in metrics gathering [deployment-charts] - 10https://gerrit.wikimedia.org/r/624006 (https://phabricator.wikimedia.org/T254910) (owner: 10Hnowlan) 
[10:27:41] (03CR) 10Hnowlan: api-gateway: Fix syntax in metrics gathering [deployment-charts] - 10https://gerrit.wikimedia.org/r/624006 (https://phabricator.wikimedia.org/T254910) (owner: 10Hnowlan) [10:28:14] (03PS2) 10Hnowlan: api-gateway: Fix syntax in metrics gathering [deployment-charts] - 10https://gerrit.wikimedia.org/r/624006 (https://phabricator.wikimedia.org/T254910) [10:30:02] (03CR) 10Hnowlan: [C: 03+2] api-gateway: Fix syntax in metrics gathering [deployment-charts] - 10https://gerrit.wikimedia.org/r/624006 (https://phabricator.wikimedia.org/T254910) (owner: 10Hnowlan) [10:31:14] (03Merged) 10jenkins-bot: api-gateway: Fix syntax in metrics gathering [deployment-charts] - 10https://gerrit.wikimedia.org/r/624006 (https://phabricator.wikimedia.org/T254910) (owner: 10Hnowlan) [10:34:48] (03PS1) 10MSantos: push-notif: drop support to statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/624012 (https://phabricator.wikimedia.org/T260807) [10:36:08] (03PS8) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 1/4 [puppet] - 10https://gerrit.wikimedia.org/r/623631 (https://phabricator.wikimedia.org/T256973) [10:37:45] (03CR) 10ArielGlenn: [cirrusdumps] use temp dir and add better error handling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/623783 (https://phabricator.wikimedia.org/T260986) (owner: 10DCausse) [10:38:13] (03PS7) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 2/4 [puppet] - 10https://gerrit.wikimedia.org/r/623632 (https://phabricator.wikimedia.org/T256973) [10:38:58] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[10:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:34] (03PS1) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 3/4 [puppet] - 10https://gerrit.wikimedia.org/r/624014 (https://phabricator.wikimedia.org/T256973) [10:50:00] !log disabling apache on appservers for rollout of https://gerrit.wikimedia.org/r/c/operations/puppet/+/623833 [10:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:56] ugh disabling apache [10:51:03] I mean disabling PUPPET [10:53:14] lol [10:53:18] (03CR) 10Hnowlan: [C: 03+2] Add apache config for api.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/623833 (https://phabricator.wikimedia.org/T246945) (owner: 10Hnowlan) [10:55:11] (03PS3) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 4/4 [puppet] - 10https://gerrit.wikimedia.org/r/623773 (https://phabricator.wikimedia.org/T256973) [10:55:27] (03Abandoned) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 3/4 [puppet] - 10https://gerrit.wikimedia.org/r/623634 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [10:56:38] (03PS1) 10Giuseppe Lavagetto: Convert sessionstore to the new layout [deployment-charts] - 10https://gerrit.wikimedia.org/r/624017 (https://phabricator.wikimedia.org/T258572) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200903T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. 
[11:04:51] (03PS7) 10Giuseppe Lavagetto: helmfile.d: refactor eventgate-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/619437 (https://phabricator.wikimedia.org/T258572) [11:04:53] (03CR) 10KartikMistry: Add --notify-age-in-days option to notify users before draft purge (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/622528 (https://phabricator.wikimedia.org/T261189) (owner: 10KartikMistry) [11:08:36] (03PS7) 10Gilles: Lossy optimisation of Wikipedia logos static PNGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) [11:11:20] is anyone going to deploy something? [11:11:55] I'd like to deploy a config change, seems like a good time to do it since this window's list is empty [11:16:48] I'll take that as a no [11:16:50] (03CR) 10Gilles: [C: 03+2] Lossy optimisation of Wikipedia logos static PNGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [11:17:49] (03Merged) 10jenkins-bot: Lossy optimisation of Wikipedia logos static PNGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [11:21:11] gilles: please ping me once you're ready, I'd like to deploy something [11:21:15] *done [11:21:26] Urbanecm: it's going out already, almost done [11:21:33] cool! 
[11:21:48] !log gilles@deploy1001 Synchronized static/images/project-logos: T252108 Deploying lossily optimised Wikipedia logos (duration: 01m 20s) [11:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:53] Urbanecm: done [11:21:56] T252108: Optimise production wiki logos - https://phabricator.wikimedia.org/T252108 [11:21:57] thanks [11:22:00] (03CR) 10Urbanecm: [C: 03+2] Lift IP cap on 2020-09-08 for Senior Citizen Write Wikipedia course - cs.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623832 (https://phabricator.wikimedia.org/T261882) (owner: 10Urbanecm) [11:22:46] (03CR) 10Urbanecm: [C: 03+2] Add extra namespaces for jawikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623921 (https://phabricator.wikimedia.org/T260320) (owner: 10Jon Harald Søby) [11:22:50] (03Merged) 10jenkins-bot: Lift IP cap on 2020-09-08 for Senior Citizen Write Wikipedia course - cs.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623832 (https://phabricator.wikimedia.org/T261882) (owner: 10Urbanecm) [11:23:32] (03Merged) 10jenkins-bot: Add extra namespaces for jawikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623921 (https://phabricator.wikimedia.org/T260320) (owner: 10Jon Harald Søby) [11:26:14] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: 976d7350a7252610e4ba34e9227e205d085a609a: Lift IP cap on 2020-09-08 for Senior Citizen Write Wikipedia course - cs.wikipedia (T261882) (duration: 01m 01s) [11:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:20] T261882: Lift IP cap on 2020-09-08 for Senior Citizen Write Wikipedia course - cs.wikipedia - https://phabricator.wikimedia.org/T261882 [11:28:14] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 04281a0875d34e1161f44697f732d898ab12d4f0: Add extra namespaces for jawikivoyage (T260320) (duration: 01m 01s) [11:28:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:20] T260320: Create Wikivoyage Japanese - https://phabricator.wikimedia.org/T260320 [11:28:53] !log installing PHP 7.0 security updates [11:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:16] !log [urbanecm@mwmaint2001 ~]$ mwscript namespaceDupes.php --wiki=jawikivoyage --fix | phaste # T260320 # P12481 [11:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:22] T260320: Create Wikivoyage Japanese - https://phabricator.wikimedia.org/T260320 [11:35:21] * Urbanecm is done with deployment [11:40:26] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: protect sets in templates against empty puppet vars [puppet] - 10https://gerrit.wikimedia.org/r/624025 (https://phabricator.wikimedia.org/T261724) [11:41:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: protect sets in templates against empty puppet vars [puppet] - 10https://gerrit.wikimedia.org/r/624025 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [11:42:49] (03PS2) 10Jbond: cfssl: add note about warning in logs [puppet] - 10https://gerrit.wikimedia.org/r/624001 [11:42:51] (03PS1) 10Jbond: cfssl: add ocsp responder servie [puppet] - 10https://gerrit.wikimedia.org/r/624026 (https://phabricator.wikimedia.org/T259117) [11:43:24] (03CR) 10Jbond: [C: 03+2] cfssl: add note about warning in logs [puppet] - 10https://gerrit.wikimedia.org/r/624001 (owner: 10Jbond) [11:45:28] !log installing net-snmp security updates on Buster [11:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:47] !log installing net-snmp security updates on Stretch [11:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:25] (03PS2) 10Jbond: cfssl: add ocsp responder servie [puppet] - 10https://gerrit.wikimedia.org/r/624026 (https://phabricator.wikimedia.org/T259117) [11:49:53] 10Operations, 10Analytics: Create 
analytics-announce@wikimedia.org - https://phabricator.wikimedia.org/T261946 (10elukey) [11:50:41] (03PS3) 10Jbond: cfssl: add ocsp responder servie [puppet] - 10https://gerrit.wikimedia.org/r/624026 (https://phabricator.wikimedia.org/T259117) [11:52:01] (03CR) 10Jbond: [C: 03+2] cfssl: add ocsp responder servie [puppet] - 10https://gerrit.wikimedia.org/r/624026 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [11:53:07] (03PS1) 10Ssingh: dnsdist: add validate_cmd attribute to the dnsdist.conf resource [puppet] - 10https://gerrit.wikimedia.org/r/624028 [11:53:45] (03PS1) 10Arturo Borrero Gonzalez: nftables: basefirewall: allow monitoring hosts [puppet] - 10https://gerrit.wikimedia.org/r/624029 (https://phabricator.wikimedia.org/T261724) [11:54:56] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/24914/" [puppet] - 10https://gerrit.wikimedia.org/r/624028 (owner: 10Ssingh) [11:54:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] nftables: basefirewall: allow monitoring hosts [puppet] - 10https://gerrit.wikimedia.org/r/624029 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [11:57:05] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:53] 10Operations, 10Analytics: Create analytics-announce@wikimedia.org - https://phabricator.wikimedia.org/T261946 (10elukey) Looks like it worked :) https://lists.wikimedia.org/mailman/listinfo/analytics-announce [11:59:55] PROBLEM - Host db2125 is DOWN: PING CRITICAL - Packet loss = 100% [12:00:27] checking [12:01:15] ACKNOWLEDGEMENT - Host db2125 is DOWN: PING CRITICAL - Packet loss = 100% Kormat Who knows. 
[12:02:24] hw issue [12:02:44] (03PS1) 10Jbond: cfssl: add ocsp responder dump file [puppet] - 10https://gerrit.wikimedia.org/r/624033 (https://phabricator.wikimedia.org/T259117) [12:03:05] !log kormat@cumin1001 dbctl commit (dc=all): 'Depool db2125 after hw issue', diff saved to https://phabricator.wikimedia.org/P12483 and previous config saved to /var/cache/conftool/dbconfig/20200903-120304-kormat.json [12:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:51] (03CR) 10Giuseppe Lavagetto: [C: 03+1] sre.discovery: Refactor [cookbooks] - 10https://gerrit.wikimedia.org/r/621721 (https://phabricator.wikimedia.org/T260663) (owner: 10JMeybohm) [12:03:52] kormat: that's the host that had CPU errors a few days ago, right? [12:04:04] (03CR) 10Jbond: [C: 03+2] cfssl: add ocsp responder dump file [puppet] - 10https://gerrit.wikimedia.org/r/624033 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [12:04:05] yes. [12:04:08] :( [12:04:14] (03PS2) 10Jbond: cfssl: add ocsp responder dump file [puppet] - 10https://gerrit.wikimedia.org/r/624033 (https://phabricator.wikimedia.org/T259117) [12:04:27] kormat: I will get the task reopened then [12:04:39] i'm on it. [12:04:47] kormat: <3 thanks [12:06:19] (03CR) 10DCausse: [cirrusdumps] use temp dir and add better error handling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/623783 (https://phabricator.wikimedia.org/T260986) (owner: 10DCausse) [12:06:31] (03PS4) 10DCausse: [cirrusdumps] use temp dir and add better error handling [puppet] - 10https://gerrit.wikimedia.org/r/623783 (https://phabricator.wikimedia.org/T260986) [12:06:32] 10Operations, 10ops-codfw, 10DBA, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Kormat) 05Resolved→03Open This happened again. ` racadm>>racadm getsel Record: 1 Date/Time: 08/18/2020 15:23:07 Source: system Severity: Ok Desc... 
[12:07:07] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/624028 (owner: 10Ssingh) [12:08:15] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:08:40] 10Operations, 10ops-codfw, 10DBA, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Kormat) The full lclogs are here: {{P12484}} [12:08:44] (03CR) 10Ssingh: [C: 03+2] dnsdist: add validate_cmd attribute to the dnsdist.conf resource [puppet] - 10https://gerrit.wikimedia.org/r/624028 (owner: 10Ssingh) [12:09:16] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, but I didn't run a full diff." [deployment-charts] - 10https://gerrit.wikimedia.org/r/621286 (https://phabricator.wikimedia.org/T258572) (owner: 10JMeybohm) [12:09:25] (03PS1) 10Hnowlan: Revert "Add apache config for api.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/623894 [12:17:39] !log installing openexr security updates for stretch [12:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:02] (03PS1) 10Hnowlan: mediawiki: place api-portal entry above wikimedia catchall [puppet] - 10https://gerrit.wikimedia.org/r/624037 (https://phabricator.wikimedia.org/T246945) [12:19:17] !log kormat@cumin1001 dbctl commit (dc=all): 'Shift weights in s2 codfw to account for db2125 being down T260670', diff saved to https://phabricator.wikimedia.org/P12485 and previous config saved to /var/cache/conftool/dbconfig/20200903-121916-kormat.json [12:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:23] T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 [12:20:17] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM. 
Nice idea adding a comment :)" [puppet] - 10https://gerrit.wikimedia.org/r/624037 (https://phabricator.wikimedia.org/T246945) (owner: 10Hnowlan) [12:21:10] (03CR) 10Hnowlan: [C: 03+2] mediawiki: place api-portal entry above wikimedia catchall [puppet] - 10https://gerrit.wikimedia.org/r/624037 (https://phabricator.wikimedia.org/T246945) (owner: 10Hnowlan) [12:21:21] (03CR) 10Kormat: "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/623784 (owner: 10Kormat) [12:21:59] (03PS1) 10Marostegui: db2125: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/624038 (https://phabricator.wikimedia.org/T260670) [12:22:01] kormat: ^ makes sense? [12:22:52] oh, yes! [12:23:08] (03CR) 10Kormat: [C: 03+1] db2125: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/624038 (https://phabricator.wikimedia.org/T260670) (owner: 10Marostegui) [12:23:20] (03CR) 10Marostegui: [C: 03+2] db2125: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/624038 (https://phabricator.wikimedia.org/T260670) (owner: 10Marostegui) [12:23:24] (03PS1) 10Muehlenhoff: Add library hint for openexr [puppet] - 10https://gerrit.wikimedia.org/r/624042 [12:24:33] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) a:05Kormat→03Papaul [12:26:09] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add --notify-age-in-days option to notify users before draft purge [puppet] - 10https://gerrit.wikimedia.org/r/622528 (https://phabricator.wikimedia.org/T261189) (owner: 10KartikMistry) [12:26:24] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for openexr [puppet] - 10https://gerrit.wikimedia.org/r/624042 (owner: 10Muehlenhoff) [12:26:28] (03CR) 10Filippo Giunchedi: [C: 03+1] alerts: combine alerts.wm.o and icinga.wm.o certificates [puppet] - 10https://gerrit.wikimedia.org/r/623848 (https://phabricator.wikimedia.org/T261342) (owner: 10Herron) 
[12:26:37] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add --notify-age-in-days option to notify users before draft purge (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/622528 (https://phabricator.wikimedia.org/T261189) (owner: 10KartikMistry) [12:27:02] (03PS1) 10MSantos: WIP: push-notif: add stanzas for requests to MWAPI [deployment-charts] - 10https://gerrit.wikimedia.org/r/624043 (https://phabricator.wikimedia.org/T260247) [12:28:46] (03CR) 10MSantos: WIP: push-notif: add stanzas for requests to MWAPI (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/624043 (https://phabricator.wikimedia.org/T260247) (owner: 10MSantos) [12:28:58] (03Abandoned) 10Hnowlan: Revert "Add apache config for api.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/623894 (owner: 10Hnowlan) [12:30:08] !log enabling puppet on appservers, finished rollout of api.wikimedia.org https://gerrit.wikimedia.org/r/c/operations/puppet/+/623833 [12:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:26] (03CR) 10Alexandros Kosiaris: [C: 04-1] "native prometheus! Nice! Minor inline comment." 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/624012 (https://phabricator.wikimedia.org/T260807) (owner: 10MSantos) [12:39:16] 10Operations, 10SRE-swift-storage, 10Patch-For-Review, 10User-fgiunchedi: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 (10fgiunchedi) [12:39:32] (03PS3) 10Volans: Cleanup leftover record hhvm-api [dns] - 10https://gerrit.wikimedia.org/r/623765 (https://phabricator.wikimedia.org/T244153) [12:40:11] (03CR) 10Volans: [C: 03+2] Cleanup leftover record hhvm-api [dns] - 10https://gerrit.wikimedia.org/r/623765 (https://phabricator.wikimedia.org/T244153) (owner: 10Volans) [12:40:30] 10Operations, 10SRE-swift-storage, 10Patch-For-Review, 10User-fgiunchedi: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts: ` ms-be2057.codfw.wmnet ` The log can be f... [12:46:28] !log Deploy MCR schema change on s7 eqiad master (lag might show up) - T238966 [12:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:34] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. 
- https://phabricator.wikimedia.org/T238966 [12:48:05] (03CR) 10Marostegui: wmnet: Promote db1128 to m5-master [dns] - 10https://gerrit.wikimedia.org/r/623759 (https://phabricator.wikimedia.org/T260324) (owner: 10Marostegui) [12:48:08] (03PS2) 10Marostegui: wmnet: Promote db1128 to m5-master [dns] - 10https://gerrit.wikimedia.org/r/623759 (https://phabricator.wikimedia.org/T260324) [12:48:15] (03CR) 10Marostegui: mariadb: Promote db1128 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/623757 (https://phabricator.wikimedia.org/T260324) (owner: 10Marostegui) [12:48:21] (03PS2) 10Marostegui: mariadb: Promote db1128 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/623757 (https://phabricator.wikimedia.org/T260324) [12:49:55] RECOVERY - Host db2125 is UP: PING OK - Packet loss = 0%, RTA = 34.07 ms [12:51:29] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Kormat) No response on console, idrac says power is on. I've tried `serveraction hardreset`, but no response on console. `serveraction powerdo... 
[12:51:37] (03CR) 10Marostegui: [C: 03+1] "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/623784 (owner: 10Kormat) [12:52:30] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) That sounds like mainboard to me :( [12:53:36] (03CR) 10Alexandros Kosiaris: [C: 03+1] configmaster: add helm-charts to disc_desired_state.py [puppet] - 10https://gerrit.wikimedia.org/r/623958 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [12:59:39] (03CR) 10Hashar: [C: 03+1] backup_mariadb: Use path to find backup_mariadb.py [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/623754 (https://phabricator.wikimedia.org/T165358) (owner: 10Jcrespo) [13:02:56] 10Operations, 10SRE-swift-storage, 10Patch-For-Review, 10User-fgiunchedi: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2057.codfw.wmnet'] ` Of which those **FAILED**: ` ['ms-be2057.codfw.wmnet'] ` [13:08:06] andrewbogott: I am going to start with the pre m5 failover steps [13:08:11] Nothing required from your side :) [13:08:26] !log Start pre m5 failover steps T260324 [13:08:30] Ok! [13:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:32] T260324: Upgrade m5 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T260324 [13:18:03] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1128 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/623757 (https://phabricator.wikimedia.org/T260324) (owner: 10Marostegui) [13:31:56] (03CR) 10Kormat: [C: 03+2] cumin: Refactor db aliases. 
[puppet] - 10https://gerrit.wikimedia.org/r/623784 (owner: 10Kormat) [13:32:43] (03PS1) 10Elukey: install_server: add reuse recipe for stat100x hosts with 4 disks [puppet] - 10https://gerrit.wikimedia.org/r/624061 (https://phabricator.wikimedia.org/T255028) [13:32:54] !log jmm@deploy1001 Started deploy [debmonitor/deploy@25dbd20]: deploy to new buster host [13:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:59] !log jmm@deploy1001 Finished deploy [debmonitor/deploy@25dbd20]: deploy to new buster host (duration: 00m 05s) [13:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:29] (03PS9) 10Kormat: mariadb: Create profile::mariadb::packages_wmf [puppet] - 10https://gerrit.wikimedia.org/r/622972 (https://phabricator.wikimedia.org/T256972) [13:34:31] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventgate-analytics-external to use TLS only - https://phabricator.wikimedia.org/T255871 (10Ottomata) Sure! Although I have to admit I don't know what this means. It already runs envoyproxy as a sidecar for TLS. Is this so that its outgoin... [13:34:32] (03PS6) 10Kormat: mariadb: Allow overriding of wmf-mariadb version in hiera [puppet] - 10https://gerrit.wikimedia.org/r/622995 (https://phabricator.wikimedia.org/T256972) [13:34:34] (03PS6) 10Kormat: mariadb: Add profile::mariadb::packages_client [puppet] - 10https://gerrit.wikimedia.org/r/623567 (https://phabricator.wikimedia.org/T256972) [13:34:36] (03PS5) 10Kormat: mariadb: Make mariadb::config basedir required. 
[puppet] - 10https://gerrit.wikimedia.org/r/623582 (https://phabricator.wikimedia.org/T256972) [13:34:38] (03PS12) 10Kormat: mariadb: simplify mariadb::service [puppet] - 10https://gerrit.wikimedia.org/r/622569 (https://phabricator.wikimedia.org/T256972) [13:35:00] In around 25 minutes I will put wikitech on read only for a around 1 minute to failover its master [13:38:14] (03CR) 10Kormat: [C: 03+2] Remove wmfbackups from wmfmariadbpy repo [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/623752 (owner: 10Jcrespo) [13:39:15] !log jmm@deploy1001 Started deploy [debmonitor/deploy@25dbd20]: deploy to new buster host, now the --force is with me [13:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:53] (03Merged) 10jenkins-bot: Remove wmfbackups from wmfmariadbpy repo [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/623752 (owner: 10Jcrespo) [13:40:44] !log jmm@deploy1001 Finished deploy [debmonitor/deploy@25dbd20]: deploy to new buster host, now the --force is with me (duration: 01m 29s) [13:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:14] moritzm: 10+ with final applause [13:41:48] (for the deploy msg) [13:41:56] 10Operations, 10Analytics, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventgate-analytics-external to use TLS only - https://phabricator.wikimedia.org/T255871 (10Ottomata) a:03Ottomata [13:42:28] I still didn't work, I clearly need to switch sides :-) [13:42:54] (03CR) 10Jakob: [C: 03+1] "Looks sensible" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618245 (https://phabricator.wikimedia.org/T258354) (owner: 10Tobias Andersson) [13:43:05] !log jmm@deploy1001 Started deploy [debmonitor/deploy@fb64c52]: deploy to new buster host [13:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:23] !log jmm@deploy1001 Finished deploy [debmonitor/deploy@fb64c52]: deploy to new buster host (duration: 00m 18s) [13:43:26] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:35] RECOVERY - debmonitor.wikimedia.org:7443 CDN on debmonitor2002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 505 bytes in 0.165 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [13:46:57] (03CR) 10Gehel: [C: 04-1] elasticsearch: Let spicerack handle wait for all write queues to clear (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239) (owner: 10Ryan Kemper) [13:47:06] (03CR) 10Kormat: [C: 03+1] "One minor nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/624061 (https://phabricator.wikimedia.org/T255028) (owner: 10Elukey) [13:47:45] RECOVERY - debmonitor.wikimedia.org:80 on debmonitor2002 is OK: HTTP OK: Status line output matched HTTP/1.1 301 - 274 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [13:52:43] If someone needs to merge a DNS change, please talk to me first, I am about to merge but not deploy the m5 DNS change for the failover in around 3 minutes [13:53:00] (03PS1) 10Muehlenhoff: Add debmonitor1002 as debmonitor server [puppet] - 10https://gerrit.wikimedia.org/r/624066 (https://phabricator.wikimedia.org/T261489) [13:53:45] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:53:46] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:12] (03CR) 10Marostegui: [C: 03+2] wmnet: Promote db1128 to m5-master [dns] - 10https://gerrit.wikimedia.org/r/623759 (https://phabricator.wikimedia.org/T260324) (owner: 10Marostegui) [14:00:04] marostegui and andrewbogott: Time to snap out of that daydream and deploy m5 database master failover. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200903T1400). 
[14:00:09] o/ [14:00:23] jouncebot seems grumpy today [14:00:32] he doesn't like databases :) [14:00:35] (03CR) 10MSantos: push-notif: drop support to statsd-exporter (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/624012 (https://phabricator.wikimedia.org/T260807) (owner: 10MSantos) [14:00:35] andrewbogott: let's go then? [14:00:40] yep! [14:00:54] !log Failover m5 (wikitech) master - T260324 [14:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:59] T260324: Upgrade m5 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T260324 [14:01:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set wikitech as read-only for maintenance T260324', diff saved to https://phabricator.wikimedia.org/P12487 and previous config saved to /var/cache/conftool/dbconfig/20200903-140135-marostegui.json [14:01:42] marostegui@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:02:08] ahahaha [14:02:15] I wonder why that couldn't edit the SAL 🤔 [14:02:16] 🤔 [14:02:49] <_joe_> 🤔 [14:03:39] (03CR) 10Elukey: install_server: add reuse recipe for stat100x hosts with 4 disks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/624061 (https://phabricator.wikimedia.org/T255028) (owner: 10Elukey) [14:04:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1128 to wikitech master T260324', diff saved to https://phabricator.wikimedia.org/P12488 and previous config saved to /var/cache/conftool/dbconfig/20200903-140411-marostegui.json [14:04:15] marostegui@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[14:04:20] (03PS2) 10Elukey: install_server: add reuse recipe for stat100x hosts with 4 disks [puppet] - 10https://gerrit.wikimedia.org/r/624061 (https://phabricator.wikimedia.org/T255028) [14:04:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1128 to wikitech master T260324', diff saved to https://phabricator.wikimedia.org/P12489 and previous config saved to /var/cache/conftool/dbconfig/20200903-140436-marostegui.json [14:04:40] marostegui@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:04:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set wikitech back to RW after maintenance T260324', diff saved to https://phabricator.wikimedia.org/P12490 and previous config saved to /var/cache/conftool/dbconfig/20200903-140451-marostegui.json [14:04:55] marostegui@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:05:12] andrewbogott: wikitech is done, I am now merging the dns change for m5 [14:05:15] andrewbogott: done [14:05:36] (03CR) 10Kormat: [C: 03+1] "LGTM, bonne chance :)" [puppet] - 10https://gerrit.wikimedia.org/r/624061 (https://phabricator.wikimedia.org/T255028) (owner: 10Elukey) [14:05:42] marostegui: lgtm, I can edit things [14:05:43] I can edit wikitech fine [14:05:59] andrewbogott: the TTL for the m5-master change is 1M so your services should start connecting to db1128 now I think [14:07:14] (03CR) 10Elukey: [C: 03+2] install_server: add reuse recipe for stat100x hosts with 4 disks [puppet] - 10https://gerrit.wikimedia.org/r/624061 (https://phabricator.wikimedia.org/T255028) (owner: 10Elukey) [14:07:20] dns changes can flow now normally [14:09:46] thanks marostegui [14:10:04] andrewbogott: thank you for the support! 
[14:10:21] Going to continue with the clean up tasks [14:10:32] (03PS1) 10Elukey: install_server: fix reuse-analytics-stat-4dev.cfg recipe [puppet] - 10https://gerrit.wikimedia.org/r/624077 (https://phabricator.wikimedia.org/T255028) [14:11:16] (03CR) 10Elukey: [C: 03+2] install_server: fix reuse-analytics-stat-4dev.cfg recipe [puppet] - 10https://gerrit.wikimedia.org/r/624077 (https://phabricator.wikimedia.org/T255028) (owner: 10Elukey) [14:11:23] (03PS2) 10Muehlenhoff: Add debmonitor1002 as debmonitor server [puppet] - 10https://gerrit.wikimedia.org/r/624066 (https://phabricator.wikimedia.org/T261489) [14:11:24] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [14:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:04] andrewbogott: everything looking good on your side? [14:13:38] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:55] yep [14:14:04] andrewbogott: cool! 
[14:16:21] (03CR) 10Muehlenhoff: [C: 03+2] Add debmonitor1002 as debmonitor server [puppet] - 10https://gerrit.wikimedia.org/r/624066 (https://phabricator.wikimedia.org/T261489) (owner: 10Muehlenhoff) [14:16:43] (03PS1) 10Marostegui: db1133: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/624079 (https://phabricator.wikimedia.org/T260324) [14:17:18] (03CR) 10Marostegui: [C: 03+2] db1133: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/624079 (https://phabricator.wikimedia.org/T260324) (owner: 10Marostegui) [14:26:06] (03PS1) 10Volans: scripts: allocate IPs, add Cassandra support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/624087 (https://phabricator.wikimedia.org/T258729) [14:29:05] !log jmm@deploy1001 Started deploy [debmonitor/deploy@fb64c52]: deploy to new buster host [14:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:12] !log jmm@deploy1001 Finished deploy [debmonitor/deploy@fb64c52]: deploy to new buster host (duration: 00m 06s) [14:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:48] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [14:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:47] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [14:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:28] (03CR) 10Muehlenhoff: [C: 03+2] Modernise Apache config [puppet] - 10https://gerrit.wikimedia.org/r/615459 (owner: 10Muehlenhoff) [14:39:00] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "we can safely assume that next reimages for cloudvirt servers would be to migrate them to Debian Buster." 
[puppet] - 10https://gerrit.wikimedia.org/r/622974 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [14:42:35] (03CR) 10Muehlenhoff: [C: 03+2] Turnilo: Remove exception for OPTIONS [puppet] - 10https://gerrit.wikimedia.org/r/615461 (owner: 10Muehlenhoff) [14:42:45] (03PS2) 10Muehlenhoff: Turnilo: Remove exception for OPTIONS [puppet] - 10https://gerrit.wikimedia.org/r/615461 [14:45:33] (03CR) 10Muehlenhoff: [C: 03+2] Remove apt::pin for ceph packages in openstack/rocky/stretch [puppet] - 10https://gerrit.wikimedia.org/r/622974 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [14:49:09] 10Operations, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10MoritzMuehlenhoff) [14:51:23] 10Operations, 10ops-codfw, 10RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (10Papaul) @akosiaris since the HW problem is resolved can I close this task? Also can you please open a new decom task for the old restbase2009 with asset tag wmf6412 https://netbox.wikimedia.org/dcim/devic... 
[14:53:05] (03PS1) 10Ottomata: Add $datacenter parameter to wmflib::service:get_url [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) [14:53:13] !log power down ores2001 for DIMM upgrade [14:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:15] (03PS2) 10Ottomata: Add $datacenter parameter to wmflib::service:get_url [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) [14:57:38] (03PS1) 10Hashar: docker:reporter: drop old images filters [puppet] - 10https://gerrit.wikimedia.org/r/624095 [14:57:40] (03PS1) 10Hashar: docker:reporter: do include latest images for releng/* [puppet] - 10https://gerrit.wikimedia.org/r/624096 (https://phabricator.wikimedia.org/T261207) [15:00:37] (03PS1) 10Andrew Bogott: eqiad1 openstack: replace missing monitor classes [puppet] - 10https://gerrit.wikimedia.org/r/624097 (https://phabricator.wikimedia.org/T260200) [15:01:48] (03CR) 10jerkins-bot: [V: 04-1] eqiad1 openstack: replace missing monitor classes [puppet] - 10https://gerrit.wikimedia.org/r/624097 (https://phabricator.wikimedia.org/T260200) (owner: 10Andrew Bogott) [15:04:15] 10Operations, 10Analytics: Create analytics-announce@wikimedia.org - https://phabricator.wikimedia.org/T261946 (10elukey) 05Open→03Resolved [15:07:22] (03PS2) 10Andrew Bogott: eqiad1 openstack: replace missing monitor classes [puppet] - 10https://gerrit.wikimedia.org/r/624097 (https://phabricator.wikimedia.org/T260200) [15:07:49] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) install memory upgrades in ores200[1-9] - https://phabricator.wikimedia.org/T259908 (10Papaul) [15:08:02] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) install memory upgrades in ores200[1-9] - https://phabricator.wikimedia.org/T259908 (10Papaul) ` pt1979@ores2001:~$ free total used free shared buff/cache available Mem: 131941296 2626056 128034636 33420... 
[15:08:33] !log power down ores2002 for DIMM upgrade [15:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:42] (03CR) 10Ottomata: Add $datacenter parameter to wmflib::service:get_url (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:13:19] (03CR) 10Kormat: [C: 03+2] mariadb: Create profile::mariadb::packages_wmf [puppet] - 10https://gerrit.wikimedia.org/r/622972 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [15:14:19] (03CR) 10Herron: [C: 03+2] alerts: combine alerts.wm.o and icinga.wm.o certificates [puppet] - 10https://gerrit.wikimedia.org/r/623848 (https://phabricator.wikimedia.org/T261342) (owner: 10Herron) [15:14:28] (03PS2) 10Herron: alerts: combine alerts.wm.o and icinga.wm.o certificates [puppet] - 10https://gerrit.wikimedia.org/r/623848 (https://phabricator.wikimedia.org/T261342) [15:15:10] (03PS3) 10Andrew Bogott: eqiad1 openstack: replace missing monitor classes [puppet] - 10https://gerrit.wikimedia.org/r/624097 (https://phabricator.wikimedia.org/T260200) [15:17:37] !log installing firejail security updates on parsoid servers [15:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:48] 10Operations, 10Analytics, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventgate-analytics-external to use TLS only - https://phabricator.wikimedia.org/T255871 (10JMeybohm) >>! In T255871#6433346, @Ottomata wrote: > Sure! Although I have to admit I don't know what this means. It already runs env... 
[15:17:58] (03PS4) 10Andrew Bogott: eqiad1 openstack: replace missing monitor classes [puppet] - 10https://gerrit.wikimedia.org/r/624097 (https://phabricator.wikimedia.org/T260200) [15:20:06] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) install memory upgrades in ores200[1-9] - https://phabricator.wikimedia.org/T259908 (10Papaul) [15:21:07] (03CR) 10Andrew Bogott: "pcc results at https://puppet-compiler.wmflabs.org/compiler1003/24918/cloudcontrol1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/624097 (https://phabricator.wikimedia.org/T260200) (owner: 10Andrew Bogott) [15:21:21] (03PS7) 10Kormat: mariadb: Allow overriding of wmf-mariadb version in hiera [puppet] - 10https://gerrit.wikimedia.org/r/622995 (https://phabricator.wikimedia.org/T256972) [15:21:23] (03PS7) 10Kormat: mariadb: Add profile::mariadb::packages_client [puppet] - 10https://gerrit.wikimedia.org/r/623567 (https://phabricator.wikimedia.org/T256972) [15:21:25] (03PS6) 10Kormat: mariadb: Make mariadb::config basedir required. 
[puppet] - 10https://gerrit.wikimedia.org/r/623582 (https://phabricator.wikimedia.org/T256972) [15:21:27] (03PS13) 10Kormat: mariadb: simplify mariadb::service [puppet] - 10https://gerrit.wikimedia.org/r/622569 (https://phabricator.wikimedia.org/T256972) [15:21:59] !log power down ores2003 for DIMM upgrade [15:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:35] (03CR) 10Marostegui: [C: 03+1] mariadb: Allow overriding of wmf-mariadb version in hiera [puppet] - 10https://gerrit.wikimedia.org/r/622995 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [15:23:11] (03CR) 10Kormat: [C: 03+2] mariadb: Allow overriding of wmf-mariadb version in hiera [puppet] - 10https://gerrit.wikimedia.org/r/622995 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [15:25:14] !log installing firejail update (along with restarts) on thumbor1001, maps1001, restbase1016 (and -dev) [15:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:44] (03PS1) 10Joal: Update AQS druid datasource to 2020-08 [puppet] - 10https://gerrit.wikimedia.org/r/624101 [15:28:11] elukey: --^ :) [15:29:41] 10Operations, 10Analytics, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventgate-analytics-external to use TLS only - https://phabricator.wikimedia.org/T255871 (10Ottomata) OH! yes...there was a reason we left HTTP on...I think it was before MW was using a local envoyproxy to do TLS, because PHP... 
[15:30:48] !log installing nginx updates on apt* and htmldumper1001 [15:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:59] joal: let's involve razzi in this so we can explain to him the process [15:31:03] ack elukey [15:33:12] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) install memory upgrades in ores200[1-9] - https://phabricator.wikimedia.org/T259908 (10Papaul) [15:33:52] !log power down ores2004 for DIMM upgrade [15:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:01] 10Operations, 10Traffic, 10User-ArielGlenn: Migrate install_server::web_server (apt*) to nginx-light - https://phabricator.wikimedia.org/T261962 (10MoritzMuehlenhoff) [15:35:19] 10Operations, 10Traffic, 10User-ArielGlenn, 10User-MoritzMuehlenhoff: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456 (10MoritzMuehlenhoff) [15:37:50] 10Operations, 10User-Elukey: Investigate janitor, maintenance emails parser - https://phabricator.wikimedia.org/T230835 (10colewhite) I tried this today. It was unable to parse Zayo or Telia new scheduled maintenance emails, but successfully parsed NTT and GTT new scheduled maintenance emails. At this point,... [15:38:48] (03CR) 10Marostegui: [C: 03+1] mariadb: Add profile::mariadb::packages_client [puppet] - 10https://gerrit.wikimedia.org/r/623567 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [15:39:29] (03CR) 10Hashar: [C: 03+1] Updated some cross references in comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621589 (owner: 10Ahmon Dancy) [15:45:47] (03CR) 10Kormat: [C: 03+2] mariadb: Add profile::mariadb::packages_client [puppet] - 10https://gerrit.wikimedia.org/r/623567 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [15:46:07] (03CR) 10Kormat: [C: 03+2] mariadb: Make mariadb::config basedir required. 
[puppet] - 10https://gerrit.wikimedia.org/r/623582 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [15:51:47] !log power down ores2005 for DIMM upgrade [15:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:09] (03CR) 10CRusnov: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/623805 (owner: 10Herron) [15:59:16] 10Operations, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: Unexplained increase in save times, possibly associated with DC switchover - https://phabricator.wikimedia.org/T261763 (10Krinkle) 05Open→03Resolved Looks good to me now: [Grafana: Save Timing](https://grafana.wikimedia.org/d/000... [15:59:40] 10Operations, 10Cassandra: Move cassandra puppet code to profile::java - https://phabricator.wikimedia.org/T261966 (10elukey) [16:00:04] godog and _joe_: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200903T1600). [16:00:52] <_joe_> uhm didn't we change that? :P [16:01:06] <_joe_> anyways, nothing to merge [16:02:23] (03CR) 10Hashar: [C: 03+1] "I can't +2/merge it ;D" [debs/hue] - 10https://gerrit.wikimedia.org/r/619438 (owner: 10Hashar) [16:02:57] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) install memory upgrades in ores200[1-9] - https://phabricator.wikimedia.org/T259908 (10Papaul) [16:03:01] (03CR) 10CRusnov: "This looks fine but perhaps not cassandra specific? Seem comment." 
(031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/624087 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [16:05:51] !log power down ores2006 for DIMM upgrade [16:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:33] (03CR) 10Herron: [C: 03+2] dns: remove unused ganeti500[123] ipv6 records [dns] - 10https://gerrit.wikimedia.org/r/623805 (owner: 10Herron) [16:12:37] (03PS2) 10Herron: dns: remove unused ganeti500[123] ipv6 records [dns] - 10https://gerrit.wikimedia.org/r/623805 [16:16:53] (03PS1) 10CRusnov: modules/admin/data/nda_audit.py: Fix Python3 pep8 errors [puppet] - 10https://gerrit.wikimedia.org/r/624112 (https://phabricator.wikimedia.org/T247364) [16:17:21] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) install memory upgrades in ores200[1-9] - https://phabricator.wikimedia.org/T259908 (10Papaul) [16:18:11] 10Operations, 10ops-codfw, 10RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (10akosiaris) 05Open→03Resolved Sure. I 've resolved it and filed T261968 for wmf6412. Thanks! [16:18:30] (03CR) 10Elukey: [C: 03+2] Update AQS druid datasource to 2020-08 [puppet] - 10https://gerrit.wikimedia.org/r/624101 (owner: 10Joal) [16:18:55] (03CR) 10CRusnov: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/624112 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [16:24:04] !log roll restart aqs on aqs1* to pick up new druid settings [16:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:14] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx2001.frack.codfw.wmnet, frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T260183 (10Papaul) @Dwisehaupt thnaks [16:32:07] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install kubernetes1017.eqiad.wmnet - https://phabricator.wikimedia.org/T258747 (10Cmjohnson) [16:32:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install kubernetes1017.eqiad.wmnet - https://phabricator.wikimedia.org/T258747 (10Cmjohnson) 05Open→03Resolved Completed [16:33:21] !log power down ores2007 for DIMM upgrade [16:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:06] (03PS1) 10CRusnov: modules/service/files/logstash_checker.py: Fix Python3 PEP8 errors [puppet] - 10https://gerrit.wikimedia.org/r/624116 (https://phabricator.wikimedia.org/T247364) [16:37:14] (03PS1) 10Cmjohnson: updating partman recipe for pki1001 to reflect only have 2 disks [puppet] - 10https://gerrit.wikimedia.org/r/624118 (https://phabricator.wikimedia.org/T259826) [16:37:51] (03CR) 10Cmjohnson: [C: 03+2] updating partman recipe for pki1001 to reflect only have 2 disks [puppet] - 10https://gerrit.wikimedia.org/r/624118 (https://phabricator.wikimedia.org/T259826) (owner: 10Cmjohnson) [16:38:21] (03CR) 10CRusnov: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/624116 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [16:41:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By:2020-08-17) label/setup/install pki1001 - https://phabricator.wikimedia.org/T259826 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` pki1001.eqiad.wmnet ` The log can be found in... [16:44:24] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) install memory upgrades in ores200[1-9] - https://phabricator.wikimedia.org/T259908 (10Papaul) [16:44:30] (03PS1) 10CRusnov: toolforge/gridscripts/runninggridtasks.py: Fix Python3 PEP8 Warning [puppet] - 10https://gerrit.wikimedia.org/r/624122 (https://phabricator.wikimedia.org/T247364) [16:45:12] !log power down ores2008 for DIMM upgrade [16:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:24] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/624122 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [16:46:47] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1100.eqiad.wmnet ` The log can be found in `/va... 
[16:50:59] (03PS2) 10Hnowlan: helmfile_convert_diff: usability fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/623791 [16:52:04] (03PS1) 10Herron: prometheus: switch over to buster kafkamon hosts [puppet] - 10https://gerrit.wikimedia.org/r/623902 (https://phabricator.wikimedia.org/T252773) [16:52:47] (03CR) 10Hnowlan: [C: 03+2] helmfile_convert_diff: usability fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/623791 (owner: 10Hnowlan) [16:54:04] (03CR) 10jerkins-bot: [V: 04-1] helmfile_convert_diff: usability fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/623791 (owner: 10Hnowlan) [16:57:28] (03PS2) 10Herron: prometheus: switch over to buster kafkamon hosts [puppet] - 10https://gerrit.wikimedia.org/r/623902 (https://phabricator.wikimedia.org/T252773) [16:58:12] (03PS3) 10Hnowlan: helmfile_convert_diff: usability fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/623791 [16:59:04] (03CR) 10Volans: ""latest" is just a symlink to a specific tag that changes over time and debmonitor has no clue to which one. I think they were excluded on" [puppet] - 10https://gerrit.wikimedia.org/r/624096 (https://phabricator.wikimedia.org/T261207) (owner: 10Hashar) [16:59:51] no [17:00:05] chrisalbon and accraze: #bothumor I ♥ Unicode. All rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200903T1700).
[17:00:47] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1100.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1100.eqiad.wmnet'] ` [17:01:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By:2020-08-17) label/setup/install pki1001 - https://phabricator.wikimedia.org/T259826 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['pki1001.eqiad.wmnet'] ` Of which those **FAILED**: ` ['pki1001.eqiad.wmnet'] ` [17:02:11] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By:2020-08-17) label/setup/install pki1001 - https://phabricator.wikimedia.org/T259826 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` pki1001.eqiad.wmnet ` The log can be found in... [17:02:19] !log power down ores2009 for DIMM upgrade [17:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:44] (03CR) 10Hnowlan: [C: 03+2] helmfile_convert_diff: usability fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/623791 (owner: 10Hnowlan) [17:03:48] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) install memory upgrades in ores200[1-9] - https://phabricator.wikimedia.org/T259908 (10Papaul) [17:03:50] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1100.eqiad.wmnet ` The log can be found in `/va... 
[17:04:49] (03CR) 10Herron: [C: 03+2] prometheus: switch over to buster kafkamon hosts [puppet] - 10https://gerrit.wikimedia.org/r/623902 (https://phabricator.wikimedia.org/T252773) (owner: 10Herron) [17:05:20] (03Merged) 10jenkins-bot: helmfile_convert_diff: usability fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/623791 (owner: 10Hnowlan) [17:07:17] (03CR) 10Volans: "reply inline" (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/624087 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [17:13:07] (03PS3) 10Ottomata: Add $datacenter parameter to wmflib::service:get_url [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) [17:13:35] PROBLEM - Host kubernetes2004 is DOWN: PING CRITICAL - Packet loss = 100% [17:14:18] (03PS4) 10Ottomata: Add $datacenter parameter to wmflib::service:get_url [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) [17:15:48] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1100.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1100.eqiad.wmnet'] ` [17:16:01] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:50] 10Operations, 10Analytics-Clusters, 10Analytics-Radar, 10observability, 10Patch-For-Review: Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (10herron) [17:19:04] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:19:07] 10Operations, 10Analytics-Clusters, 10Analytics-Radar, 10observability, 10Patch-For-Review: Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (10herron) The buster kafkamon hosts are now live. 
Will let them settle for a bit and then move on to cleanup/teardown of the old h... [17:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:56] (03CR) 10Xqt: toolforge/gridscripts/runninggridtasks.py: Fix Python3 PEP8 Warning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/624122 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [17:19:57] RECOVERY - Host kubernetes2004 is UP: PING OK - Packet loss = 0%, RTA = 33.41 ms [17:20:09] k8s2004 was me, by accident: I was trying to get ores2009 out and bumped into k8s2004 [17:20:26] thanks for noting papaul [17:20:51] (03PS1) 10Mholloway: Update mobileapps to 2020-09-03-133327-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/624145 [17:22:14] jayme: np [17:23:31] (03CR) 10Mholloway: [C: 03+2] Update mobileapps to 2020-09-03-133327-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/624145 (owner: 10Mholloway) [17:24:50] (03Merged) 10jenkins-bot: Update mobileapps to 2020-09-03-133327-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/624145 (owner: 10Mholloway) [17:26:07] (03PS1) 10Volans: dns: generate records for all VMs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/624154 (https://phabricator.wikimedia.org/T244153) [17:28:17] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1096.eqiad.wmnet ` The log can be found in `/va... [17:28:51] !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[17:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:31] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1097.eqiad.wmnet ` The log can be found in `/va... [17:30:00] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) install memory upgrades in ores200[1-9] - https://phabricator.wikimedia.org/T259908 (10Papaul) [17:30:46] (03PS1) 10Ottomata: [WIP] canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) [17:31:07] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1098.eqiad.wmnet ` The log can be found in `/va... [17:31:34] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) install memory upgrades in ores200[1-9] - https://phabricator.wikimedia.org/T259908 (10Papaul) a:05Papaul→03akosiaris @akosiaris All yours. If all is good, please go ahead and resolve the task. Thanks [17:31:54] (03CR) 10jerkins-bot: [V: 04-1] [WIP] canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [17:32:08] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [17:32:08] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[17:32:08] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1099.eqiad.wmnet ` The log can be found in `/va... [17:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:38] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1100.eqiad.wmnet ` The log can be found in `/va... [17:35:59] 10Operations, 10OTRS, 10serviceops, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10akosiaris) There don't seem to have been any actionable comments regarding the test installation, either in phabricator or OTRS Cafe. In th... [17:36:22] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [17:36:22] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
[17:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:22] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Papaul) I will open a case with Dell [17:37:51] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1101.eqiad.wmnet ` The log can be found in `/va... [17:38:39] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By:2020-08-17) label/setup/install pki1001 - https://phabricator.wikimedia.org/T259826 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['pki1001.eqiad.wmnet'] ` and were **ALL** successful. [17:40:20] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) install memory upgrades in ores200[1-9] - https://phabricator.wikimedia.org/T259908 (10akosiaris) 05Open→03Resolved Looking at graphs and metrics, the operation was noticeable but not causing any issues. some scores errored but not particularly more tha... 
[17:41:43] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:35] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:53] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:35] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:44] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:45:35] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:48] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:46:05] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:08] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [17:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:05] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:14] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:15] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:50] PROBLEM - kubelet operational latencies on kubernetes2011 is CRITICAL: instance=kubernetes2011.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:58:52] PROBLEM - kubelet operational latencies on kubernetes2008 is CRITICAL: instance=kubernetes2008.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200903T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:04] RECOVERY - kubelet operational latencies on kubernetes2011 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:00:04] RECOVERY - kubelet operational latencies on kubernetes2008 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:05:08] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1096.eqiad.wmnet'] ` and were **ALL** successful. [18:06:23] (03CR) 10Dzahn: [C: 03+2] openstack::cumin: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/621369 (owner: 10Dzahn) [18:06:31] (03PS2) 10Dzahn: openstack::cumin: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/621369 [18:06:39] (03CR) 10Dzahn: [C: 03+2] "ack, thanks Andrew!" [puppet] - 10https://gerrit.wikimedia.org/r/621369 (owner: 10Dzahn) [18:07:52] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1098.eqiad.wmnet'] ` and were **ALL** successful. 
[18:14:25] (03PS9) 10Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/623666 [18:15:38] (03CR) 10Dzahn: prometheus: replace remaining hiera() with lookup() (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/623666 (owner: 10Dzahn) [18:15:45] (03CR) 10jerkins-bot: [V: 04-1] prometheus: replace remaining hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/623666 (owner: 10Dzahn) [18:17:33] (03PS2) 10Ottomata: [WIP] canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) [18:19:29] (03PS3) 10Ottomata: [WIP] canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) [18:20:14] 10Operations, 10OTRS, 10serviceops, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10NoFWDaddress) Looks good. Thank you Akosiaris for all your work. It might be worth it to notify the various OTRS mailing lists of the downti...
[18:21:11] (03PS4) 10Ottomata: [WIP] canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) [18:21:49] (03PS10) 10Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/623666 [18:23:52] (03PS5) 10Ottomata: [WIP] canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) [18:25:52] (03PS6) 10Ottomata: [WIP] canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) [18:27:27] (03PS5) 10Dzahn: cache::base: replace hiera() with lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/623662 [18:27:48] (03CR) 10Dzahn: cache::base: replace hiera() with lookup(), add data types (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/623662 (owner: 10Dzahn) [18:28:36] (03CR) 10jerkins-bot: [V: 04-1] cache::base: replace hiera() with lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/623662 (owner: 10Dzahn) [18:33:32] (03PS6) 10Dzahn: cache::base: replace hiera() with lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/623662 [18:35:09] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1097.eqiad.wmnet'] ` and were **ALL** successful. [18:35:30] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1100.eqiad.wmnet'] ` and were **ALL** successful. 
[18:35:43] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1099.eqiad.wmnet'] ` and were **ALL** successful. [18:38:33] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1101.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1101.eqiad.wmnet'] ` [18:40:08] (03PS7) 10Ottomata: Canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) [18:41:19] (03CR) 10jerkins-bot: [V: 04-1] Canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [18:41:48] (03PS5) 10Southparkfan: nagios-nrpe-server systemd unit: use /run for PID files [puppet] - 10https://gerrit.wikimedia.org/r/621967 (https://phabricator.wikimedia.org/T252990) [18:43:47] (03PS1) 10Southparkfan: os_version: remove wheezy, add bullseye [puppet] - 10https://gerrit.wikimedia.org/r/624217 [18:44:55] (03CR) 10Southparkfan: "Per your request from Iacdc52dded7e3f1838cd1579448a29cb66fdc8ab." 
[puppet] - 10https://gerrit.wikimedia.org/r/624217 (owner: 10Southparkfan) [18:45:13] (03CR) 10Dzahn: [C: 03+1] os_version: remove wheezy, add bullseye [puppet] - 10https://gerrit.wikimedia.org/r/624217 (owner: 10Southparkfan) [18:47:29] (03CR) 10Dzahn: nagios-nrpe-server systemd unit: use /run for PID files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621967 (https://phabricator.wikimedia.org/T252990) (owner: 10Southparkfan) [18:48:37] (03PS8) 10Ottomata: Canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) [18:49:59] (03PS9) 10Ottomata: Canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) [18:50:17] mutante: something went wrong in 621967 with restoring os_version to its original state :/ [18:50:23] sorry for that, trying to fix it [18:50:55] ack, i wasn't sure if it snuck in during rebase or was a lint fix, no worries at all [18:51:08] (03CR) 10jerkins-bot: [V: 04-1] Canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [18:53:51] (03PS10) 10Ottomata: Canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) [18:55:02] (03PS6) 10Southparkfan: nagios-nrpe-server systemd unit: use /run for PID files [puppet] - 10https://gerrit.wikimedia.org/r/621967 (https://phabricator.wikimedia.org/T252990) [18:55:06] (03CR) 10jerkins-bot: [V: 04-1] Canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [18:55:26] great, now that error is gone as well [18:55:27] (03PS11) 10Ottomata: Canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) [18:56:46] (03CR) 10jerkins-bot: [V: 04-1] Canary events refinery job [puppet] - 
10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [18:57:30] (03PS12) 10Ottomata: Canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) [18:57:39] !log milimetric@deploy1001 Started deploy [analytics/refinery@e4d5149]: Regular analytics weekly train [analytics/refinery@e4d5149] [18:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:55] (03CR) 10Dzahn: [C: 03+1] "looks right to me, but merging this probably needs extra care. would need the service to be restarted on every single server and be carefu" [puppet] - 10https://gerrit.wikimedia.org/r/621967 (https://phabricator.wikimedia.org/T252990) (owner: 10Southparkfan) [18:58:38] (03CR) 10jerkins-bot: [V: 04-1] Canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [19:01:23] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/624217 (owner: 10Southparkfan) [19:06:45] !log milimetric@deploy1001 Finished deploy [analytics/refinery@e4d5149]: Regular analytics weekly train [analytics/refinery@e4d5149] (duration: 09m 06s) [19:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:17] !log milimetric@deploy1001 Started deploy [analytics/refinery@e4d5149] (thin): Regular analytics weekly train THIN [analytics/refinery@e4d5149] [19:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:26] !log milimetric@deploy1001 Finished deploy [analytics/refinery@e4d5149] (thin): Regular analytics weekly train THIN [analytics/refinery@e4d5149] (duration: 00m 08s) [19:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:55] (03PS13) 10Ottomata: Canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) [19:15:20] (03CR) 10jerkins-bot: [V: 04-1] Canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [19:19:20] (03PS1) 10Ryan Kemper: cloudelastic: we do want to use conf_tool [puppet] - 10https://gerrit.wikimedia.org/r/624231 (https://phabricator.wikimedia.org/T261373) [19:19:43] (03CR) 10jerkins-bot: [V: 04-1] cloudelastic: we do want to use conf_tool [puppet] - 10https://gerrit.wikimedia.org/r/624231 (https://phabricator.wikimedia.org/T261373) (owner: 10Ryan Kemper) [19:22:11] (03PS14) 10Ottomata: Canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) [19:22:34] (03PS2) 10Ryan Kemper: cloudelastic: we do want to use conf_tool [puppet] - 10https://gerrit.wikimedia.org/r/624231 (https://phabricator.wikimedia.org/T261373) [19:23:02] (03CR) 10jerkins-bot: [V: 04-1] cloudelastic: we do want to use conf_tool [puppet] - 
10https://gerrit.wikimedia.org/r/624231 (https://phabricator.wikimedia.org/T261373) (owner: 10Ryan Kemper) [19:23:20] (03CR) 10jerkins-bot: [V: 04-1] Canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [19:23:41] (03PS3) 10Ryan Kemper: cloudelastic: we do want to use conf_tool [puppet] - 10https://gerrit.wikimedia.org/r/624231 (https://phabricator.wikimedia.org/T261373) [19:23:43] (03PS2) 10Ottomata: Update analytics snapshots data purge [puppet] - 10https://gerrit.wikimedia.org/r/623601 (https://phabricator.wikimedia.org/T237047) (owner: 10Joal) [19:25:12] (03CR) 10Ottomata: [C: 03+2] Update analytics snapshots data purge [puppet] - 10https://gerrit.wikimedia.org/r/623601 (https://phabricator.wikimedia.org/T237047) (owner: 10Joal) [19:25:21] (03PS15) 10Ottomata: Canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) [19:26:33] (03CR) 10jerkins-bot: [V: 04-1] Canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [19:27:03] (03PS16) 10Ottomata: Canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) [19:28:26] (03CR) 10jerkins-bot: [V: 04-1] Canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [19:28:37] (03PS4) 10Ryan Kemper: cloudelastic: we do want to use conf_tool [puppet] - 10https://gerrit.wikimedia.org/r/624231 (https://phabricator.wikimedia.org/T261373) [19:28:38] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 6679 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:29:59] (03CR) 10Ryan Kemper: "It's been several 
months, but you touched this area last so hopefully you've got some context here!" [puppet] - 10https://gerrit.wikimedia.org/r/624231 (https://phabricator.wikimedia.org/T261373) (owner: 10Ryan Kemper) [19:32:20] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5871 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:34:14] (03PS6) 10Dzahn: service.yaml: add releases as a service without LVS [puppet] - 10https://gerrit.wikimedia.org/r/623464 [19:36:48] (03PS1) 10Ryan Kemper: cloudelastic: remove temporarily increased timeout [puppet] - 10https://gerrit.wikimedia.org/r/624237 [19:37:46] (03PS2) 10Ryan Kemper: cloudelastic: remove temporarily increased timeout [puppet] - 10https://gerrit.wikimedia.org/r/624237 [19:37:56] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5823 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:38:38] (03PS3) 10Ryan Kemper: cloudelastic: remove temporarily increased timeout [puppet] - 10https://gerrit.wikimedia.org/r/624237 (https://phabricator.wikimedia.org/T230625) [19:46:29] (03PS4) 10Ryan Kemper: cloudelastic: remove temporarily increased timeout [puppet] - 10https://gerrit.wikimedia.org/r/624237 (https://phabricator.wikimedia.org/T220625) [19:47:14] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 79 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:48:21] (03PS5) 10Ryan Kemper: cloudelastic: remove temporarily increased timeout [puppet] - 10https://gerrit.wikimedia.org/r/624237 (https://phabricator.wikimedia.org/T220625) [19:50:11] (03PS6) 10Ryan Kemper: cloudelastic: remove 
temporarily increased timeout [puppet] - 10https://gerrit.wikimedia.org/r/624237 (https://phabricator.wikimedia.org/T220625) [19:51:45] (03PS17) 10Ottomata: Canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) [19:52:50] (03CR) 10jerkins-bot: [V: 04-1] Canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [19:53:31] (03PS18) 10Ottomata: Canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) [19:54:48] !log milimetric@deploy1001 Started deploy [analytics/aqs/deploy@95d6432]: AQS: Deploying new geoeditors endpoints [19:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:01] !log milimetric@deploy1001 deploy aborted: AQS: Deploying new geoeditors endpoints (duration: 00m 13s) [19:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:11] (sorry about that, something came up and I can't babysit it properly, I'll continue later) [19:56:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:59:47] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1101.eqiad.wmnet ` The log can be found in `/va... 
[20:04:07] (CR) CRusnov: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/624154 (https://phabricator.wikimedia.org/T244153) (owner: Volans)
[20:04:11] (PS7) Dzahn: service.yaml: add releases as a service without LVS [puppet] - https://gerrit.wikimedia.org/r/623464
[20:04:33] (CR) Dzahn: "> Patch Set 5: Code-Review-1" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/623464 (owner: Dzahn)
[20:11:34] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[20:13:26] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[20:17:02] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1003/24930/" [puppet] - https://gerrit.wikimedia.org/r/623079 (owner: Dzahn)
[20:18:08] (PS1) Cmjohnson: updating an-worker1101 dns to reflect correct rack location [dns] - https://gerrit.wikimedia.org/r/624258 (https://phabricator.wikimedia.org/T254892)
[20:18:16] (CR) jerkins-bot: [V: -1] updating an-worker1101 dns to reflect correct rack location [dns] - https://gerrit.wikimedia.org/r/624258 (https://phabricator.wikimedia.org/T254892) (owner: Cmjohnson)
[20:18:18] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:18:49] checking
[20:19:39] (PS2) Cmjohnson: updating an-worker1101 dns to reflect correct rack location [dns] - https://gerrit.wikimedia.org/r/624258 (https://phabricator.wikimedia.org/T254892)
[20:20:12] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:20:18] (CR) Cmjohnson: [C: +2] updating an-worker1101 dns to reflect correct rack location [dns] - https://gerrit.wikimedia.org/r/624258 (https://phabricator.wikimedia.org/T254892) (owner: Cmjohnson)
[20:22:07] Operations, ops-eqiad, DC-Ops, Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1101.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1101.eqiad.wmn...
[20:22:24] Operations, ops-eqiad, DC-Ops, Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1101.eqiad.wmnet ` The lo...
[20:23:48] (CR) Ottomata: "https://puppet-compiler.wmflabs.org/compiler1001/24924/an-launcher1002.eqiad.wmnet/change.an-launcher1002.eqiad.wmnet.pson" [puppet] - https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata)
[20:58:06] (PS5) Ahmon Dancy: Add support for dev realm [mediawiki-config] - https://gerrit.wikimedia.org/r/622193
[20:59:14] (CR) jerkins-bot: [V: -1] Add support for dev realm [mediawiki-config] - https://gerrit.wikimedia.org/r/622193 (owner: Ahmon Dancy)
[21:00:42] (PS6) Ahmon Dancy: Add support for dev realm [mediawiki-config] - https://gerrit.wikimedia.org/r/622193
[21:02:05] (CR) Thcipriani: Script to update image versions (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/619833 (https://phabricator.wikimedia.org/T255835) (owner: Jeena Huneidi)
[21:14:11] Operations, OTRS, serviceops, Patch-For-Review, User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (Johan) Included in https://meta.wikimedia.org/wiki/Tech/News/2020/37 going out on Monday.
[21:15:20] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime
[21:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:29] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[21:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:40] (PS1) Legoktm: libraryupgrader: Restart libup-web service every 24h [puppet] - https://gerrit.wikimedia.org/r/624269
[21:27:02] (PS1) Legoktm: codesearch: Update URL to wmcloud.org and fix name [puppet] - https://gerrit.wikimedia.org/r/624271
[21:29:40] (CR) Dzahn: [C: +2] codesearch: Update URL to wmcloud.org and fix name [puppet] - https://gerrit.wikimedia.org/r/624271 (owner: Legoktm)
[21:31:10] (CR) Dzahn: [C: +2] libraryupgrader: Restart libup-web service every 24h [puppet] - https://gerrit.wikimedia.org/r/624269 (owner: Legoktm)
[21:31:25] mutante: thank you :))
[21:31:47] np :)
[21:32:39] Operations, Traffic, User-ArielGlenn: Migrate install_server::web_server (apt*) to nginx-light - https://phabricator.wikimedia.org/T261962 (RKemper) p:Triage→Medium
[21:32:49] Operations, Cassandra: Move cassandra puppet code to profile::java - https://phabricator.wikimedia.org/T261966 (RKemper) p:Triage→Medium
[21:33:19] Operations, Traffic, User-ArielGlenn: Migrate install_server::web_server (apt*) to nginx-light - https://phabricator.wikimedia.org/T261962 (Dzahn) a:Dzahn
[21:33:35] Operations: docker-registry.wikimedia.org/golang:1.11 should no more depends on stretch-backports - https://phabricator.wikimedia.org/T261920 (RKemper) p:Triage→Medium
[21:36:39] legoktm: i just moved a VPS project to wmcloud as well, the old URL should just keep working and redirect automatically
[21:37:08] mutante: indeed :) I was just about to do the same
[21:37:55] ack, confirmed working for me to just delete proxy and recreate one under wmcloud
[21:39:00] Operations, ops-eqiad, DC-Ops, Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1101.eqiad.wmnet'] ` and were **ALL** successful.
[22:12:00] (CR) 20after4: "I think this is fine" [puppet] - https://gerrit.wikimedia.org/r/623080 (owner: Dzahn)
[22:16:53] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/624116 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[22:17:10] (PS3) Dzahn: phabricator: remove hiera() lookup from module [puppet] - https://gerrit.wikimedia.org/r/623080
[22:22:50] (CR) Dzahn: [C: +2] "noop https://puppet-compiler.wmflabs.org/compiler1002/24931/" [puppet] - https://gerrit.wikimedia.org/r/623080 (owner: Dzahn)
[22:24:47] mutante: I am curious after following the Puppet code changes: do you manually grep for them or is there an automated process you are following to detect and fix these? (mostly curious)
[22:25:50] sukhe: yea, i just grep a lot. not really automated
[22:26:36] grep -r "hiera(" * inside puppet/modules/ works fairly well. f.e.
[22:26:54] there shouldn't be any inside modules
[22:27:17] well.. if you subtract modules/profile/
[22:27:48] and the ones that are in profile parameters should all be replaced with lookup()
[22:27:55] I see. and so if I were to make a change to a module now and assuming they are using the older hiera lookup (and/or something else that was deprecated), the compiler would complain, right?
[22:28:32] jerkins-bot would complain if you add a new line with hiera()
[22:28:45] but not if you just touch a file that has existing ones
[22:29:07] ah! interesting
[22:29:13] also the style-check actually has running tally. so let's say you are fixing 4 style issues and add 3 in the same patch, that's a net positive and a +1
[22:29:23] or fixing 3 but adding 4 = -1
[22:29:40] I see; TIL
[22:31:14] regarding the puppet compiler. that is a warning "Warning: The function 'hiera' is deprecated.." so this just removes noise from the log
[22:31:48] but warnings doesn't mean it fails to compile
[22:32:32] yeah :)
[22:51:24] (PS1) Dzahn: lists: move hiera lookup out of module, add servername as proper parameter [puppet] - https://gerrit.wikimedia.org/r/624310
[23:00:04] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200903T2300).
[23:00:05] Huji: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:01:53] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/24932/lists1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/624310 (owner: Dzahn)
[23:02:57] (PS2) Dzahn: lists: move hiera lookup out of module, add servername as proper parameter [puppet] - https://gerrit.wikimedia.org/r/624310
[23:04:24] (PS1) Dzahn: lists: replace hardcoded server name with variable [puppet] - https://gerrit.wikimedia.org/r/624319
[23:08:16] Huji: I can deploy today
[23:09:31] (PS1) Dzahn: webperf: replace hiera() with lookup() in profiles [puppet] - https://gerrit.wikimedia.org/r/624322
[23:19:26] (PS1) Dzahn: dumps: remove the do_acme parameter and lookup [puppet] - https://gerrit.wikimedia.org/r/624328
[23:21:46] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[23:23:27] Urbanecm: hi again!
[23:23:27] huji: hey!
[23:23:32] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[23:23:54] (PS1) Dzahn: ntp::daemon: replace hiera() with lookup(), lint [puppet] - https://gerrit.wikimedia.org/r/624332
[23:23:56] ok, I am on mwdebug1001 ready to sign in
[23:24:20] (CR) Urbanecm: [C: +2] Start logging log-ins on select wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/599492 (https://phabricator.wikimedia.org/T253802) (owner: Huji)
[23:24:25] (PS9) Urbanecm: Start logging log-ins on select wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/599492 (https://phabricator.wikimedia.org/T253802) (owner: Huji)
[23:24:34] (CR) Urbanecm: [C: +2] Start logging log-ins on select wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/599492 (https://phabricator.wikimedia.org/T253802) (owner: Huji)
[23:24:53] cool huji. Do you have a bot account ready to test that too?
[23:25:05] I do. My own Bot account :)
[23:25:21] (Merged) jenkins-bot: Start logging log-ins on select wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/599492 (https://phabricator.wikimedia.org/T253802) (owner: Huji)
[23:25:25] huji: btw, it has to be mwdebug2001, we're switch-overed - 1xxx hosts are read only :)
[23:25:50] equally good
[23:26:11] pulled onto mwdebug2001
[23:26:46] logged in and out with test non-bot account
[23:27:31] logged in and out using bot account
[23:28:24] Confirming that login from non-bot account appears in CU results
[23:28:36] excellent!
[23:28:38] let's sync then
[23:28:45] Confirming that login from bot account does *not* appear in CU results
[23:28:58] even better
[23:29:03] Logged out of 2001
[23:30:16] syncing to the wild
[23:31:07] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 93947391e97be11a9cd7eb4713b274b05d5b371a: Start logging log-ins on select wikis (T253802) (duration: 00m 56s)
[23:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:31:15] T253802: Configure WMF wikis to log login attempts in CheckUser - https://phabricator.wikimedia.org/T253802
[23:31:51] huji: should be enabled in all fleet
[23:32:30] Thanks. Off to creating a task for monitoring DB growth
[23:32:48] cool! please subscribe me huji
[23:37:13] Urbanecm: Done. Thanks again! Signing off now.
[23:37:22] happy to help!
[23:40:56] (PS1) Dzahn: puppetmaster: (re)move hiera lookup for scripts to profiles [puppet] - https://gerrit.wikimedia.org/r/624335
[23:45:36] (PS1) Dzahn: puppetdb: (re)move hiera lookup for db pass to profile [puppet] - https://gerrit.wikimedia.org/r/624340
[23:50:10] (PS1) Dzahn: puppetmaster::backend: replace hiera with lookup, data types [puppet] - https://gerrit.wikimedia.org/r/624342
[23:51:37] (CR) jerkins-bot: [V: -1] puppetmaster::backend: replace hiera with lookup, data types [puppet] - https://gerrit.wikimedia.org/r/624342 (owner: Dzahn)
[23:52:36] (CR) Dave Pifke: [C: +1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/624322 (owner: Dzahn)
[23:53:00] (PS1) Dzahn: scap: replace hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/624343
[23:54:01] (CR) jerkins-bot: [V: -1] scap: replace hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/624343 (owner: Dzahn)
[23:59:27] (PS1) Dzahn: service::node: replace hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/624346