[01:15:04] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[01:17:00] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[04:30:02] (PS1) Marostegui: es2029,es2030: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/628632 (https://phabricator.wikimedia.org/T261717)
[04:30:35] (CR) Marostegui: [C: +2] es2029,es2030: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/628632 (https://phabricator.wikimedia.org/T261717) (owner: Marostegui)
[04:31:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es2029 and es2030 for the first time with minimal weight T261717', diff saved to https://phabricator.wikimedia.org/P12670 and previous config saved to /var/cache/conftool/dbconfig/20200921-043154-marostegui.json
[04:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:32:01] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[04:37:56] !log Set innodb_change_buffering = inserts; on db2116 for performance testing
[04:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es2029 and es2030 with more weight T261717', diff saved to https://phabricator.wikimedia.org/P12671 and previous config saved to /var/cache/conftool/dbconfig/20200921-045919-marostegui.json
[04:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:25] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[05:02:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es2015 as es2 codfw master T261717', diff saved to https://phabricator.wikimedia.org/P12672 and previous config saved to /var/cache/conftool/dbconfig/20200921-050228-marostegui.json
[05:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2013,es2016 and es2019 to clone new hosts T261717', diff saved to https://phabricator.wikimedia.org/P12673 and previous config saved to /var/cache/conftool/dbconfig/20200921-050305-marostegui.json
[05:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:02] !log Deploy MCR schema change on s8 eqiad master, lag will appear on s8 (wikidata) on labsdb hosts T238966
[05:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:08] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966
[05:06:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es2029 and es2030 with more weight T261717', diff saved to https://phabricator.wikimedia.org/P12674 and previous config saved to /var/cache/conftool/dbconfig/20200921-050632-marostegui.json
[05:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:37] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[05:07:43] (PS1) Marostegui: db1109: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/628634
[05:08:35] (CR) Marostegui: [C: +2] db1109: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/628634 (owner: Marostegui)
[05:15:59] (PS1) Marostegui: mariadb: Productionize es2031,es2033,es2034 [puppet] - https://gerrit.wikimedia.org/r/628635 (https://phabricator.wikimedia.org/T261717)
[05:17:36] (CR) Marostegui: [C: +2] mariadb: Productionize es2031,es2033,es2034 [puppet] - https://gerrit.wikimedia.org/r/628635 (https://phabricator.wikimedia.org/T261717) (owner: Marostegui)
[05:18:13] !log Stop mysql on: es2013 es2016 es2019 to clone es2032 es2033 es2034 - T261717
[05:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:18] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[05:27:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es2029 and es2030 with more weight T261717', diff saved to https://phabricator.wikimedia.org/P12675 and previous config saved to /var/cache/conftool/dbconfig/20200921-052704-marostegui.json
[05:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:10] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[05:47:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es2029 and es2030 with more weight T261717', diff saved to https://phabricator.wikimedia.org/P12676 and previous config saved to /var/cache/conftool/dbconfig/20200921-054730-marostegui.json
[05:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:36] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[05:48:14] !log Set innodb_change_buffering = inserts; on db2129 (s6 master) for performance testing
[05:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully pool es2029 and es2030 T261717', diff saved to https://phabricator.wikimedia.org/P12677 and previous config saved to /var/cache/conftool/dbconfig/20200921-060053-marostegui.json
[06:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:58] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[06:28:56] (PS1) Elukey: profile::hue: fix nagios alarms for Hue 4 [puppet] - https://gerrit.wikimedia.org/r/628653
[06:36:38] Operations, LDAP-Access-Requests: Access to archiva-deployers for Trey Jones - https://phabricator.wikimedia.org/T263386 (elukey)
[06:37:51] Operations, LDAP-Access-Requests: Access to archiva-deployers for Trey Jones - https://phabricator.wikimedia.org/T263386 (elukey) Open→Resolved a:elukey ` elukey@mwmaint1002:~$ ldapsearch -x -b ou=groups,dc=wikimedia,dc=org cn="archiva-deployers" | grep tjones member: uid=tjones,ou=people,dc...
[06:38:52] (CR) Elukey: [C: +2] profile::hue: fix nagios alarms for Hue 4 [puppet] - https://gerrit.wikimedia.org/r/628653 (owner: Elukey)
[06:46:47] (PS1) ArielGlenn: Revert "disable category rdf dumps for now" [puppet] - https://gerrit.wikimedia.org/r/628607
[06:47:59] (PS2) ArielGlenn: Revert "disable category rdf dumps for now" [puppet] - https://gerrit.wikimedia.org/r/628607
[06:50:08] (CR) ArielGlenn: [C: +2] Revert "disable category rdf dumps for now" [puppet] - https://gerrit.wikimedia.org/r/628607 (owner: ArielGlenn)
[06:51:20] (PS1) Elukey: Use the CAS REMOTE_USER http header for hue-next authentication [puppet] - https://gerrit.wikimedia.org/r/628741
[06:55:22] (PS2) Elukey: Use the CAS REMOTE_USER http header for hue-next authentication [puppet] - https://gerrit.wikimedia.org/r/628741
[06:56:21] (CR) jerkins-bot: [V: -1] Use the CAS REMOTE_USER http header for hue-next authentication [puppet] - https://gerrit.wikimedia.org/r/628741 (owner: Elukey)
[06:56:27] (CR) Muehlenhoff: [C: -1] "Don't bother, the entire Puppet module can be removed, I'll prepare a patch later." [puppet] - https://gerrit.wikimedia.org/r/628462 (owner: Dzahn)
[07:02:52] (PS3) Elukey: Use the CAS REMOTE_USER http header for hue-next authentication [puppet] - https://gerrit.wikimedia.org/r/628741
[07:03:38] Operations, netops: Consider balancing VRRP primaries to cr1/cr2 - https://phabricator.wikimedia.org/T263212 (ayounsi) a:ayounsi
[07:04:56] (PS4) Elukey: Use the CAS REMOTE_USER http header for hue-next authentication [puppet] - https://gerrit.wikimedia.org/r/628741
[07:05:37] !log upgrade FNM to 1.1.7 in ulsfo - T257035
[07:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:43] T257035: Upgrade Fastnetmon to 1.1.7 - https://phabricator.wikimedia.org/T257035
[07:08:10] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/25199/" [puppet] - https://gerrit.wikimedia.org/r/628741 (owner: Elukey)
[07:14:52] (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/628741 (owner: Elukey)
[07:14:53] (CR) Giuseppe Lavagetto: [C: +1] "LGTM https://puppet-compiler.wmflabs.org/compiler1003/25200/mw1331.eqiad.wmnet/fulldiff.html" [puppet] - https://gerrit.wikimedia.org/r/628243 (https://phabricator.wikimedia.org/T263073) (owner: Ryan Kemper)
[07:19:34] Operations, DNS, Traffic, Wikimedia-Apache-configuration, Patch-For-Review: Remove aliases `minnan` and `zh-cfr` for the Min Nan Wikipedia - https://phabricator.wikimedia.org/T230382 (Ladsgroup) I assume someone from langcom should take a look and approve this at least.
[07:22:49] (PS1) Muehlenhoff: Remove the Yubikey HSM classes [puppet] - https://gerrit.wikimedia.org/r/628743
[07:43:43] Operations, netops: Upgrade Fastnetmon to 1.1.7 - https://phabricator.wikimedia.org/T257035 (ayounsi) Thanks @MoritzMuehlenhoff I installed in on netflow4001 and it is working fine. Surprisingly though one new CLI tool `fastnetmon_api_client` is missing from the DEB. Was there any issues during the buil...
[07:48:09] (PS4) JMeybohm: lvs: Remove termbox non-TLS endpoint from LVS 1/3 [puppet] - https://gerrit.wikimedia.org/r/627298 (https://phabricator.wikimedia.org/T254581)
[07:48:20] (PS2) JMeybohm: lvs: Remove cxserver non-TLS endpoint from LVS 1/3 [puppet] - https://gerrit.wikimedia.org/r/627432 (https://phabricator.wikimedia.org/T255879)
[07:48:53] (CR) JMeybohm: [C: +2] lvs: Remove termbox non-TLS endpoint from LVS 1/3 [puppet] - https://gerrit.wikimedia.org/r/627298 (https://phabricator.wikimedia.org/T254581) (owner: JMeybohm)
[07:48:58] (CR) JMeybohm: [C: +2] lvs: Remove cxserver non-TLS endpoint from LVS 1/3 [puppet] - https://gerrit.wikimedia.org/r/627432 (https://phabricator.wikimedia.org/T255879) (owner: JMeybohm)
[07:53:33] !log Upgrading all CI Jenkins jobs to Quibble 0.0.45
[07:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:57] (PS3) JMeybohm: lvs: Remove cxserver non-TLS endpoint from LVS 1/3 [puppet] - https://gerrit.wikimedia.org/r/627432 (https://phabricator.wikimedia.org/T255879)
[08:01:43] (CR) Filippo Giunchedi: [C: +2] hieradata: bump swift object replicator concurrency [puppet] - https://gerrit.wikimedia.org/r/628095 (https://phabricator.wikimedia.org/T261633) (owner: Filippo Giunchedi)
[08:03:41] (CR) Giuseppe Lavagetto: [C: -1] "The code is overall sound, and I like its minimalism. It's just a bit hard to follow in the current form for a first-time reader, so I sug" (8 comments) [puppet] - https://gerrit.wikimedia.org/r/627841 (owner: Kormat)
[08:09:14] (PS5) Muehlenhoff: Use the CAS REMOTE_USER http header for hue-next authentication [puppet] - https://gerrit.wikimedia.org/r/628741 (owner: Elukey)
[08:10:13] (PS6) Muehlenhoff: Use the HTTP_X_CAS_UID http header for hue-next authentication [puppet] - https://gerrit.wikimedia.org/r/628741 (owner: Elukey)
[08:13:20] (PS1) Elukey: role::analytics_test_cluster::coordinator: fix oozie shlib path [puppet] - https://gerrit.wikimedia.org/r/628752
[08:15:29] (CR) Elukey: [C: +2] role::analytics_test_cluster::coordinator: fix oozie shlib path [puppet] - https://gerrit.wikimedia.org/r/628752 (owner: Elukey)
[08:15:49] !log roll-restart swift-object-replicator in codfw and eqiad for increased concurrency
[08:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:39] (CR) Muehlenhoff: "Updated PCC: https://puppet-compiler.wmflabs.org/compiler1001/25201/" [puppet] - https://gerrit.wikimedia.org/r/628741 (owner: Elukey)
[08:18:34] !log kormat@cumin1001 dbctl commit (dc=all): 'db2125 (re)pooling @ 25%: reimage+reclone done T263244', diff saved to https://phabricator.wikimedia.org/P12678 and previous config saved to /var/cache/conftool/dbconfig/20200921-081833-kormat.json
[08:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:39] T263244: Reimage and reclone db2125 - https://phabricator.wikimedia.org/T263244
[08:21:19] !log swift codfw-prod: bump weight for ms-be2057 - T261633
[08:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:23] T261633: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633
[08:25:10] (CR) Elukey: [C: +2] Use the HTTP_X_CAS_UID http header for hue-next authentication [puppet] - https://gerrit.wikimedia.org/r/628741 (owner: Elukey)
[08:29:24] (CR) Giuseppe Lavagetto: [C: +1] lvs: Remove eventgate-analytics non-TLS endpoint from LVS 1/2 [puppet] - https://gerrit.wikimedia.org/r/627529 (https://phabricator.wikimedia.org/T255870) (owner: JMeybohm)
[08:29:46] (CR) Giuseppe Lavagetto: [C: +1] lvs: Remove eventgate-analytics non-TLS endpoint from LVS 2/2 [puppet] - https://gerrit.wikimedia.org/r/627530 (https://phabricator.wikimedia.org/T255870) (owner: JMeybohm)
[08:31:46] (CR) Giuseppe Lavagetto: [C: +1] Retire stub firejail code in service::uwsgi [puppet] - https://gerrit.wikimedia.org/r/622350 (owner: Muehlenhoff)
[08:33:37] !log kormat@cumin1001 dbctl commit (dc=all): 'db2125 (re)pooling @ 50%: reimage+reclone done T263244', diff saved to https://phabricator.wikimedia.org/P12679 and previous config saved to /var/cache/conftool/dbconfig/20200921-083337-kormat.json
[08:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:43] T263244: Reimage and reclone db2125 - https://phabricator.wikimedia.org/T263244
[08:34:53] Operations, observability: Missing 'notify' for some Icinga configuration files - https://phabricator.wikimedia.org/T263027 (JMeybohm) Resolved→Open I've just merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/627298 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/627432 and did...
[08:43:20] (PS1) Marostegui: Revert "db2125: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/628611
[08:44:56] (PS1) Giuseppe Lavagetto: wikifeeds: use the service proxy for reaching the MediaWiki api [deployment-charts] - https://gerrit.wikimedia.org/r/628756 (https://phabricator.wikimedia.org/T255878)
[08:47:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2127 T262247', diff saved to https://phabricator.wikimedia.org/P12680 and previous config saved to /var/cache/conftool/dbconfig/20200921-084730-marostegui.json
[08:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:36] T262247: db2127 memory errors - https://phabricator.wikimedia.org/T262247
[08:47:59] !log Stop MySQL on db2127 for on-site maintenance - T262247
[08:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:40] !log kormat@cumin1001 dbctl commit (dc=all): 'db2125 (re)pooling @ 75%: reimage+reclone done T263244', diff saved to https://phabricator.wikimedia.org/P12681 and previous config saved to /var/cache/conftool/dbconfig/20200921-084840-kormat.json
[08:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:45] T263244: Reimage and reclone db2125 - https://phabricator.wikimedia.org/T263244
[08:49:56] Operations, ops-codfw, DBA: db2127 memory errors - https://phabricator.wikimedia.org/T262247 (Marostegui) @Papaul db2127's is now off, you can proceed whenever you want with the upgrades
[08:50:38] (CR) Kormat: [C: +2] Revert "db2125: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/628611 (owner: Marostegui)
[09:00:05] (PS7) Rosalie Perside (WMDE): Remove $wgExtraLanguageNames from Wikidata and Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/620050 (https://phabricator.wikimedia.org/T260118) (owner: Guergana Tzatchkova)
[09:02:02] (PS1) KartikMistry: Exclude testwikis and private wikis from CX draft purge script run [puppet] - https://gerrit.wikimedia.org/r/628758 (https://phabricator.wikimedia.org/T261189)
[09:03:44] !log kormat@cumin1001 dbctl commit (dc=all): 'db2125 (re)pooling @ 100%: reimage+reclone done T263244', diff saved to https://phabricator.wikimedia.org/P12682 and previous config saved to /var/cache/conftool/dbconfig/20200921-090343-kormat.json
[09:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:50] T263244: Reimage and reclone db2125 - https://phabricator.wikimedia.org/T263244
[09:05:58] (PS2) KartikMistry: Exclude testwikis and private wikis from CX draft purge script run [puppet] - https://gerrit.wikimedia.org/r/628758 (https://phabricator.wikimedia.org/T263417)
[09:15:52] (PS8) Hashar: Explicitly mentions the repository in scap::sources [puppet] - https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413)
[09:16:11] (CR) Hashar: "Rebased after Ie71f823137a2580a5f0bd7cc8d88f43dbe516169" [puppet] - https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) (owner: Hashar)
[09:19:50] (PS6) Hashar: scap::sources stop assuming mediawiki/services as a prefix [puppet] - https://gerrit.wikimedia.org/r/610267 (https://phabricator.wikimedia.org/T257413)
[09:20:03] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) (owner: Hashar)
[09:20:14] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/610267 (https://phabricator.wikimedia.org/T257413) (owner: Hashar)
[09:22:49] (CR) Muehlenhoff: [C: +2] Add grafana-rw to cache config [puppet] - https://gerrit.wikimedia.org/r/627772 (https://phabricator.wikimedia.org/T262512) (owner: Muehlenhoff)
[09:25:00] Operations, Citoid, Wikimedia-Logstash, observability, and 2 others: Move citoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219919 (fgiunchedi) It looks like citoid is now on k8s but still using gelf for logging, possibly the easiest at this point is switching to stdout...
[09:29:14] Operations, Maps, Wikimedia-Logstash, observability, and 3 others: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 (fgiunchedi) Hi all, it looks like we've moved to syslog logging in https://gerrit.wikimedia.org/r/c/maps/kartotherian/deploy/...
[09:34:30] Operations, Wikifeeds, Wikimedia-Logstash, observability: Move wikifeeds to the logging pipeline - https://phabricator.wikimedia.org/T245604 (fgiunchedi) Open→Resolved a:fgiunchedi wikifeeds is logging using k8s -> stdout -> logging pipeline: ` "hostname": "wikifeeds-production-5...
[09:34:32] Operations, Wikimedia-Logstash, observability, Patch-For-Review, User-fgiunchedi: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (fgiunchedi)
[09:35:24] Operations, Wikimedia-Logstash, observability, Patch-For-Review, User-fgiunchedi: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (fgiunchedi)
[09:35:47] Operations, Wikifeeds, Wikimedia-Logstash, observability: Move wikifeeds to the logging pipeline - https://phabricator.wikimedia.org/T245604 (fgiunchedi) Resolved→Open Correction: wikifeeds is using stdout and gelf, the latter can be removed
[09:37:25] Operations, Wikifeeds, Wikimedia-Logstash, observability: Move wikifeeds to the logging pipeline - https://phabricator.wikimedia.org/T245604 (fgiunchedi) a:fgiunchedi→None
[09:44:25] (PS2) Muehlenhoff: urldownloader: convert A record to CNAME [dns] - https://gerrit.wikimedia.org/r/628102 (https://phabricator.wikimedia.org/T244153) (owner: Volans)
[09:45:56] Operations, observability: librenms page didn't auto-resolve in VO - https://phabricator.wikimedia.org/T263423 (fgiunchedi)
[09:48:04] (PS4) Kormat: bsection: Script for binary-searching log files. [puppet] - https://gerrit.wikimedia.org/r/627841
[09:48:10] (CR) Kormat: bsection: Script for binary-searching log files. (7 comments) [puppet] - https://gerrit.wikimedia.org/r/627841 (owner: Kormat)
[09:48:30] (CR) Muehlenhoff: [C: +2] urldownloader: convert A record to CNAME [dns] - https://gerrit.wikimedia.org/r/628102 (https://phabricator.wikimedia.org/T244153) (owner: Volans)
[09:50:20] (PS5) Kormat: bsection: Script for binary-searching log files. [puppet] - https://gerrit.wikimedia.org/r/627841
[09:50:34] (CR) Kormat: bsection: Script for binary-searching log files. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/627841 (owner: Kormat)
[09:56:42] (PS1) Muehlenhoff: urldownloader: convert A record to CNAME [dns] - https://gerrit.wikimedia.org/r/628763 (https://phabricator.wikimedia.org/T244153)
[10:03:49] (CR) Giuseppe Lavagetto: bsection: Script for binary-searching log files. (3 comments) [puppet] - https://gerrit.wikimedia.org/r/627841 (owner: Kormat)
[10:17:01] (PS5) Effie Mouzeli: Add entries for push-notifications [dns] - https://gerrit.wikimedia.org/r/623541 (https://phabricator.wikimedia.org/T256973)
[10:22:12] Operations, Maps, Wikimedia-Logstash, observability, and 3 others: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 (MSantos) >>! In T222377#6478651, @fgiunchedi wrote: > Hi all, it looks like we've moved to syslog logging in https://gerrit.w...
[10:22:38] (CR) Effie Mouzeli: [C: +2] Add entries for push-notifications [dns] - https://gerrit.wikimedia.org/r/623541 (https://phabricator.wikimedia.org/T256973) (owner: Effie Mouzeli)
[10:23:15] (PS3) Jbond: nginx: add data types [puppet] - https://gerrit.wikimedia.org/r/624357 (owner: Dzahn)
[10:23:58] (PS4) Jbond: nginx: add data types [puppet] - https://gerrit.wikimedia.org/r/624357 (owner: Dzahn)
[10:25:20] (CR) Jbond: nginx: add data types (2 comments) [puppet] - https://gerrit.wikimedia.org/r/624357 (owner: Dzahn)
[10:26:15] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/628766 (https://phabricator.wikimedia.org/T128546)
[10:30:04] jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200921T1030).
[10:30:17] (PS5) Jbond: nginx: add data types [puppet] - https://gerrit.wikimedia.org/r/624357 (owner: Dzahn)
[10:30:54] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/628766 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:31:54] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/628766 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:34:13] (PS10) JMeybohm: lvs::configuration: add push-notifications patch 1/4 [puppet] - https://gerrit.wikimedia.org/r/623631 (https://phabricator.wikimedia.org/T256973) (owner: Effie Mouzeli)
[10:34:15] (PS8) JMeybohm: lvs::configuration: add push-notifications patch 2/4 [puppet] - https://gerrit.wikimedia.org/r/623632 (https://phabricator.wikimedia.org/T256973) (owner: Effie Mouzeli)
[10:34:17] (PS2) JMeybohm: lvs::configuration: add push-notifications patch 3/4 [puppet] - https://gerrit.wikimedia.org/r/624014 (https://phabricator.wikimedia.org/T256973) (owner: Effie Mouzeli)
[10:34:19] (PS4) JMeybohm: lvs::configuration: add push-notifications patch 4/4 [puppet] - https://gerrit.wikimedia.org/r/623773 (https://phabricator.wikimedia.org/T256973) (owner: Effie Mouzeli)
[10:35:06] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:628766| Bumping portals to master (T128546)]] (duration: 01m 12s)
[10:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:13] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[10:35:54] (CR) JMeybohm: [C: +1] "Rebased" [puppet] - https://gerrit.wikimedia.org/r/623631 (https://phabricator.wikimedia.org/T256973) (owner: Effie Mouzeli)
[10:36:05] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:628766| Bumping portals to master (T128546)]] (duration: 00m 57s)
[10:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:27] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:45:50] (PS8) Jbond: labstore: add data types and some other style fixes [puppet] - https://gerrit.wikimedia.org/r/622666 (owner: Dzahn)
[10:46:00] (CR) Jbond: "see inline" (6 comments) [puppet] - https://gerrit.wikimedia.org/r/622666 (owner: Dzahn)
[10:46:07] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/622666 (owner: Dzahn)
[10:46:23] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:47:06] (CR) Jbond: "LGTM will merge" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/624357 (owner: Dzahn)
[10:47:12] (CR) Jbond: [C: +2] nginx: add data types [puppet] - https://gerrit.wikimedia.org/r/624357 (owner: Dzahn)
[10:48:57] (CR) Effie Mouzeli: [C: +2] lvs::configuration: add push-notifications patch 1/4 [puppet] - https://gerrit.wikimedia.org/r/623631 (https://phabricator.wikimedia.org/T256973) (owner: Effie Mouzeli)
[10:54:23] (CR) Effie Mouzeli: [C: +2] lvs::configuration: add push-notifications patch 2/4 [puppet] - https://gerrit.wikimedia.org/r/623632 (https://phabricator.wikimedia.org/T256973) (owner: Effie Mouzeli)
[10:55:12] (PS9) Effie Mouzeli: lvs::configuration: add push-notifications patch 2/4 [puppet] - https://gerrit.wikimedia.org/r/623632 (https://phabricator.wikimedia.org/T256973)
[10:57:50] (CR) Jbond: [C: -1] "minor error, see inline" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/626693 (https://phabricator.wikimedia.org/T156562) (owner: Muehlenhoff)
[10:58:48] Some LVS alerts might firre
[10:58:50] fire*
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200921T1100).
[11:00:04] Kizule and edsanders: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:18] I take Kizule's patch, per his request
[11:00:35] edsanders: hi!
[11:00:37] (CR) Urbanecm: [C: +2] Add archive.wul.waseda.ac.jp to the wgCopyUploadDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/628555 (https://phabricator.wikimedia.org/T261037) (owner: Zoranzoki21)
[11:01:07] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 238, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:01:26] (CR) Jbond: "change looks good, should we also move the reference to the yubiauth server from modules/role/templates/bastionhost/pam-sshd.erb ?" [puppet] - https://gerrit.wikimedia.org/r/628743 (owner: Muehlenhoff)
[11:01:29] present
[11:01:33] XioNoX: FYI (icinga)^^^
[11:01:39] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.56:4890]) https://wikitech.wikimedia.org/wiki/PyBal
[11:01:51] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:02:06] volans: I guess the eqiad-eqord link ^ :)
[11:02:32] Planned Work PWIC115905 from Telia
[11:02:34] !log restart pybal on lvs2010 and lvs1016 - T256973
[11:02:35] edsanders: cool!
[11:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:40] T256973: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973
[11:02:48] (CR) Urbanecm: [C: +2] Simplify lead paragraph check [extensions/MobileFrontend] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/628626 (owner: Esanders)
[11:02:53] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.56:4890]) https://wikitech.wikimedia.org/wiki/PyBal
[11:03:45] (PS2) Urbanecm: Add *.70yearsindonesiaaustralia.com to the wgCopyUploadsDomains allowlist of commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/628581 (https://phabricator.wikimedia.org/T262238) (owner: Evrifaessa)
[11:03:49] (CR) Urbanecm: [C: +2] Add *.70yearsindonesiaaustralia.com to the wgCopyUploadsDomains allowlist of commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/628581 (https://phabricator.wikimedia.org/T262238) (owner: Evrifaessa)
[11:03:56] (PS3) Urbanecm: Add archive.wul.waseda.ac.jp to the wgCopyUploadDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/628555 (https://phabricator.wikimedia.org/T261037) (owner: Zoranzoki21)
[11:04:06] (CR) Urbanecm: [C: +2] Add archive.wul.waseda.ac.jp to the wgCopyUploadDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/628555 (https://phabricator.wikimedia.org/T261037) (owner: Zoranzoki21)
[11:04:13] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 70 connections established with conf1004.eqiad.wmnet:4001 (min=71) https://wikitech.wikimedia.org/wiki/PyBal
[11:04:36] (Merged) jenkins-bot: Add *.70yearsindonesiaaustralia.com to the wgCopyUploadsDomains allowlist of commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/628581 (https://phabricator.wikimedia.org/T262238) (owner: Evrifaessa)
[11:04:54] Operations, SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (jbond) >>! In T254939#6471243, @BGerdemann wrote: > @jbond , a reminder that Andrew's contract is up. Thank...
[11:05:05] (Merged) jenkins-bot: Add archive.wul.waseda.ac.jp to the wgCopyUploadDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/628555 (https://phabricator.wikimedia.org/T261037) (owner: Zoranzoki21)
[11:05:35] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[11:06:18] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: bd51f47b1f60fbfafdcc623ae22dcadf2c927876: Add *.70yearsindonesiaaustralia.com to the wgCopyUploadsDomains allowlist of commonswiki (T262238) (duration: 00m 57s)
[11:06:19] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.56:4890]) https://wikitech.wikimedia.org/wiki/PyBal
[11:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:23] T262238: Add www.70yearsindonesiaaustralia.com to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T262238
[11:06:45] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 60 connections established with conf2001.codfw.wmnet:2379 (min=61) https://wikitech.wikimedia.org/wiki/PyBal
[11:07:41] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 01ba82866f3e04c7c635e9089fed4269190b93f0: Add archive.wul.waseda.ac.jp to the wgCopyUploadDomains (T261037) (duration: 00m 57s)
[11:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:45] T261037: Add archive.wul.waseda.ac.jp to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T261037
[11:08:01] (PS2) Urbanecm: Set 'WT' namespace alias to NS_PROJECT in shn.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/628590 (https://phabricator.wikimedia.org/T256348) (owner: Evrifaessa)
[11:08:05] (CR) Urbanecm: [C: +2] Set 'WT' namespace alias to NS_PROJECT in shn.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/628590 (https://phabricator.wikimedia.org/T256348) (owner: Evrifaessa)
[11:08:37] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - push-notifications_4890: Servers kubernetes2004.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2008.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:08:48] (Merged) jenkins-bot: Set 'WT' namespace alias to NS_PROJECT in shn.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/628590 (https://phabricator.wikimedia.org/T256348) (owner: Evrifaessa)
[11:09:00] (CR) Muehlenhoff: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/628743 (owner: Muehlenhoff)
[11:09:37] ^ LVS alerts are known
[11:09:44] (PS1) Jbond: admin: remove razzi from analytics_admins_members as they are in ops [puppet] - https://gerrit.wikimedia.org/r/628768
[11:10:05] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - push-notifications_4890: Servers kubernetes1014.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:11:10] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 1cf4664df87f10bf60b47345dfe3c52d7dc24f6c: Set WT namespace alias to NS_PROJECT in shn.wiktionary (T256348) (duration: 00m 57s)
[11:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:17] T256348: set 'WT' namespace alias to NS_PROJECT in shn.wiktionary - https://phabricator.wikimedia.org/T256348 [11:11:53] 10Operations: Allow easier ICU transitions in MediaWiki - https://phabricator.wikimedia.org/T263437 (10Aklapper) [11:12:06] !log [urbanecm@mwmaint2001 ~]$ mwscript namespaceDupes.php --wiki=shnwiktionary --fix # T256348 # P12683 [11:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:28] (03PS2) 10Muehlenhoff: Remove the Yubikey HSM classes [puppet] - 10https://gerrit.wikimedia.org/r/628743 [11:13:06] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.56:4890]) JMeybohm Bug in LVS service definition, fixing. Bug: T256973 https://wikitech.wikimedia.org/wiki/PyBal [11:13:06] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 70 connections established with conf1004.eqiad.wmnet:4001 (min=71) JMeybohm Bug in LVS service definition, fixing. Bug: T256973 https://wikitech.wikimedia.org/wiki/PyBal [11:13:06] ACKNOWLEDGEMENT - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - push-notifications_4890: Servers kubernetes1014.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1005.eqiad.wmnet are marked down but pooled JMeybohm Bug in LVS service definition, fixing. Bug: [11:13:06] wikitech.wikimedia.org/wiki/PyBal [11:13:06] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.56:4890]) JMeybohm Bug in LVS service definition, fixing. 
Bug: T256973 https://wikitech.wikimedia.org/wiki/PyBal [11:13:06] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 60 connections established with conf2001.codfw.wmnet:2379 (min=61) JMeybohm Bug in LVS service definition, fixing. Bug: T256973 https://wikitech.wikimedia.org/wiki/PyBal [11:13:06] ACKNOWLEDGEMENT - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - push-notifications_4890: Servers kubernetes2004.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2008.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled JMeybohm Bug in LVS service definition, fixing. Bug: [11:13:07] wikitech.wikimedia.org/wiki/PyBal [11:13:22] <_joe_> choose not to send notifications with acknowledgements [11:13:35] sorry for the noise [11:13:54] _joe_: forgot about that. Sorry [11:13:59] <_joe_> np :) [11:15:14] (03PS2) 10Urbanecm: Create Portal and Portal_talk namespaces on trwikisource, and fix an incorrect alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628598 (https://phabricator.wikimedia.org/T263358) (owner: 10Evrifaessa) [11:15:32] (03PS3) 10Urbanecm: Set timezone for wikis of the CWIRP to Europe/Rome [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628515 (https://phabricator.wikimedia.org/T263123) (owner: 10Evrifaessa) [11:15:34] (03PS2) 10Urbanecm: Removing Wikipedia store link from enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628521 (https://phabricator.wikimedia.org/T262329) (owner: 10Evrifaessa) [11:16:18] (03PS2) 10Urbanecm: Allow local steward group members to bigdelete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628522 [11:16:23] (03CR) 10Urbanecm: [C: 03+2] Allow local steward group members to bigdelete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628522 (owner: 10Urbanecm) [11:16:25] (03PS1) 10Effie Mouzeli: hiera:service Fix 
push-notification service port [puppet] - 10https://gerrit.wikimedia.org/r/628769 [11:17:07] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/628743 (owner: 10Muehlenhoff) [11:17:20] (03Merged) 10jenkins-bot: Allow local steward group members to bigdelete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628522 (owner: 10Urbanecm) [11:18:42] (03CR) 10JMeybohm: [C: 03+1] hiera:service Fix push-notification service port [puppet] - 10https://gerrit.wikimedia.org/r/628769 (owner: 10Effie Mouzeli) [11:19:04] (03CR) 10Effie Mouzeli: [C: 03+2] hiera:service Fix push-notification service port [puppet] - 10https://gerrit.wikimedia.org/r/628769 (owner: 10Effie Mouzeli) [11:20:32] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: a62212a5a8f4692b860eb3d6c3322c82d88125a9: Allow local steward group members to bigdelete (duration: 00m 57s) [11:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:36] ^ more LVS alerts coming [11:20:42] <3 [11:22:21] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "As far as I can tell, this is good to deploy at any time." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620050 (https://phabricator.wikimedia.org/T260118) (owner: 10Guergana Tzatchkova) [11:22:29] !log restart pybal on lvs2010 and lvs1016 - T256973 [11:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:34] T256973: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 [11:23:02] (03Merged) 10jenkins-bot: Simplify lead paragraph check [extensions/MobileFrontend] (wmf/1.36.0-wmf.9) - 10https://gerrit.wikimedia.org/r/628626 (owner: 10Esanders) [11:25:02] edsanders: your patch is available at mwdebug2001, can you test, please? 
[11:25:18] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:25:52] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 240, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:27:58] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:28:24] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.56:4890]) https://wikitech.wikimedia.org/wiki/PyBal [11:28:34] edsanders: ping? 🙂 [11:28:42] looking [11:28:50] thanks [11:30:36] Looks good to me [11:30:54] thanks, syncing [11:31:18] PROBLEM - Host db2125 is DOWN: PING CRITICAL - Packet loss = 100% [11:32:28] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.9/extensions/MobileFrontend/includes/Transforms/MoveLeadParagraphTransform.php: 3fab5882505809b412cff641a17ae5d973db04a4: Simplify lead paragraph check (duration: 00m 59s) [11:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:33] edsanders: should be synced [11:33:48] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:34:42] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:35:37] !log EU B&C done [11:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:50] RECOVERY - Host db2125 is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms [11:36:36] kormat: ^ [11:36:43] looks like our friend db2125 just crashed [11:36:49] going to downtime it before it pages [11:37:08] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'Depool db2125 - crashed', diff saved to https://phabricator.wikimedia.org/P12684 and previous config saved to /var/cache/conftool/dbconfig/20200921-113708-marostegui.json [11:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:40] !log restart pybal on lvs2009 and lvs1015 - T256973 [11:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:46] T256973: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 [11:39:34] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:42:42] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 61 connections established with conf2001.codfw.wmnet:2379 (min=61) https://wikitech.wikimedia.org/wiki/PyBal [11:42:44] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:42:44] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 71 connections established with conf1004.eqiad.wmnet:4001 (min=71) https://wikitech.wikimedia.org/wiki/PyBal [11:45:24] marostegui: ack, thanks [11:47:17] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) This host crashed again: ` -------------------------------------------------------------------------------- SeqNumber = 890... 
[11:47:20] (03PS1) 10Hnowlan: api-gateway: Fall through to the appservers if a route isn't matched [deployment-charts] - 10https://gerrit.wikimedia.org/r/628772 (https://phabricator.wikimedia.org/T263045) [11:47:25] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) 05Resolved→03Open [11:48:47] (03PS1) 10Lucas Werkmeister (WMDE): Remove unused $wgExtraLanguageNames['qqq'] assignment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628773 (https://phabricator.wikimedia.org/T263441) [11:48:49] (03PS1) 10Lucas Werkmeister (WMDE): Stop using $wmgExtraLanguageNames in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628774 (https://phabricator.wikimedia.org/T263441) [11:48:51] (03PS1) 10Lucas Werkmeister (WMDE): Remove $wmgExtraLanguageNames from InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628775 (https://phabricator.wikimedia.org/T263441) [11:48:53] (03PS1) 10Lucas Werkmeister (WMDE): Remove $wmgExtraLanguageNames from InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628776 [11:49:21] 10Operations, 10Cognate, 10ContentTranslation, 10DBA, and 8 others: Restart extension1 (x1) database primary master (db1120) - https://phabricator.wikimedia.org/T250701 (10Elitre) [11:49:49] 10Operations: Allow easier ICU transitions in MediaWiki - https://phabricator.wikimedia.org/T263437 (10MoritzMuehlenhoff) > One simple approach that would reduce the disservice to users would be to just add additional colums to the categoryLinks table, and precompute the new values before we perform the switch,... 
[11:49:50] (03PS1) 10Marostegui: db2125: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/628777 (https://phabricator.wikimedia.org/T260670) [11:50:16] (03CR) 10Marostegui: [C: 03+2] db2125: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/628777 (https://phabricator.wikimedia.org/T260670) (owner: 10Marostegui) [11:50:20] (03CR) 10Kormat: [C: 03+1] db2125: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/628777 (https://phabricator.wikimedia.org/T260670) (owner: 10Marostegui) [11:51:22] (03PS2) 10Lucas Werkmeister (WMDE): Remove $wmgExtraLanguageNames from InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628776 (https://phabricator.wikimedia.org/T263441) [11:53:14] (03CR) 10Muehlenhoff: Manage /etc/apt/sources.list via Puppet (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/626693 (https://phabricator.wikimedia.org/T156562) (owner: 10Muehlenhoff) [12:00:20] (03PS1) 10KartikMistry: Remove test2wiki, as ContentTranslation is not enabled there [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628780 [12:02:27] (03PS1) 10Jbond: profile::base::puppet: remove export_p12 function [puppet] - 10https://gerrit.wikimedia.org/r/628781 (https://phabricator.wikimedia.org/T253957) [12:02:29] (03PS1) 10Jbond: sslcert::x509_to_p12: add owner/group parameters to manage the p12 file [puppet] - 10https://gerrit.wikimedia.org/r/628782 [12:02:31] (03PS1) 10Jbond: base::expose_puppet_certs: update p12 interface [puppet] - 10https://gerrit.wikimedia.org/r/628783 (https://phabricator.wikimedia.org/T253957) [12:03:40] (03CR) 10jerkins-bot: [V: 04-1] profile::base::puppet: remove export_p12 function [puppet] - 10https://gerrit.wikimedia.org/r/628781 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [12:03:50] (03CR) 10jerkins-bot: [V: 04-1] sslcert::x509_to_p12: add owner/group parameters to manage the p12 file [puppet] - 10https://gerrit.wikimedia.org/r/628782 (owner: 10Jbond) [12:03:53] 
(03CR) 10jerkins-bot: [V: 04-1] base::expose_puppet_certs: update p12 interface [puppet] - 10https://gerrit.wikimedia.org/r/628783 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [12:04:46] (03CR) 10Muehlenhoff: "These can also go away, see the /usr/local/bin/cross-validate-accounts accountcheck mail:" [puppet] - 10https://gerrit.wikimedia.org/r/628768 (owner: 10Jbond) [12:06:14] (03PS1) 10Effie Mouzeli: hieradata:worker.yaml fix typo in push-notifications [puppet] - 10https://gerrit.wikimedia.org/r/628786 [12:09:37] (03PS1) 10Jbond: profile::hadoop::common: migrate to base_exspose_puppet_cert [puppet] - 10https://gerrit.wikimedia.org/r/628787 (https://phabricator.wikimedia.org/T253957) [12:10:49] Creating new table(s) on testwiki is handled during Config backport window, right? [12:10:54] (03CR) 10jerkins-bot: [V: 04-1] profile::hadoop::common: migrate to base_exspose_puppet_cert [puppet] - 10https://gerrit.wikimedia.org/r/628787 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [12:11:14] (03CR) 10JMeybohm: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/25207/" [puppet] - 10https://gerrit.wikimedia.org/r/628786 (owner: 10Effie Mouzeli) [12:11:30] kart_: normally yeah, is that table intended to be everywhere or just testwiki? [12:11:38] @Urbanecm ^ [12:11:53] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata:worker.yaml fix typo in push-notifications [puppet] - 10https://gerrit.wikimedia.org/r/628786 (owner: 10Effie Mouzeli) [12:12:04] (03CR) 10Effie Mouzeli: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1001/25207/kubernetes1008.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/628786 (owner: 10Effie Mouzeli) [12:12:32] marostegui: testwiki. It seems CX shared tables on wikishared is creating issue with draft purge, so keeping out testwiki seems better. 
[12:12:52] Ah only wikishared, gotcha :) [12:12:58] https://phabricator.wikimedia.org/T263417 [12:13:35] kart_: yeah, I was worried about that table being created on sX sections and not knowing whether it can be replicated to labs hosts or not. but if it is x1, that's ok, yeah [12:17:19] (03PS2) 10Jbond: profile::base::puppet: remove export_p12 function [puppet] - 10https://gerrit.wikimedia.org/r/628781 (https://phabricator.wikimedia.org/T253957) [12:18:37] kart_: yes? How may I help? [12:18:41] (03PS2) 10Jbond: sslcert::x509_to_p12: add owner/group parameters to manage the p12 file [puppet] - 10https://gerrit.wikimedia.org/r/628782 [12:19:42] (03CR) 10jerkins-bot: [V: 04-1] sslcert::x509_to_p12: add owner/group parameters to manage the p12 file [puppet] - 10https://gerrit.wikimedia.org/r/628782 (owner: 10Jbond) [12:20:24] (03PS3) 10Jbond: sslcert::x509_to_p12: add owner/group parameters to manage the p12 file [puppet] - 10https://gerrit.wikimedia.org/r/628782 (https://phabricator.wikimedia.org/T253957) [12:21:19] (03CR) 10Effie Mouzeli: [C: 03+2] lvs::configuration: add push-notifications patch 3/4 [puppet] - 10https://gerrit.wikimedia.org/r/624014 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [12:21:20] (03CR) 10CDanis: [C: 03+1] hieradata: bump swift object replicator concurrency [puppet] - 10https://gerrit.wikimedia.org/r/628095 (https://phabricator.wikimedia.org/T261633) (owner: 10Filippo Giunchedi) [12:21:26] (03PS3) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 3/4 [puppet] - 10https://gerrit.wikimedia.org/r/624014 (https://phabricator.wikimedia.org/T256973) [12:21:33] (03Abandoned) 10CDanis: repool eqiad [dns] - 10https://gerrit.wikimedia.org/r/627922 (owner: 10CDanis) [12:23:27] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/628782 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [12:26:17] !log Set innodb_change_buffering = all; on db2129 (s6 master) for performance 
testing T263443 [12:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:22] T263443: Evaluate the impact of changing innodb_change_buffering to inserts - https://phabricator.wikimedia.org/T263443 [12:26:49] !log Set innodb_change_buffering = all; on db2071 (s1 slave) for performance testing T263443 [12:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:02] (03CR) 10Ema: [C: 03+1] "LGTM modulo a missing \ in the regexp -- office\." [puppet] - 10https://gerrit.wikimedia.org/r/627629 (https://phabricator.wikimedia.org/T257527) (owner: 10CDanis) [12:29:11] Urbanecm: Sorry, was bit afk. will put CX table creation task for next backport/config window (Do we have short name yet for this?). [12:30:06] I usually go for B&C :) [12:30:42] (03PS3) 10Jbond: profile::base::puppet: remove export_p12 function [puppet] - 10https://gerrit.wikimedia.org/r/628781 (https://phabricator.wikimedia.org/T253957) [12:30:44] (03PS4) 10Jbond: sslcert::x509_to_p12: add owner/group parameters to manage the p12 file [puppet] - 10https://gerrit.wikimedia.org/r/628782 (https://phabricator.wikimedia.org/T253957) [12:30:46] (03PS2) 10Jbond: base::expose_puppet_certs: update p12 interface [puppet] - 10https://gerrit.wikimedia.org/r/628783 (https://phabricator.wikimedia.org/T253957) [12:31:49] IIRC tables creations can be done at any time, but @marostegui would know more. I also think you need to let DBAs know if the table contain private data. 
[12:32:12] (03CR) 10Muehlenhoff: [C: 03+2] Remove the Yubikey HSM classes [puppet] - 10https://gerrit.wikimedia.org/r/628743 (owner: 10Muehlenhoff) [12:32:15] (03CR) 10jerkins-bot: [V: 04-1] base::expose_puppet_certs: update p12 interface [puppet] - 10https://gerrit.wikimedia.org/r/628783 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [12:32:19] Urbanecm: yeah, exactly, I saw the "new table" but didn't recall any ticket recently about new tables [12:32:45] (03CR) 10Jbond: [C: 03+2] "This function is unused, removing" [puppet] - 10https://gerrit.wikimedia.org/r/628781 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [12:33:25] marostegui: sure. I'll point to task and the issue before :) [12:34:03] marostegui: just to improve my knowledge, will something bad happen when private tables are created w/o all green from you? [12:34:22] (I don't plan to create any right now, I am curious) [12:34:56] (03PS5) 10Jbond: sslcert::x509_to_p12: add owner/group parameters to manage the p12 file [puppet] - 10https://gerrit.wikimedia.org/r/628782 (https://phabricator.wikimedia.org/T253957) [12:35:01] Urbanecm: No, we have a safety measure, even if the table is replicated, there's no view associated to it, so nothing will be able to query until the view is created [12:35:05] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/628782 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [12:35:19] kart_: Do you have the table creation task somewhere handy? 
[12:35:27] (03PS2) 10Jbond: profile::hadoop::common: migrate to base_exspose_puppet_cert [puppet] - 10https://gerrit.wikimedia.org/r/628787 (https://phabricator.wikimedia.org/T253957) [12:35:56] (03PS3) 10Jbond: base::expose_puppet_certs: update p12 interface [puppet] - 10https://gerrit.wikimedia.org/r/628783 (https://phabricator.wikimedia.org/T253957) [12:36:18] Urbanecm: We don't replicate x1 at all, but even with that, we'd prefer to know if there is a table that shouldnt be replicated there [12:36:34] (03CR) 10jerkins-bot: [V: 04-1] profile::hadoop::common: migrate to base_exspose_puppet_cert [puppet] - 10https://gerrit.wikimedia.org/r/628787 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [12:36:36] Just in case we decide to start including x1 on labsdb hosts...so we prefer to filter it [12:37:04] (03CR) 10jerkins-bot: [V: 04-1] base::expose_puppet_certs: update p12 interface [puppet] - 10https://gerrit.wikimedia.org/r/628783 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [12:37:50] (03PS5) 10JMeybohm: lvs::configuration: add push-notifications patch 4/4 [puppet] - 10https://gerrit.wikimedia.org/r/623773 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [12:38:13] (03PS4) 10Effie Mouzeli: Add discovery records for push-notifications [dns] - 10https://gerrit.wikimedia.org/r/623544 (https://phabricator.wikimedia.org/T256973) [12:38:50] !log set same OSPF metric on both eqiad/codfw links - T263230 [12:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:55] T263230: Set the same OSPF weight on eqiad/codfw wavelenghts - https://phabricator.wikimedia.org/T263230 [12:39:18] (03PS1) 10KartikMistry: ContentTranslation: Move testwiki off extension1 cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628790 (https://phabricator.wikimedia.org/T263417) [12:39:23] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/628782 
(https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [12:40:40] marostegui: https://phabricator.wikimedia.org/T263417 - just added you as subscriber for context. [12:41:39] thank you, I will comment there [12:41:53] (03CR) 10Effie Mouzeli: [C: 03+2] lvs::configuration: add push-notifications patch 4/4 [puppet] - 10https://gerrit.wikimedia.org/r/623773 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [12:42:30] marostegui: Cool. No hurry! [12:43:14] 10Operations, 10netops: Set the same OSPF weight on eqiad/codfw wavelenghts - https://phabricator.wikimedia.org/T263230 (10ayounsi) 05Open→03Resolved [12:43:16] 10Operations, 10Traffic, 10netops, 10Epic: Capacity planning for (& optimization of) transport backhaul vs edge egress - https://phabricator.wikimedia.org/T263275 (10ayounsi) [12:47:13] (03PS5) 10JMeybohm: Add discovery records for push-notifications [dns] - 10https://gerrit.wikimedia.org/r/623544 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [12:48:09] (03CR) 10JMeybohm: [C: 03+1] "🎉" [dns] - 10https://gerrit.wikimedia.org/r/623544 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [12:49:51] (03PS4) 10Jbond: base::expose_puppet_certs: update p12 interface [puppet] - 10https://gerrit.wikimedia.org/r/628783 (https://phabricator.wikimedia.org/T253957) [12:50:44] (03CR) 10Effie Mouzeli: [C: 03+2] Add discovery records for push-notifications [dns] - 10https://gerrit.wikimedia.org/r/623544 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [12:52:06] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/628768 (owner: 10Jbond) [12:55:30] (03PS12) 10CDanis: Serve Network Error Logging headers on group0 [puppet] - 10https://gerrit.wikimedia.org/r/627629 (https://phabricator.wikimedia.org/T257527) [12:55:50] (03CR) 10Jbond: "PCC shows (essentially a noop) the addition of the password parameter and the addition of managing two new files which are set to absent" 
[puppet] - 10https://gerrit.wikimedia.org/r/628783 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [12:56:05] (03PS3) 10Jbond: profile::hadoop::common: migrate to base_exspose_puppet_cert [puppet] - 10https://gerrit.wikimedia.org/r/628787 (https://phabricator.wikimedia.org/T253957) [12:59:07] !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=push-notifications,name=codfw [12:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:42] (03CR) 10Ottomata: [C: 03+1] lvs: Remove eventgate-analytics non-TLS endpoint from LVS 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/627530 (https://phabricator.wikimedia.org/T255870) (owner: 10JMeybohm) [13:04:56] (03CR) 10Ottomata: [C: 03+1] lvs: Remove eventgate-analytics non-TLS endpoint from LVS 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/627529 (https://phabricator.wikimedia.org/T255870) (owner: 10JMeybohm) [13:05:10] (03CR) 10Ottomata: [C: 03+1] lvs: Remove eventgate-main non-TLS endpoint from LVS 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/627537 (https://phabricator.wikimedia.org/T255873) (owner: 10JMeybohm) [13:05:17] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki) @JMeybohm and I finished up, you can reach the production environments by using `curl -k https... [13:05:28] (03CR) 10Ottomata: [C: 03+1] lvs: Remove eventgate-main non-TLS endpoint from LVS 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/627538 (https://phabricator.wikimedia.org/T255873) (owner: 10JMeybohm) [13:06:25] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10Mholloway) Thank you, @jijiki! 
[13:09:55] 10Operations, 10serviceops: validate that profile::lvs::realserver::pools has valid hostnames - https://phabricator.wikimedia.org/T263454 (10jijiki) [13:10:04] 10Operations, 10serviceops: validate that profile::lvs::realserver::pools has valid hostnames - https://phabricator.wikimedia.org/T263454 (10jijiki) p:05Triage→03Medium [13:10:30] (03CR) 10Elukey: [C: 03+1] admin: remove razzi from analytics_admins_members as they are in ops [puppet] - 10https://gerrit.wikimedia.org/r/628768 (owner: 10Jbond) [13:11:47] (03CR) 10Elukey: [C: 03+1] sslcert::x509_to_p12: add owner/group parameters to manage the p12 file [puppet] - 10https://gerrit.wikimedia.org/r/628782 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [13:12:15] (03CR) 10Jbond: [C: 03+2] sslcert::x509_to_p12: add owner/group parameters to manage the p12 file [puppet] - 10https://gerrit.wikimedia.org/r/628782 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [13:13:33] (03PS13) 10CDanis: Serve Network Error Logging headers on group0 [puppet] - 10https://gerrit.wikimedia.org/r/627629 (https://phabricator.wikimedia.org/T257527) [13:14:33] (03CR) 10Elukey: base::expose_puppet_certs: update p12 interface (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/628783 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [13:15:09] 10Operations, 10netops: Consider balancing VRRP primaries to cr1/cr2 - https://phabricator.wikimedia.org/T263212 (10faidon) BTW, one dangerous impact of this (as with all ECMP!) is that it would harder to notice a situation where we don't have enough capacity to carry regular amounts of traffic when one of the... 
[13:15:44] (03CR) 10CDanis: [C: 03+2] Serve Network Error Logging headers on group0 [puppet] - 10https://gerrit.wikimedia.org/r/627629 (https://phabricator.wikimedia.org/T257527) (owner: 10CDanis) [13:21:07] !log Set innodb_change_buffering = inserts; on db2081 (s8 slave) for performance testing T263443 [13:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:12] T263443: Evaluate the impact of changing innodb_change_buffering to inserts - https://phabricator.wikimedia.org/T263443 [13:21:45] (03CR) 10Jbond: [C: 03+2] base::expose_puppet_certs: update p12 interface [puppet] - 10https://gerrit.wikimedia.org/r/628783 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [13:21:47] !log installing glib-networking security updates for Stretch [13:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:30] (03CR) 10Jbond: "Sorry +2 before seeing your comment however see inline dir is created" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/628783 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [13:24:00] 10Operations, 10conftool: dbctl: instance edit should format all the data in the normal yaml key/value mapping - https://phabricator.wikimedia.org/T263458 (10Kormat) [13:25:06] 10Operations, 10conftool: dbctl: instance edit should format all the data in the normal yaml key/value mapping - https://phabricator.wikimedia.org/T263458 (10Kormat) p:05Triage→03Medium [13:26:22] 10Operations: Allow easier ICU transitions in MediaWiki - https://phabricator.wikimedia.org/T263437 (10Reedy) Numerous tasks for stuff like this, which kind of includes {T37378} (not exactly the same), but could help in similar situations, including transition between 1 collation and another on a wiki (again, th... [13:27:19] PROBLEM - Check systemd state on ms-be2052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:27:36] (03PS5) 10Jbond: base::expose_puppet_certs: update p12 interface [puppet] - 10https://gerrit.wikimedia.org/r/628783 (https://phabricator.wikimedia.org/T253957) [13:29:42] 10Operations, 10Traffic, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10elukey) > are the netflow boxes and also the Analytics pipelines involved going to be okay if we are sending a great number of more flows? Do we have a high level estimate of what will... [13:29:52] (03CR) 10Jbond: [C: 03+2] base::expose_puppet_certs: update p12 interface [puppet] - 10https://gerrit.wikimedia.org/r/628783 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [13:30:15] (03PS4) 10Jbond: profile::hadoop::common: migrate to base_exspose_puppet_cert [puppet] - 10https://gerrit.wikimedia.org/r/628787 (https://phabricator.wikimedia.org/T253957) [13:30:50] (03PS1) 10Giuseppe Lavagetto: services: add TLS encrypted endpoint for ores (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/628799 (https://phabricator.wikimedia.org/T244843) [13:30:52] (03PS1) 10Giuseppe Lavagetto: services: add TLS encrypted endpoint for ores (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/628800 (https://phabricator.wikimedia.org/T244843) [13:30:54] (03PS1) 10Giuseppe Lavagetto: services: use TLS to connect to ORES [puppet] - 10https://gerrit.wikimedia.org/r/628801 (https://phabricator.wikimedia.org/T244843) [13:30:56] (03PS1) 10Giuseppe Lavagetto: services: retire the ORES http endpoint (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/628802 (https://phabricator.wikimedia.org/T244843) [13:30:58] (03PS1) 10Giuseppe Lavagetto: services: retire the ORES http endpoint (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/628803 (https://phabricator.wikimedia.org/T244843) [13:31:37] (03CR) 10Elukey: [C: 03+1] base::expose_puppet_certs: update p12 interface (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/628783 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [13:33:00] (03PS5) 10Jbond: profile::hadoop::common: migrate to base_exspose_puppet_cert [puppet] - 10https://gerrit.wikimedia.org/r/628787 (https://phabricator.wikimedia.org/T253957) [13:35:00] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/25213/analytics1028.eqiad.wmnet/change.analytics1028.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/628787 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [13:35:40] (03CR) 10Elukey: profile::hadoop::common: migrate to base_exspose_puppet_cert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/628787 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [13:37:31] (03PS6) 10Jbond: profile::hadoop::common: migrate to base_exspose_puppet_cert [puppet] - 10https://gerrit.wikimedia.org/r/628787 (https://phabricator.wikimedia.org/T253957) [13:38:36] (03CR) 10Jbond: "updated: pcc (running) https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/25215" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/628787 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [13:44:51] PROBLEM - Check systemd state on ms-be2024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:55] 10Operations, 10conftool, 10User-Kormat: dbctl: add way to see list all servers in a section and see what groups they are in - https://phabricator.wikimedia.org/T263460 (10Kormat) p:05Triage→03Medium [13:45:23] 10Operations, 10conftool, 10User-Kormat: dbctl: instance edit workflow will throw away all changes if you forget to use sudo - https://phabricator.wikimedia.org/T263459 (10Kormat) [13:46:39] (03CR) 10Jbond: "The PCC (https://puppet-compiler.wmflabs.org/compiler1001/25215/analytics1028.eqiad.wmnet/index.html) Make things loko a bit worse then th" [puppet] - 10https://gerrit.wikimedia.org/r/628787 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [13:47:34] (03CR) 10Elukey: [C: 03+1] "> Patch Set 6:" [puppet] - 10https://gerrit.wikimedia.org/r/628787 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [13:49:25] (03CR) 10Jbond: [C: 03+2] profile::hadoop::common: migrate to base_exspose_puppet_cert [puppet] - 10https://gerrit.wikimedia.org/r/628787 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [13:50:48] (03PS1) 10Hnowlan: api-gateway: document values a bit better. [deployment-charts] - 10https://gerrit.wikimedia.org/r/628804 (https://phabricator.wikimedia.org/T254916) [13:52:56] 10Operations, 10conftool: dbctl: instance edit should format all the data in the normal yaml key/value mapping - https://phabricator.wikimedia.org/T263458 (10Kormat) I just noticed that if an instance is already in some groups, the formatting is different again: ` # Editing object codfw/db2117 host_ip: 10.192.... 
[13:58:21] !log kormat@cumin1001 dbctl commit (dc=all): 'Depool db2117 for schema change, add db2124 to dump/vslow in the interim T259831', diff saved to https://phabricator.wikimedia.org/P12687 and previous config saved to /var/cache/conftool/dbconfig/20200921-135821-kormat.json [13:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:26] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [14:00:04] !log installing Java security updates on restbase/sessionstore* [14:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:45] (03PS1) 10Jbond: sslcert::x509_to_pkcs12: check for a valid p12 instead of just a file [puppet] - 10https://gerrit.wikimedia.org/r/628827 [14:04:03] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/25217/" [puppet] - 10https://gerrit.wikimedia.org/r/628827 (owner: 10Jbond) [14:04:05] RECOVERY - Check systemd state on ms-be2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:05] (03PS1) 10Elukey: profile::presto::server: allow the usage of local puppet TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/628829 (https://phabricator.wikimedia.org/T253957) [14:11:29] !log disconnecting mgmt on msw-d6-codfw to re-do cable end T263138 [14:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:34] T263138: codfw: bad mgmt cable end msw-d6, msw-c1 - https://phabricator.wikimedia.org/T263138 [14:14:17] PROBLEM - Host db2140.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:14:17] PROBLEM - Host dbproxy2004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:15:07] PROBLEM - Host ps1-d6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:16:39] RECOVERY - Host ps1-d6-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.73 ms [14:17:21] (03PS2) 10Elukey: profile::presto::server: allow the usage of 
local puppet TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/628829 (https://phabricator.wikimedia.org/T253957) [14:17:23] !log kormat@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 25%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12688 and previous config saved to /var/cache/conftool/dbconfig/20200921-141722-kormat.json [14:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:32] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [14:19:34] (03PS1) 10Herron: prometheus: point prometheus.svc.eqsin to prometheus5001 [dns] - 10https://gerrit.wikimedia.org/r/628847 (https://phabricator.wikimedia.org/T243057) [14:20:11] RECOVERY - Host db2140.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.98 ms [14:20:11] RECOVERY - Host dbproxy2004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.89 ms [14:21:35] !log Set innodb_change_buffering = inserts; on db2125 (s2 slave) for performance testing T263443 [14:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:41] T263443: Evaluate the impact of changing innodb_change_buffering to inserts - https://phabricator.wikimedia.org/T263443 [14:22:13] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Kormat) [14:24:03] !log disconnecting mgmt on msw-c1-codfw to re-do cable end T263138 [14:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:08] T263138: codfw: bad mgmt cable end msw-d6, msw-c1 - https://phabricator.wikimedia.org/T263138 [14:27:19] (03PS1) 10Ladsgroup: Introduce and use StatsdMonitoring trait in term store [extensions/Wikibase] (wmf/1.36.0-wmf.9) - 10https://gerrit.wikimedia.org/r/628808 (https://phabricator.wikimedia.org/T262923) [14:28:19] PROBLEM - Host es2031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% 
[14:28:28] mmmm [14:28:30] ah [14:28:31] PROBLEM - Host ps1-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:28:31] PROBLEM - Host cloudcontrol2003-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:28:31] PROBLEM - Host cloudservices2002-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:28:32] mgmt, pheeew [14:28:43] PROBLEM - Host pc2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:28:53] PROBLEM - Host db2077.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:29:07] PROBLEM - Host ores2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:29:41] PROBLEM - Host scs-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:29:53] RECOVERY - Host ps1-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 35.51 ms [14:30:18] 10Operations, 10observability, 10Patch-For-Review: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (10herron) [14:30:29] !log moving prometheus from bast5001 to prometheus5001 T243057 [14:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:35] T243057: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 [14:32:26] !log kormat@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 50%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12689 and previous config saved to /var/cache/conftool/dbconfig/20200921-143226-kormat.json [14:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:31] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [14:32:39] (03CR) 10Herron: [C: 03+2] prometheus: point prometheus.svc.eqsin to prometheus5001 [dns] - 10https://gerrit.wikimedia.org/r/628847 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [14:32:47] RECOVERY - Check systemd state on ms-be2024 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:58] (03CR) 10Elukey: [C: 03+2] profile::presto::server: allow the usage of local puppet TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/628829 (https://phabricator.wikimedia.org/T253957) (owner: 10Elukey) [14:33:43] (03PS3) 10Filippo Giunchedi: am: use status.cgi JSON as source for problems [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/628090 [14:34:17] RECOVERY - Host es2031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.00 ms [14:34:31] RECOVERY - Host cloudcontrol2003-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.67 ms [14:34:31] RECOVERY - Host cloudservices2002-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.60 ms [14:34:51] RECOVERY - Host db2077.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [14:35:07] RECOVERY - Host ores2005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [14:35:17] RECOVERY - Host pc2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.97 ms [14:35:39] RECOVERY - Host scs-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms [14:36:19] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:36:20] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:53] !log installing qemu security updates on ganeti2011 and gnt-instance reboot debmonitor2001 [14:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:29] !log firmware upgrade on db2127 [14:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:21] (03PS1) 10Elukey: profile::presto::server: fix ownership of ssl directory [puppet] - 10https://gerrit.wikimedia.org/r/628848 [14:39:28] (03CR) 10Elukey: [C: 03+2] profile::presto::server: fix ownership of ssl directory [puppet] - 
10https://gerrit.wikimedia.org/r/628848 (owner: 10Elukey) [14:40:48] !log installing qemu security updates on ganeti* stretch nodes [14:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:11] PROBLEM - Host mw2256.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:42:53] (03PS1) 10Jbond: cfssl: update client class to use new config define [puppet] - 10https://gerrit.wikimedia.org/r/628849 (https://phabricator.wikimedia.org/T259117) [14:43:47] PROBLEM - Check systemd state on ms-be2049 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:00] (03PS1) 10Elukey: presto: use puppet host TLS certificates by default [puppet] - 10https://gerrit.wikimedia.org/r/628850 (https://phabricator.wikimedia.org/T253957) [14:45:54] (03PS1) 10Jbond: pki::client: key needs to be 16 char hex [labs/private] - 10https://gerrit.wikimedia.org/r/628851 [14:45:58] (03CR) 10Jbond: [V: 03+2 C: 03+2] pki::client: key needs to be 16 char hex [labs/private] - 10https://gerrit.wikimedia.org/r/628851 (owner: 10Jbond) [14:47:30] !log kormat@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 75%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12690 and previous config saved to /var/cache/conftool/dbconfig/20200921-144729-kormat.json [14:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:35] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [14:49:24] (03PS2) 10Elukey: presto: use puppet host TLS certificates by default [puppet] - 10https://gerrit.wikimedia.org/r/628850 (https://phabricator.wikimedia.org/T253957) [14:49:32] (03PS2) 10Jbond: cfssl: update client class to use new config define [puppet] - 10https://gerrit.wikimedia.org/r/628849 (https://phabricator.wikimedia.org/T259117) [14:49:59] PROBLEM - Prometheus 
prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [14:50:13] PROBLEM - SSH on ms-be2054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:51:22] 10Operations, 10Product-Infrastructure-Data, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10CDanis) [14:52:33] (03PS3) 10Jbond: cfssl: update client class to use new config define [puppet] - 10https://gerrit.wikimedia.org/r/628849 (https://phabricator.wikimedia.org/T259117) [14:53:22] (03PS1) 10Herron: prometheus: reduce prometheus.svc.eqsin TTL to 5M [dns] - 10https://gerrit.wikimedia.org/r/628853 [14:53:27] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=thanos-sidecar site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:53:52] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/25224/" [puppet] - 10https://gerrit.wikimedia.org/r/628850 (https://phabricator.wikimedia.org/T253957) (owner: 10Elukey) [14:53:57] RECOVERY - SSH on ms-be2054 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:54:47] (03PS4) 10Jbond: cfssl: update client class to use new config define [puppet] - 10https://gerrit.wikimedia.org/r/628849 (https://phabricator.wikimedia.org/T259117) [14:55:02] that prometheus alert should clear shortly, just migrated that instance from the eqsin bastion to prometheus5001 [14:58:46] (03PS5) 10Jbond: cfssl: update client class to 
use new config define [puppet] - 10https://gerrit.wikimedia.org/r/628849 (https://phabricator.wikimedia.org/T259117) [15:00:04] (03PS6) 10Jbond: cfssl: update client class to use new config define [puppet] - 10https://gerrit.wikimedia.org/r/628849 (https://phabricator.wikimedia.org/T259117) [15:01:24] (03CR) 10Jbond: [C: 03+2] cfssl: update client class to use new config define [puppet] - 10https://gerrit.wikimedia.org/r/628849 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [15:01:42] (03Abandoned) 10Jbond: (WIP) cfssl::config: create generic define for generating configs [puppet] - 10https://gerrit.wikimedia.org/r/628135 (owner: 10Jbond) [15:02:33] !log kormat@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 100%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12691 and previous config saved to /var/cache/conftool/dbconfig/20200921-150233-kormat.json [15:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:39] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [15:04:57] 10Operations, 10ops-codfw, 10DBA: db2127 memory errors - https://phabricator.wikimedia.org/T262247 (10Papaul) @Marostegui firmware upgrade complete [15:05:27] 10Operations, 10ops-codfw, 10DBA: db2127 memory errors - https://phabricator.wikimedia.org/T262247 (10Marostegui) Thank you @Papaul - I will take it from here! 
[15:07:38] !log installing libx11 security updates on stretch [15:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:01] !log rolling restart of mw canaries in codfw to pick up libx11 update [15:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:00] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10Mholloway) Looking at other service definitions in mediawiki-config [[ https://github.com/wikimedia/op... [15:12:10] !log kormat@cumin1001 dbctl commit (dc=all): 'Take db2124 back out of dump/vslow T259831', diff saved to https://phabricator.wikimedia.org/P12692 and previous config saved to /var/cache/conftool/dbconfig/20200921-151210-kormat.json [15:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:16] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [15:12:39] RECOVERY - Check systemd state on ms-be2049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:47] PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:01] RECOVERY - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [15:20:46] (03PS1) 10Muehlenhoff: Also extend Parsoid canaries to codfw [puppet] - 10https://gerrit.wikimedia.org/r/628856 [15:23:24] (03PS1) 10Andrew Bogott: Added snakeoil keydata for ceph glance client [labs/private] - 10https://gerrit.wikimedia.org/r/628858 (https://phabricator.wikimedia.org/T263461) [15:24:27] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart [15:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:38] (03CR) 10Ladsgroup: [C: 03+2] Introduce and use StatsdMonitoring trait in term store [extensions/Wikibase] (wmf/1.36.0-wmf.9) - 10https://gerrit.wikimedia.org/r/628808 (https://phabricator.wikimedia.org/T262923) (owner: 10Ladsgroup) [15:24:43] !log roll-restarting restbase-dev for java security updates [15:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:36] (03PS1) 10Andrew Bogott: cloudcontrol eqiad1: add ceph access for Glance [puppet] - 10https://gerrit.wikimedia.org/r/628861 (https://phabricator.wikimedia.org/T263461) [15:25:39] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Added snakeoil keydata for ceph glance client [labs/private] - 10https://gerrit.wikimedia.org/r/628858 (https://phabricator.wikimedia.org/T263461) (owner: 10Andrew Bogott) [15:27:12] Quickly deploying this stastd thingy for term store: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/628808 [15:27:52] 10Operations, 10Product-Infrastructure-Data, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10CDanis) As of about 13:30 UTC today, we started serving these response headers on [[... 
[15:29:51] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10Joe) >>! In T256973#6480180, @Mholloway wrote: > Looking at other service definitions in mediawiki-con... [15:32:19] (03PS1) 10Jbond: cfssl: refactor server functionality out of cfssl [puppet] - 10https://gerrit.wikimedia.org/r/628862 (https://phabricator.wikimedia.org/T259117) [15:34:44] RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:34:46] (03PS2) 10Andrew Bogott: cloudcontrol eqiad1: add ceph access for Glance [puppet] - 10https://gerrit.wikimedia.org/r/628861 (https://phabricator.wikimedia.org/T263461) [15:36:22] PROBLEM - Check systemd state on debmonitor2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:38:22] (03PS1) 10Andrew Bogott: Renamed profile::ceph::client::rbd::glance_client_keydata [labs/private] - 10https://gerrit.wikimedia.org/r/628865 (https://phabricator.wikimedia.org/T263461) [15:38:34] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Renamed profile::ceph::client::rbd::glance_client_keydata [labs/private] - 10https://gerrit.wikimedia.org/r/628865 (https://phabricator.wikimedia.org/T263461) (owner: 10Andrew Bogott) [15:39:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2127 (re)pooling @ 25%: Slowly repool after on-site maintenance T262247 ', diff saved to https://phabricator.wikimedia.org/P12693 and previous config saved to /var/cache/conftool/dbconfig/20200921-153923-root.json [15:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:28] kormat: ^ <3 [15:39:28] T262247: db2127 memory errors - https://phabricator.wikimedia.org/T262247 [15:39:52] marostegui: \o/ 
[15:42:31] (03PS1) 10Andrew Bogott: Further attempts to get this key in the right place [labs/private] - 10https://gerrit.wikimedia.org/r/628869 (https://phabricator.wikimedia.org/T263461) [15:42:41] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Further attempts to get this key in the right place [labs/private] - 10https://gerrit.wikimedia.org/r/628869 (https://phabricator.wikimedia.org/T263461) (owner: 10Andrew Bogott) [15:42:49] (03CR) 10BBlack: [C: 03+1] Migrate ulsfo public records to automated DNS [dns] - 10https://gerrit.wikimedia.org/r/628046 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [15:42:58] (03CR) 10BBlack: [C: 03+1] Migrate ulsfo private records to automated DNS [dns] - 10https://gerrit.wikimedia.org/r/627605 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [15:43:32] (03PS2) 10Jbond: cfssl: refactor server functionality out of cfssl [puppet] - 10https://gerrit.wikimedia.org/r/628862 (https://phabricator.wikimedia.org/T259117) [15:44:57] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) [15:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:59] (03Merged) 10jenkins-bot: Introduce and use StatsdMonitoring trait in term store [extensions/Wikibase] (wmf/1.36.0-wmf.9) - 10https://gerrit.wikimedia.org/r/628808 (https://phabricator.wikimedia.org/T262923) (owner: 10Ladsgroup) [15:50:12] !log ladsgroup@deploy1001 Synchronized php-1.36.0-wmf.9/extensions/Wikibase/lib/includes/Store/Sql/Terms/Util/StatsdMonitoring.php: [[gerrit:628808|Introduce and use StatsdMonitoring trait in term store (T262923), Part I]] (duration: 00m 59s) [15:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:18] T262923: Investigate sensible DB load group splits for Wikidata / Wikibase - https://phabricator.wikimedia.org/T262923 [15:50:24] (03PS3) 10Jbond: cfssl: refactor server functionality out of cfssl [puppet] - 
10https://gerrit.wikimedia.org/r/628862 (https://phabricator.wikimedia.org/T259117) [15:51:36] !log ladsgroup@deploy1001 Synchronized php-1.36.0-wmf.9/extensions/Wikibase/lib/includes/Store/Sql/Terms/: [[gerrit:628808|Introduce and use StatsdMonitoring trait in term store (T262923), Part I]] (duration: 00m 56s) [15:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:39] (03CR) 10Jbond: [C: 03+2] cfssl: refactor server functionality out of cfssl [puppet] - 10https://gerrit.wikimedia.org/r/628862 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [15:53:32] (03CR) 10Dzahn: "there are actually 4 canaries, 2 per DC, wtp1025,wtp1026,wtp2001,wtp2002" [puppet] - 10https://gerrit.wikimedia.org/r/628856 (owner: 10Muehlenhoff) [15:54:21] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.01 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:54:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2127 (re)pooling @ 50%: Slowly repool after on-site maintenance T262247 ', diff saved to https://phabricator.wikimedia.org/P12694 and previous config saved to /var/cache/conftool/dbconfig/20200921-155426-root.json [15:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:33] T262247: db2127 memory errors - https://phabricator.wikimedia.org/T262247 [15:55:53] mmmh puppet failures, seems mostly on the misc cluster [15:56:56] andrewbogott: seems related to the cloudcephmon hosts [15:57:14] ok, looking [15:57:23] (03PS3) 10Andrew Bogott: cloudcontrol eqiad1: add ceph client for Glance [puppet] - 10https://gerrit.wikimedia.org/r/628861 (https://phabricator.wikimedia.org/T263461) [15:57:25] (03PS1) 10Andrew Bogott: OpenStack glance: set the default backend to rbd [puppet] - 10https://gerrit.wikimedia.org/r/628872 (https://phabricator.wikimedia.org/T263461) [15:57:25] see warnings in 
icinga [15:57:40] (03PS2) 10JMeybohm: services_proxy: add push-notifications [puppet] - 10https://gerrit.wikimedia.org/r/623790 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [16:00:18] (03PS1) 10Jbond: cfssl::config: drop client defaults for usages and expiry as they are ignored [puppet] - 10https://gerrit.wikimedia.org/r/628873 (https://phabricator.wikimedia.org/T259117) [16:02:34] (03CR) 10Jbond: [C: 03+2] cfssl::config: drop client defaults for usages and expiry as they are ignored [puppet] - 10https://gerrit.wikimedia.org/r/628873 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [16:08:32] !log jayme@cumin1001 START - Cookbook sre.dns.netbox [16:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:44] thx! <3 [16:09:23] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.00125 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:09:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2127 (re)pooling @ 75%: Slowly repool after on-site maintenance T262247 ', diff saved to https://phabricator.wikimedia.org/P12695 and previous config saved to /var/cache/conftool/dbconfig/20200921-160929-root.json [16:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:36] T262247: db2127 memory errors - https://phabricator.wikimedia.org/T262247 [16:16:21] (03CR) 10Andrew Bogott: [C: 03+2] cloudcontrol eqiad1: add ceph client for Glance [puppet] - 10https://gerrit.wikimedia.org/r/628861 (https://phabricator.wikimedia.org/T263461) (owner: 10Andrew Bogott) [16:16:33] !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:55] !log replacing msw-c8-codfw [16:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:19:45] PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:24:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2127 (re)pooling @ 100%: Slowly repool after on-site maintenance T262247 ', diff saved to https://phabricator.wikimedia.org/P12696 and previous config saved to /var/cache/conftool/dbconfig/20200921-162433-root.json [16:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:41] T262247: db2127 memory errors - https://phabricator.wikimedia.org/T262247 [16:25:29] RECOVERY - Check systemd state on ms-be2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:38] 10Operations, 10ops-codfw, 10netops: (Need by: ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (10Papaul) 05Open→03Resolved [16:28:40] 10Operations, 10ops-codfw, 10netops: (Need by: ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (10Papaul) [16:28:44] 10Operations, 10ops-codfw, 10netops: (Need by: ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (10Papaul) [16:29:02] 10Operations, 10ops-codfw: codfw: bad mgmt cable end msw-d6, msw-c1 - https://phabricator.wikimedia.org/T263138 (10Papaul) 05Open→03Resolved Complete [16:30:54] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10Nuria) a:05fdans→03mforns [16:34:13] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active - Telia, AS1299/IPv4: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:35:17] (03PS1) 10Bstorm: toolforge: use locales::all for the grid [puppet] - 10https://gerrit.wikimedia.org/r/628879 
(https://phabricator.wikimedia.org/T263339) [16:46:12] (03CR) 10BryanDavis: [C: 03+1] Create wiki replica views for MachineVision extension tables [puppet] - 10https://gerrit.wikimedia.org/r/623775 (https://phabricator.wikimedia.org/T238574) (owner: 10Cparle) [16:46:45] (03CR) 10Dzahn: [C: 03+2] "approved in SRE meeting" [puppet] - 10https://gerrit.wikimedia.org/r/628196 (https://phabricator.wikimedia.org/T263191) (owner: 10Nskaggs) [16:46:47] !log shutting down ms-be2019 for BBU replacing [16:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:53] PROBLEM - Host ms-be2019 is DOWN: PING CRITICAL - Packet loss = 100% [16:50:44] (03CR) 10Dzahn: "@Nskaggs Feel free to try again and run a random command in Icinga web UI. if it doesn't work it could be capitalization of user name" [puppet] - 10https://gerrit.wikimedia.org/r/628196 (https://phabricator.wikimedia.org/T263191) (owner: 10Nskaggs) [16:52:26] (03CR) 10Ppchelko: "Hm... I totally get the desire and share it, but I'm on the fence. Can't formulate my concerns either, but replacing explicit allow list w" [deployment-charts] - 10https://gerrit.wikimedia.org/r/628772 (https://phabricator.wikimedia.org/T263045) (owner: 10Hnowlan) [16:57:51] RECOVERY - Host ms-be2019 is UP: PING OK - Packet loss = 0%, RTA = 31.77 ms [16:58:24] (03PS1) 10Mholloway: Update push-notifications config [deployment-charts] - 10https://gerrit.wikimedia.org/r/628881 (https://phabricator.wikimedia.org/T262936) [16:58:29] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:58:51] (03PS2) 10Mholloway: Update push-notifications config [deployment-charts] - 10https://gerrit.wikimedia.org/r/628881 (https://phabricator.wikimedia.org/T262936) [17:00:05] ryankemper: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikidata Query Service weekly deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200921T1700). [17:03:35] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:04:02] (03CR) 10JMeybohm: [C: 04-1] "You will need to bump the chart version in Chart.yaml for this to take effect." [deployment-charts] - 10https://gerrit.wikimedia.org/r/628881 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [17:04:25] (03Abandoned) 10Dzahn: yubiauth: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/628462 (owner: 10Dzahn) [17:07:13] 08Warning Alert for device cr2-eqsin.wikimedia.org - Traffic on tunnel link [17:07:15] (03PS3) 10Mholloway: Update push-notifications config [deployment-charts] - 10https://gerrit.wikimedia.org/r/628881 (https://phabricator.wikimedia.org/T262936) [17:07:19] PROBLEM - Disk space on Hadoop worker on an-worker1084 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/g 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [17:07:29] (03CR) 10Mholloway: "> Patch Set 2: Code-Review-1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/628881 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [17:12:13] 08̶W̶a̶r̶n̶i̶n̶g Device cr2-eqsin.wikimedia.org recovered from Traffic on tunnel link [17:13:04] (03CR) 10Bstorm: "I think I'll install this package by hand on a test server quick to see the difference." 
[puppet] - 10https://gerrit.wikimedia.org/r/628879 (https://phabricator.wikimedia.org/T263339) (owner: 10Bstorm) [17:14:46] (03CR) 10Effie Mouzeli: [C: 03+2] services_proxy: add push-notifications [puppet] - 10https://gerrit.wikimedia.org/r/623790 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [17:15:28] (03CR) 10Effie Mouzeli: [C: 03+2] "pcc https://puppet-compiler.wmflabs.org/compiler1002/25241/mwdebug1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/623790 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [17:17:43] (03CR) 10Jgiannelos: [C: 03+2] "Change looks OK." [deployment-charts] - 10https://gerrit.wikimedia.org/r/628881 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [17:18:54] (03CR) 10Dzahn: [C: 04-1] "Will amend to add all 4 canary servers." [puppet] - 10https://gerrit.wikimedia.org/r/628856 (owner: 10Muehlenhoff) [17:20:13] (03Merged) 10jenkins-bot: Update push-notifications config [deployment-charts] - 10https://gerrit.wikimedia.org/r/628881 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [17:22:46] 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Add U2F/FIDO as second factor for CAS - https://phabricator.wikimedia.org/T233937 (10crusnov) >>! In T233937#5696427, @jbond wrote: > @Volans just asked if there is a way to register multiple u2f devices to the same account. Of the top of my head... [17:23:20] (03CR) 10Dzahn: "wtp2020 does not match the conftool-data definition of which servers are the canaries. what was your source of truth? 
I took it from this:" [puppet] - 10https://gerrit.wikimedia.org/r/628856 (owner: 10Muehlenhoff) [17:23:56] ACKNOWLEDGEMENT - HP RAID on ms-be2019 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T263484 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Inform [17:23:59] 10Operations, 10ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T263484 (10ops-monitoring-bot) [17:24:26] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:25:55] (03PS1) 10Mholloway: Add push-notifications service labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628882 (https://phabricator.wikimedia.org/T262936) [17:25:57] (03PS1) 10Mholloway: Add push-notifications service production config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628883 (https://phabricator.wikimedia.org/T262936) [17:26:20] (03PS2) 10Dzahn: Add all parsoid canary servers to cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/628856 (owner: 10Muehlenhoff) [17:26:30] RECOVERY - Disk space on Hadoop worker on an-worker1084 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [17:26:32] (03CR) 10Mholloway: [C: 04-2] "Hold for deploy window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628882 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [17:26:44] (03CR) 10jerkins-bot: [V: 04-1] Add push-notifications service labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628882 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [17:26:54] (03CR) 10Mholloway: [C: 04-2] "Hold for deploy window" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/628883 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [17:27:35] (03PS1) 10Effie Mouzeli: services_proxy: add push-notifications listener [puppet] - 10https://gerrit.wikimedia.org/r/628884 [17:27:37] 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Add U2F/FIDO as second factor for CAS - https://phabricator.wikimedia.org/T233937 (10jbond) > Just a bit of a note, since I asked for this exact thing on IRC. It'd be cool to be able to select from more than one U2F token. :) As mentioned on IRC,... [17:28:18] (03CR) 10Dzahn: [C: 03+1] "Effie, looks good?" [puppet] - 10https://gerrit.wikimedia.org/r/628856 (owner: 10Muehlenhoff) [17:30:18] (03CR) 10JMeybohm: "PCC https://puppet-compiler.wmflabs.org/compiler1002/25243/" [puppet] - 10https://gerrit.wikimedia.org/r/628884 (owner: 10Effie Mouzeli) [17:31:15] (03CR) 10JMeybohm: [C: 03+1] services_proxy: add push-notifications listener [puppet] - 10https://gerrit.wikimedia.org/r/628884 (owner: 10Effie Mouzeli) [17:31:44] (03CR) 10Effie Mouzeli: [C: 03+1] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/628856 (owner: 10Muehlenhoff) [17:32:16] (03PS3) 10Mholloway: Echo: Set up common push settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628341 (https://phabricator.wikimedia.org/T262936) [17:32:18] (03PS3) 10Mholloway: Echo: Enable push on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628342 (https://phabricator.wikimedia.org/T262936) [17:32:20] (03PS3) 10Mholloway: Echo: Enable push on all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628343 (https://phabricator.wikimedia.org/T262936) [17:33:24] (03CR) 10Effie Mouzeli: [C: 03+2] services_proxy: add push-notifications listener [puppet] - 10https://gerrit.wikimedia.org/r/628884 (owner: 10Effie Mouzeli) [17:34:05] (03CR) 10Dzahn: [C: 03+1] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/628856 (owner: 10Muehlenhoff) 
[17:34:57] 10Operations, 10ops-eqiad, 10serviceops: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (10RobH) a:05Jclark-ctr→03None [17:35:09] (03CR) 10Dzahn: [C: 03+2] "i'll merge this one and we can follow-up moving to other ones along with a conftool-data change if we want multiple rows" [puppet] - 10https://gerrit.wikimedia.org/r/628856 (owner: 10Muehlenhoff) [17:36:55] (03PS4) 10Mholloway: Echo: Set up common push settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628341 (https://phabricator.wikimedia.org/T262936) [17:36:57] (03PS4) 10Mholloway: Echo: Enable push on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628342 (https://phabricator.wikimedia.org/T262936) [17:36:59] (03PS4) 10Mholloway: Echo: Enable push on all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628343 (https://phabricator.wikimedia.org/T262936) [17:37:01] (03PS1) 10Mholloway: Add push-notifications service config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628885 (https://phabricator.wikimedia.org/T262936) [17:37:32] (03Abandoned) 10Mholloway: Add push-notifications service production config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628883 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [17:37:43] (03Abandoned) 10Mholloway: Add push-notifications service labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628882 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [17:38:45] (03CR) 10Mholloway: [C: 04-2] "Hold for deployment window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628885 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [17:39:56] (03PS1) 10Elukey: hadoop: decrease default yarn log retention from 90d to 40d [puppet] - 10https://gerrit.wikimedia.org/r/628887 [17:40:53] (03Abandoned) 10Jbond: pki: only enable client auth for API directory and add fqdn to aliases [puppet] - 10https://gerrit.wikimedia.org/r/625866 (owner: 10Jbond) 
[17:42:46] !log rebooting ps1-a8-codfw firmware upgrade [17:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:06] (03CR) 10Ottomata: [C: 03+1] hadoop: decrease default yarn log retention from 90d to 40d [puppet] - 10https://gerrit.wikimedia.org/r/628887 (owner: 10Elukey) [17:43:13] (03CR) 10Elukey: [C: 03+2] hadoop: decrease default yarn log retention from 90d to 40d [puppet] - 10https://gerrit.wikimedia.org/r/628887 (owner: 10Elukey) [17:47:13] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10JMeybohm) >>! In T256973#6480180, @Mholloway wrote: > Looking at other service definitions in mediawik... [17:49:47] 10Operations, 10ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T262182 (10Papaul) BBU replacement complete [17:51:37] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops: mw2256 - CPU/board hardware issue - https://phabricator.wikimedia.org/T263065 (10Papaul) Drained the power, replaced both PSU's same issue. It has to be the CPU or main-board , pushing the power button on the sever doesn't power on the server. @wiki_will... [17:52:32] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10Mholloway) Excellent! Thank you! [17:55:29] 10Operations, 10ops-codfw: ps1-a8-codfw WebUI unresponsive - https://phabricator.wikimedia.org/T263001 (10Papaul) 05Open→03Resolved Firmware upgrade to version 7.1e fix the issue. 
[17:55:34] PROBLEM - Prometheus prometheus2004/ops restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops [18:00:04] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200921T1800). [18:00:04] RoanKattouw and Evrifaessa: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:12] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack glance: set the default backend to rbd [puppet] - 10https://gerrit.wikimedia.org/r/628872 (https://phabricator.wikimedia.org/T263461) (owner: 10Andrew Bogott) [18:00:15] Also one beta cluster patch for me o/ [18:01:35] Jdlrobson: beta cluster patches don't need backport window [18:01:43] I can merge and rebase it for you [18:02:03] I'll deploy my own changes [18:03:46] Amir1: sorry i'll rephrase - it's a config change but will only impact beta cluster right now as the relevant changes are not live in production [18:03:50] but it should also be safe [18:03:58] (03PS1) 10Ottomata: [WIP] Drop /wmf/data/raw/mediawiki_job and /wmf/data/raw/netflow after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/628895 (https://phabricator.wikimedia.org/T231339) [18:05:17] OK then I'll take care of it [18:05:53] But you do have to tell me what the patch is :) [18:07:18] (03PS8) 10Dzahn: service.yaml: add releases as a service without LVS [puppet] - 10https://gerrit.wikimedia.org/r/623464 [18:08:00] RoanKattouw: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/628897 Define Chinese logo variants for Modern Vector [NEW] [18:08:00] !log add NAT rule to pfw3-codfw -
T263488 [18:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:30] i've added on calendar [18:08:43] (03PS1) 10Jdlrobson: Define Chinese logo variants for Modern Vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628897 (https://phabricator.wikimedia.org/T261153) [18:09:56] (03PS2) 10Jdlrobson: Define Chinese logo variants for Modern Vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628897 (https://phabricator.wikimedia.org/T261153) [18:10:52] (03PS3) 10Jdlrobson: Define Chinese logo variants for Modern Vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628897 (https://phabricator.wikimedia.org/T261153) [18:15:07] (03CR) 10Catrope: [C: 03+2] Define Chinese logo variants for Modern Vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628897 (https://phabricator.wikimedia.org/T261153) (owner: 10Jdlrobson) [18:15:37] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops: mw2256 - CPU/board hardware issue - https://phabricator.wikimedia.org/T263065 (10wiki_willy) Thanks for the initial diagnosis @Papaul Hi @Dzahn - per my earlier comment, are you guys ok moving forward without this server for a couple quarters, until the s... [18:15:53] (03Merged) 10jenkins-bot: Define Chinese logo variants for Modern Vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628897 (https://phabricator.wikimedia.org/T261153) (owner: 10Jdlrobson) [18:18:40] RECOVERY - Prometheus prometheus2004/ops restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops [18:18:46] Jdlrobson: Can you explain how this is a no-op? Is the code for 'variants' not in production yet? 
[18:19:53] RoanKattouw: so the variant key is currently unused and ignored by code in production [18:19:59] OK [18:19:59] in beta cluster it should be working [18:20:14] .. but it's not? [18:20:19] does it need some kind of sync for beta cluster? [18:21:40] !log catrope@deploy1001 Synchronized static/images/mobile/copyright/: Update Chinese logo variants for Modern Vector (T261153) (duration: 00m 56s) [18:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:45] T261153: Chinese Wikipedia’s wordmark and tagline should change based on variant selection - https://phabricator.wikimedia.org/T261153 [18:22:53] (03PS1) 10Volans: dns: add icinga check mode [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/628898 [18:23:11] Jdlrobson: it doesn't need syncing it happens automatically every ten minutes or so [18:23:25] I'd say wait for a bit [18:24:03] ok now i guess i just play the waiting game :) [18:24:12] https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ [18:24:58] it seems the automatic deployment is broken [18:25:02] :((( [18:25:15] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Define Chinese logo variants for Modern Vector (no-op) (T261153) (duration: 00m 57s) [18:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:47] !log catrope@deploy1001 Synchronized wmf-config/CommonSettings.php: Define Chinese logo variants for Modern Vector (no-op) (part 2) (T261153) (duration: 00m 56s) [18:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:51] T261153: Chinese Wikipedia’s wordmark and tagline should change based on variant selection - https://phabricator.wikimedia.org/T261153 [18:27:41] (03PS3) 10Ppchelko: Increase timeouts for connection to eventgate to match envoy config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622863 (https://phabricator.wikimedia.org/T249745) [18:28:54] (03CR) 10Bstorm: 
"https://puppet-compiler.wmflabs.org/compiler1003/25240/tools-sgebastion-08.tools.eqiad.wmflabs/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/628879 (https://phabricator.wikimedia.org/T263339) (owner: 10Bstorm) [18:29:26] (03PS5) 10Catrope: Enable and configure GrowthExperiments on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/625963 (https://phabricator.wikimedia.org/T254239) [18:29:27] RoanKattouw: could you ping me when done with deploys please? I'll sneak one more in if there's still time [18:29:34] (03CR) 10Catrope: [C: 03+2] Enable and configure GrowthExperiments on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/625963 (https://phabricator.wikimedia.org/T254239) (owner: 10Catrope) [18:30:25] (03Merged) 10jenkins-bot: Enable and configure GrowthExperiments on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/625963 (https://phabricator.wikimedia.org/T254239) (owner: 10Catrope) [18:31:12] Pchelolo: Will do [18:31:16] thank you [18:35:08] (03PS2) 10Catrope: Enable and configure GrowthExperiments on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627393 (https://phabricator.wikimedia.org/T255027) [18:35:51] (03CR) 10Catrope: [C: 03+2] Enable and configure GrowthExperiments on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627393 (https://phabricator.wikimedia.org/T255027) (owner: 10Catrope) [18:36:35] (03Merged) 10jenkins-bot: Enable and configure GrowthExperiments on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627393 (https://phabricator.wikimedia.org/T255027) (owner: 10Catrope) [18:42:25] 10Operations, 10Traffic, 10Performance-Team (Radar): experiment with a "unified" ATS-BE pool - https://phabricator.wikimedia.org/T263291 (10Krinkle) [18:42:57] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops: mw2256 - CPU/board hardware issue - https://phabricator.wikimedia.org/T263065 (10Dzahn) @wiki_willy Yes, one server more or less does not hurt us. 
We can just keep it offline and wait for the refresh for now (unless more would start breaking). [18:44:31] doesn [18:44:38] n't seem to be working ... hmm [18:44:57] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops: mw2256 - CPU/board hardware issue - https://phabricator.wikimedia.org/T263065 (10Dzahn) We might as well just turn this into a decom task for it. I would be ok to upload and merge patches to remove it from puppet completely and then give it back to you an... [18:45:37] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops: mw2256 - CPU/board hardware issue - https://phabricator.wikimedia.org/T263065 (10wiki_willy) Thanks @Dzahn , works for me. [18:46:21] RoanKattouw or @Amir1 is there any way you can query the value of $wgLogos on https://zh.wikipedia.beta.wmflabs.org/zh-hans/Main_Page for me? [18:46:53] sure [18:47:58] https://www.irccloud.com/pastebin/j2r4qOxD/ [18:48:03] Jdlrobson: ^ [18:48:14] hmm so the variants key is not setting for some reason [18:48:41] (from wmgSiteLogoVariants) [18:49:01] I'm not sure if it's deployed yet (https://integration.wikimedia.org/ci/job/beta-scap-eqiad/) [18:49:05] it seems broken [18:49:12] for seven hours now [18:49:23] ah ok well thanks for confirming Amir1 that the code is reflecting the config [18:50:35] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@8afe8d2]: mjolnir daemons update I336365 [18:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:37] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable and configure GrowthExperiments on plwiki (T254239) and ptwiki (T255027) (duration: 00m 56s) [18:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:43] T254239: Deploy Growth features on Polish Wikipedia - https://phabricator.wikimedia.org/T254239 [18:53:43] T255027: Deploy Growth features on Portuguese Wikipedia - https://phabricator.wikimedia.org/T255027 [18:53:57] (03PS1) 10Dzahn: 
mcrouter_wancache: replace broken mw2256 with mw2257 as proxy [puppet] - 10https://gerrit.wikimedia.org/r/628905 (https://phabricator.wikimedia.org/T263065) [18:53:59] (03CR) 10Bstorm: "480 locales with this package vs. 76 in the current setup." [puppet] - 10https://gerrit.wikimedia.org/r/628879 (https://phabricator.wikimedia.org/T263339) (owner: 10Bstorm) [18:54:08] Pchelolo: I'm done [18:54:36] thank you. I guess I can do it in 6 minutes [18:55:10] (03CR) 10Ppchelko: [C: 03+2] Increase timeouts for connection to eventgate to match envoy config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622863 (https://phabricator.wikimedia.org/T249745) (owner: 10Ppchelko) [18:55:51] (03Merged) 10jenkins-bot: Increase timeouts for connection to eventgate to match envoy config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622863 (https://phabricator.wikimedia.org/T249745) (owner: 10Ppchelko) [18:56:57] Sorry for taking so long :/ I had to fix a rebase conflict in the middle of that [18:57:29] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@8afe8d2]: mjolnir daemons update I336365 (duration: 06m 54s) [18:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:19] !log ppchelko@deploy1001 Synchronized wmf-config/CommonSettings.php: gerrit:622863 T249745 (duration: 00m 56s) [18:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:23] T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 [18:59:27] done [19:00:04] mdholloway: It is that lovely time of the day again! You are hereby commanded to deploy Push notification service and config (T262936). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200921T1900). 
[19:00:04] T262936: Enable app push notifications in production - https://phabricator.wikimedia.org/T262936 [19:01:57] (03CR) 10Dzahn: [C: 03+2] mcrouter_wancache: replace broken mw2256 with mw2257 as proxy [puppet] - 10https://gerrit.wikimedia.org/r/628905 (https://phabricator.wikimedia.org/T263065) (owner: 10Dzahn) [19:05:11] 10Operations, 10ops-codfw, 10serviceops: decommission mw2135-mw2147, mw2187-mw2214 - physical / datacenter part - https://phabricator.wikimedia.org/T261524 (10Papaul) [19:05:13] (03PS1) 10Dzahn: decom mw2256, remove from conftool and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/628911 (https://phabricator.wikimedia.org/T263065) [19:08:06] (03PS1) 10Razzi: Initial debianization of python-pid [debs/python-pid] - 10https://gerrit.wikimedia.org/r/628913 (https://phabricator.wikimedia.org/T262574) [19:08:21] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: mw2256 - CPU/board hardware issue - https://phabricator.wikimedia.org/T263065 (10Dzahn) a:05Papaul→03Dzahn [19:08:45] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: decom mw2256 (was: mw2256 - CPU/board hardware issue - https://phabricator.wikimedia.org/T263065 (10Dzahn) [19:09:06] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (10Dzahn) [19:10:02] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (10Dzahn) Just one thing here was a bit critical, this happened to also be an mcrouter proxy. (some appservers are, most are not) and th... 
[19:10:58] 10Operations, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to production shell and wmf ldap access for Razzi Abuissa - https://phabricator.wikimedia.org/T261443 (10hashar) [19:13:23] !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [19:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:23] (03PS1) 10Ottomata: TEST COMMIT [debs/python-pid] - 10https://gerrit.wikimedia.org/r/628916 [19:15:24] (03Abandoned) 10Ottomata: TEST COMMIT [debs/python-pid] - 10https://gerrit.wikimedia.org/r/628916 (owner: 10Ottomata) [19:15:55] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [19:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:09] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Initial debianization of python-pid [debs/python-pid] - 10https://gerrit.wikimedia.org/r/628913 (https://phabricator.wikimedia.org/T262574) (owner: 10Razzi) [19:19:21] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'push-notifications' for release 'main' . 
[19:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:35] (03CR) 10Mholloway: [C: 03+2] Add push-notifications service config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628885 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [19:23:17] (03Merged) 10jenkins-bot: Add push-notifications service config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628885 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [19:26:11] !log mholloway-shell@deploy1001 Synchronized wmf-config/LabsServices.php: Push notifications deployment (1/5) (duration: 00m 57s) [19:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:58] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [19:28:09] !log mholloway-shell@deploy1001 Synchronized wmf-config/ProductionServices.php: Push notifications deployment (2/5) (duration: 00m 57s) [19:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:46] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [19:29:52] (03PS5) 10Mholloway: Echo: Set up common push settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628341 (https://phabricator.wikimedia.org/T262936) [19:31:25] (03CR) 10Mholloway: [C: 03+2] Echo: Set up common push settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628341 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [19:32:15] (03Merged) 10jenkins-bot: Echo: Set up common push settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628341 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [19:34:10] !log mholloway-shell@deploy1001 Synchronized wmf-config/CommonSettings.php: Push 
notifications deployment (3/5) (duration: 00m 57s) [19:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:48] (03CR) 10Mholloway: [C: 03+2] Echo: Enable push on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628342 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [19:35:27] (03PS5) 10Mholloway: Echo: Enable push on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628342 (https://phabricator.wikimedia.org/T262936) [19:35:44] PROBLEM - Thanos sidecar cannot connect to Prometheus on icinga1001 is CRITICAL: cluster=prometheus instance=prometheus4001 job=thanos-sidecar prometheus=ops site=ulsfo https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar [19:36:56] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [19:37:26] (03PS5) 10Mholloway: Echo: Enable push on all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628343 (https://phabricator.wikimedia.org/T262936) [19:38:01] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Push notifications deployment (4/5) (duration: 00m 57s) [19:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:01] thanos alert is me please disregard [19:42:12] thanks [19:42:16] (03PS6) 10Dzahn: httpd/simplelamp2: add parameter to not purge manual config [puppet] - 10https://gerrit.wikimedia.org/r/597052 (https://phabricator.wikimedia.org/T169368) [19:43:04] (03PS1) 10Ebernhardson: enwiktionary completion search ranking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628923 [19:43:36] (03CR) 10jerkins-bot: [V: 04-1] enwiktionary completion search ranking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628923 (owner: 
10Ebernhardson) [19:46:07] !log moving prometheus instance from bast4002 to prometheus4001 T243057 [19:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:13] (03PS2) 10Ebernhardson: enwiktionary completion search ranking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628923 [19:46:13] T243057: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 [19:47:18] (03CR) 10DCausse: enwiktionary completion search ranking (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628923 (owner: 10Ebernhardson) [19:48:38] (03PS1) 10Razzi: Initial debianization of python-pid [debs/python-pid] (debian) - 10https://gerrit.wikimedia.org/r/628924 (https://phabricator.wikimedia.org/T262574) [19:48:48] (03PS1) 10Herron: prometheus: point prometheus.svc.ulsfo to prometheus4001 [dns] - 10https://gerrit.wikimedia.org/r/628925 (https://phabricator.wikimedia.org/T243057) [19:48:57] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Initial debianization of python-pid [debs/python-pid] (debian) - 10https://gerrit.wikimedia.org/r/628924 (https://phabricator.wikimedia.org/T262574) (owner: 10Razzi) [19:49:22] (03PS3) 10Ebernhardson: enwiktionary completion search ranking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628923 [19:49:24] (03CR) 10Ebernhardson: enwiktionary completion search ranking (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628923 (owner: 10Ebernhardson) [19:50:13] (03CR) 10Cwhite: "looking good!" (032 comments) [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/628090 (owner: 10Filippo Giunchedi) [19:51:16] RECOVERY - Thanos sidecar cannot connect to Prometheus on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar [19:51:36] (03CR) 10Herron: [C: 03+2] prometheus: point prometheus.svc.ulsfo to prometheus4001 [dns] - 10https://gerrit.wikimedia.org/r/628925 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [19:56:11] 10Operations, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to production shell and wmf ldap access for Razzi Abuissa - https://phabricator.wikimedia.org/T261443 (10hashar) Requested by @Ottomatta and as a Gerrit administrator, I have added @razzi to the Gerr... [20:00:04] chrisalbon and accraze: Dear deployers, time to do the Services – Graphoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200921T2000). [20:00:06] RECOVERY - Thanos query has high gRPC client errors on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [20:00:06] Echo push is deployed to testwiki. I'm going to leave it there for now for testing and to make sure we're not going to surprise the communities by promoting any further. [20:01:18] (03PS2) 10Dmaza: Enable watchlist expiry feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628927 (https://phabricator.wikimedia.org/T261249) [20:02:46] (03CR) 10Cwhite: [C: 03+1] "Increases load on the DNS servers, but is matching what the other PoP is doing. LGTM." 
[dns] - 10https://gerrit.wikimedia.org/r/628853 (owner: 10Herron) [20:03:02] (03CR) 10DCausse: [C: 03+1] enwiktionary completion search ranking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628923 (owner: 10Ebernhardson) [20:04:57] !log moving prometheus instance from bast3004 to prometheus3001 T243057 [20:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:02] PROBLEM - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus site=ulsfo https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [20:05:02] T243057: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 [20:05:06] PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:05:51] (03PS1) 10Razzi: Add python3-pid debian package to reportupdater [puppet] - 10https://gerrit.wikimedia.org/r/628929 (https://phabricator.wikimedia.org/T262574) [20:08:00] (03PS1) 10Ebernhardson: Remove pages from completion search by page id [extensions/CirrusSearch] (wmf/1.36.0-wmf.9) - 10https://gerrit.wikimedia.org/r/628821 [20:08:10] (03CR) 10Ebernhardson: [C: 03+2] Remove pages from completion search by page id [extensions/CirrusSearch] (wmf/1.36.0-wmf.9) - 10https://gerrit.wikimedia.org/r/628821 (owner: 10Ebernhardson) [20:08:50] (03CR) 10Razzi: "https://puppet-compiler.wmflabs.org/compiler1003/25247/an-launcher1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/628929 (https://phabricator.wikimedia.org/T262574) (owner: 10Razzi) [20:09:32] (03CR) 10Ottomata: [C: 03+1] Add python3-pid debian package to reportupdater [puppet] - 10https://gerrit.wikimedia.org/r/628929 (https://phabricator.wikimedia.org/T262574) (owner: 10Razzi) [20:09:39] (03CR) 10Razzi: [C: 03+2] Add python3-pid debian package to reportupdater [puppet] - 10https://gerrit.wikimedia.org/r/628929 (https://phabricator.wikimedia.org/T262574) (owner: 10Razzi) [20:17:11] 10Operations, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to production shell and wmf ldap access for Razzi Abuissa - https://phabricator.wikimedia.org/T261443 (10Ottomata) Thank you! 
[20:21:01] 10Operations, 10Analytics: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10CDanis) [20:22:33] 10Operations, 10Product-Infrastructure-Data, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10CDanis) [20:23:07] (03CR) 10Tjones: [C: 03+1] "I don't have +2 on this repo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628923 (owner: 10Ebernhardson) [20:24:24] RECOVERY - Check systemd state on ms-be2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:25:16] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [20:25:57] PROBLEM - Thanos sidecar cannot connect to Prometheus on icinga1001 is CRITICAL: cluster=prometheus instance=prometheus3001 job=thanos-sidecar prometheus=ops site=esams https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar [20:26:14] RECOVERY - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [20:30:57] 10Operations, 10SRE-Access-Requests: Allow Nicholas Skaggs to issue icinga commands - https://phabricator.wikimedia.org/T263191 (10crusnov) I see the patch is already merged, @nskaggs please test icinga command and followup so we can close ticket. Thanks! 
[20:33:02] 10Operations, 10SRE-Access-Requests, 10cloud-services-team (Kanban): wikitech-static access for Sam Reed - https://phabricator.wikimedia.org/T262468 (10crusnov) This has been approved in the team meeting and followed up on IRC. Let us know if there's anything that needs to be done further. [20:35:19] (03Merged) 10jenkins-bot: Remove pages from completion search by page id [extensions/CirrusSearch] (wmf/1.36.0-wmf.9) - 10https://gerrit.wikimedia.org/r/628821 (owner: 10Ebernhardson) [20:35:48] (03PS2) 10CDanis: geo-resources: create text-next for NEL [dns] - 10https://gerrit.wikimedia.org/r/626656 (https://phabricator.wikimedia.org/T261340) (owner: 10BBlack) [20:35:50] (03PS1) 10CDanis: point intake-logging.wikimedia.org to text-next (second-best DC) [dns] - 10https://gerrit.wikimedia.org/r/628935 (https://phabricator.wikimedia.org/T261340) [20:39:20] (03PS4) 10Ebernhardson: enwiktionary completion search ranking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628923 [20:39:28] (03CR) 10Ebernhardson: [C: 03+2] enwiktionary completion search ranking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628923 (owner: 10Ebernhardson) [20:39:39] (03PS3) 10CDanis: geo-resources: create text-next for NEL [dns] - 10https://gerrit.wikimedia.org/r/626656 (https://phabricator.wikimedia.org/T261340) (owner: 10BBlack) [20:39:41] (03PS2) 10CDanis: point intake-logging.wikimedia.org to text-next (second-best DC) [dns] - 10https://gerrit.wikimedia.org/r/628935 (https://phabricator.wikimedia.org/T261340) [20:40:40] (03Merged) 10jenkins-bot: enwiktionary completion search ranking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628923 (owner: 10Ebernhardson) [20:47:10] !log ebernhardson@deploy1001 Synchronized php-1.36.0-wmf.9/extensions/CirrusSearch/: Remove pages from completion search by page id (duration: 01m 00s) [20:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:52] !log ebernhardson@deploy1001 Synchronized 
wmf-config/InitialiseSettings.php: adjust enwiktionary completion search ranking (duration: 00m 57s) [20:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:08] I'm all done [20:54:31] (03PS1) 10Herron: role::bastionhost::pop: remove prometheus instances [puppet] - 10https://gerrit.wikimedia.org/r/628940 (https://phabricator.wikimedia.org/T243057) [20:59:01] (03CR) 10Herron: "fwiw looked into moving these over to role::bastionhost::general but I think we'd break tftp in the pops doing that" [puppet] - 10https://gerrit.wikimedia.org/r/628940 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [20:59:53] (03PS1) 10Ebernhardson: Adjust enwiktionary completion search ranking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628941 [21:00:04] Reedy and sbassett: #bothumor I ♥ Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200921T2100). [21:03:00] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 104.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [21:11:37] (03PS1) 10Alexandros Kosiaris: otrs: Support OTRS6 in otrs.TicketExport2Mbox.pl [puppet] - 10https://gerrit.wikimedia.org/r/628945 [21:12:00] (03PS1) 10Andrew Bogott: ceph: add firewall rules for cloudcontroller nodes [puppet] - 10https://gerrit.wikimedia.org/r/628946 (https://phabricator.wikimedia.org/T263461) [21:12:21] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [21:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:19] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/25244/" [puppet] - 10https://gerrit.wikimedia.org/r/623464 
(owner: 10Dzahn) [21:13:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] otrs: Support OTRS6 in otrs.TicketExport2Mbox.pl [puppet] - 10https://gerrit.wikimedia.org/r/628945 (owner: 10Alexandros Kosiaris) [21:13:47] (03CR) 10Andrew Bogott: [C: 03+2] ceph: add firewall rules for cloudcontroller nodes [puppet] - 10https://gerrit.wikimedia.org/r/628946 (https://phabricator.wikimedia.org/T263461) (owner: 10Andrew Bogott) [21:14:12] akosiaris: m-m-m-multi merge. i can do both? [21:15:28] mutante: please do :-) [21:15:38] * mutante triple merges [21:17:25] (03PS3) 10Dzahn: add dns-disc for releases servers [dns] - 10https://gerrit.wikimedia.org/r/623465 [21:18:09] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:04] ACKNOWLEDGEMENT - Host mw2256.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn broken hardware - known [21:21:50] (03PS1) 10Cwhite: to be abandoned: disable logrotate on graphite [puppet] - 10https://gerrit.wikimedia.org/r/628948 [21:22:43] (03CR) 10Dzahn: [C: 03+2] decom mw2256, remove from conftool and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/628911 (https://phabricator.wikimedia.org/T263065) (owner: 10Dzahn) [21:23:51] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [21:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:53] (03CR) 10Cwhite: [C: 03+1] graphite-carbon: disable internal log rotation and use logrotate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/628423 (https://phabricator.wikimedia.org/T263103) (owner: 10Herron) [21:25:24] (03Abandoned) 10Cwhite: to be abandoned: disable logrotate on graphite [puppet] - 10https://gerrit.wikimedia.org/r/628948 (owner: 10Cwhite) [21:26:24] (03CR) 10Herron: "ha! amazing. 
thanks for having a look" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/628423 (https://phabricator.wikimedia.org/T263103) (owner: 10Herron) [21:26:37] volans: fyi, just used decom cookbook.. it is at the "generating DNS records step [21:28:53] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [21:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:00] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2256.codfw.wmnet` - mw2256.codfw.wmnet... [21:30:30] volans: is this kind of change affected by netbox generation? https://gerrit.wikimedia.org/r/c/operations/dns/+/623465/3/templates/wmnet [21:31:55] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: 220 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:33:15] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: (C)100 gt (W)50 gt 6 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:33:38] (03PS1) 10Dzahn: decom mw2256.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/628949 (https://phabricator.wikimedia.org/T263065) [21:34:55] (03CR) 10Dzahn: [C: 03+2] decom mw2256.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/628949 (https://phabricator.wikimedia.org/T263065) (owner: 10Dzahn) [21:40:49] (03PS1) 10Mholloway: Push notifications: Increase log level to debug for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/628951 (https://phabricator.wikimedia.org/T262936) [21:45:28] (03CR) 
10Mholloway: [C: 03+2] Push notifications: Increase log level to debug for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/628951 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [21:47:43] (03Merged) 10jenkins-bot: Push notifications: Increase log level to debug for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/628951 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [21:51:17] !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [21:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:45] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is OK: (C)100 gt (W)80 gt 52.88 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [21:54:31] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [21:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:44] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (10Dzahn) @wiki_willy @Papaul Removed from everything and ran the decom cookbook on it. It's now in state decom in netbox. https://net... 
[21:56:04] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (10Dzahn) a:05Dzahn→03None [21:56:13] PROBLEM - glance-api http on cloudcontrol1003 is CRITICAL: HTTP CRITICAL: HTTP/1.0 503 Service Unavailable - 212 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:56:23] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [21:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:24] ACKNOWLEDGEMENT - glance-api http on cloudcontrol1003 is CRITICAL: HTTP CRITICAL: HTTP/1.0 503 Service Unavailable - 212 bytes in 0.001 second response time andrew bogott T263461 in progress https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:57:27] 10Operations, 10SRE-Access-Requests: Allow Nicholas Skaggs to issue icinga commands - https://phabricator.wikimedia.org/T263191 (10Dzahn) 05Open→03Resolved a:03Dzahn He already confirmed it as well :) 17:43 < balloons> mutante, I successfully scheduled some downtime. Thanks! 
[21:58:16] (03PS1) 10Andrew Bogott: OpenStack Glance: fixes to glance-api.conf [puppet] - 10https://gerrit.wikimedia.org/r/628953 (https://phabricator.wikimedia.org/T263461) [21:59:04] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Glance: fixes to glance-api.conf [puppet] - 10https://gerrit.wikimedia.org/r/628953 (https://phabricator.wikimedia.org/T263461) (owner: 10Andrew Bogott) [21:59:39] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:00:17] RECOVERY - glance-api http on cloudcontrol1003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 1071 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:01:07] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:05:52] (03CR) 10Dzahn: [C: 03+2] add dns-disc for releases servers [dns] - 10https://gerrit.wikimedia.org/r/623465 (owner: 10Dzahn) [22:05:57] (03PS4) 10Dzahn: add dns-disc for releases servers [dns] - 10https://gerrit.wikimedia.org/r/623465 [22:11:58] 10Operations, 10ops-codfw, 10serviceops: decommission mw2135-mw2147, mw2187-mw2214 - physical / datacenter part - https://phabricator.wikimedia.org/T261524 (10Papaul) [22:12:15] (03PS1) 10Dzahn: service: add trailing slash to monitoring check for releases.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/628954 [22:12:48] (03CR) 10Dzahn: [C: 03+2] service: add trailing slash to monitoring check for releases.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/628954 (owner: 10Dzahn) [22:18:13] (03CR) 10Dzahn: "after this all green on https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=releases&scroll=692" [puppet] - 10https://gerrit.wikimedia.org/r/628954 (owner: 
10Dzahn) [22:20:44] !log releases.wikimedia.org has been converted to an active-active service with geodns/ backends in both DCs [22:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:14] PROBLEM - Check systemd state on prometheus3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:33:19] (03Abandoned) 10Ebernhardson: Adjust enwiktionary completion search ranking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628941 (owner: 10Ebernhardson) [22:47:44] (03PS1) 10Mholloway: Push notifications: Increase log level to trace for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/628962 (https://phabricator.wikimedia.org/T262936) [22:48:21] (03PS1) 10Dzahn: service/planet: turn planet into an active-active service using discovery [puppet] - 10https://gerrit.wikimedia.org/r/628963 [22:50:44] (03CR) 10Mholloway: [C: 03+2] Push notifications: Increase log level to trace for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/628962 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [22:52:55] (03PS1) 10Dzahn: add discovery records for planet [dns] - 10https://gerrit.wikimedia.org/r/628964 [22:53:10] (03Merged) 10jenkins-bot: Push notifications: Increase log level to trace for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/628962 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [22:55:16] !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [22:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:02] (03PS1) 10Dzahn: switch debmonitor to discovery records [dns] - 10https://gerrit.wikimedia.org/r/628965 [22:57:15] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'push-notifications' for release 'main' . 
[22:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:10] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [22:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Evening backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200921T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:05:28] (03PS1) 10Dzahn: service/debmonitor: turn debmonitor into an active-active service [puppet] - 10https://gerrit.wikimedia.org/r/628966 [23:08:36] (03CR) 10Samwilson: [C: 03+1] Enable watchlist expiry feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628927 (https://phabricator.wikimedia.org/T261249) (owner: 10Dmaza) [23:12:40] PROBLEM - Check systemd state on ms-be2052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:24:01] (03PS2) 10Dzahn: service/planet: turn planet into an active-active service using discovery [puppet] - 10https://gerrit.wikimedia.org/r/628963 (https://phabricator.wikimedia.org/T263506) [23:24:04] (03PS2) 10Dzahn: add discovery records for planet [dns] - 10https://gerrit.wikimedia.org/r/628964 (https://phabricator.wikimedia.org/T263506) [23:24:09] (03PS2) 10Dzahn: switch debmonitor to discovery records [dns] - 10https://gerrit.wikimedia.org/r/628965 (https://phabricator.wikimedia.org/T263506) [23:24:18] (03PS2) 10Dzahn: service/debmonitor: turn debmonitor into an active-active service [puppet] - 10https://gerrit.wikimedia.org/r/628966 (https://phabricator.wikimedia.org/T263506) [23:24:46] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T263506" [dns] - 10https://gerrit.wikimedia.org/r/623465 (owner: 10Dzahn) [23:24:50] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T263506" [puppet] - 10https://gerrit.wikimedia.org/r/623464 (owner: 10Dzahn) [23:27:02] (03PS1) 10Mholloway: Push notifications: Drop log level to debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/628967 [23:29:51] (03CR) 10Mholloway: [C: 03+2] Push notifications: Drop log level to debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/628967 (owner: 10Mholloway) [23:30:06] (03CR) 10Mholloway: [C: 04-2] Push notifications: Drop log level to debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/628967 (owner: 10Mholloway) [23:30:26] (03PS2) 10Mholloway: Push notifications: Drop log level to debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/628967 [23:32:06] ACKNOWLEDGEMENT - HTTPS-planet on en.planet.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2020-10-18 09:02:07 +0000 (expires in 26 days) daniel_zahn https://phabricator.wikimedia.org/T261419 https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org [23:32:07] (03PS1) 10Jeena Huneidi: This is a test 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/628968 [23:32:53] (03CR) 10Jeena Huneidi: [C: 04-2] This is a test [deployment-charts] - 10https://gerrit.wikimedia.org/r/628968 (owner: 10Jeena Huneidi) [23:32:55] (03CR) 10Mholloway: [C: 03+2] Push notifications: Drop log level to debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/628967 (owner: 10Mholloway) [23:33:31] ACKNOWLEDGEMENT - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2020-10-18 09:02:07 +0000 (expires in 26 days) daniel_zahn https://phabricator.wikimedia.org/T261419 https://phabricator.wikimedia.org/tag/phabricator/ [23:35:10] (03Merged) 10jenkins-bot: Push notifications: Drop log level to debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/628967 (owner: 10Mholloway) [23:36:23] !log debmonitor2002 - systemctl reset-failed [23:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:40] RECOVERY - Check systemd state on debmonitor2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:37:39] !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [23:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:42] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [23:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:28] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/25246/" [puppet] - 10https://gerrit.wikimedia.org/r/597052 (https://phabricator.wikimedia.org/T169368) (owner: 10Dzahn) [23:42:32] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'push-notifications' for release 'main' . 
[23:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:02] (03PS2) 10Dzahn: openstack: replace remaining hiera() that had default values [puppet] - 10https://gerrit.wikimedia.org/r/627967 [23:52:20] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (10Papaul) @Dzahn Thanks [23:53:55] (03PS1) 10Dzahn: kafka::certificate: add data types, hiera()->lookup() [puppet] - 10https://gerrit.wikimedia.org/r/628969