[00:09:05] (03PS3) 10Krinkle: Enable "coalesceKeys" for global keys for WANCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598855 (https://phabricator.wikimedia.org/T252564) (owner: 10Aaron Schulz) [00:09:19] (03CR) 10Krinkle: [C: 03+1] Enable "coalesceKeys" for global keys for WANCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598855 (https://phabricator.wikimedia.org/T252564) (owner: 10Aaron Schulz) [00:09:33] (03PS4) 10Krinkle: Enable "coalesceKeys" for global keys for WANCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598855 (https://phabricator.wikimedia.org/T252564) (owner: 10Aaron Schulz) [00:16:06] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Papaul) [00:19:01] (03CR) 10Krinkle: [C: 03+2] Enable "coalesceKeys" for global keys for WANCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598855 (https://phabricator.wikimedia.org/T252564) (owner: 10Aaron Schulz) [00:19:46] (03Merged) 10jenkins-bot: Enable "coalesceKeys" for global keys for WANCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598855 (https://phabricator.wikimedia.org/T252564) (owner: 10Aaron Schulz) [00:20:58] AaronSchulz: staged on mwdebug1002 [00:21:57] ok [00:23:01] I think I just experienced the first time ever in 10 years my user action in prod being denied with a lag warning [00:23:38] I saved my preferences on test2wiki and got this: https://usercontent.irccloud-cdn.com/file/AHG0FHq8/db-readonly.png [00:23:43] also interesting footer appendix [00:23:47] never actaully seen that in prod before [00:23:50] coincidence? 
[00:24:18] twice in a row [00:24:25] getting suspicious [00:24:38] Krinkle: could be something like "busyValue" [00:24:53] we are the only ones populating the cache atm [00:25:20] (03CR) 10Papaul: [C: 04-1] "You need to remove also asset tag entries in the wmnet file" [dns] - 10https://gerrit.wikimedia.org/r/602013 (https://phabricator.wikimedia.org/T247018) (owner: 10Dzahn) [00:25:24] I am unable to edit anything via mwdebug [00:25:28] for several seconds [00:25:28] edit form looks normal to me [00:25:34] Warning: The database has been locked for maintenance, so you will not be able to save your edits right now. You may wish to copy and paste your text into a text file and save it for later. [00:25:38] https://test2.wikipedia.org/w/index.php?title=Sandbox&action=edit [00:25:44] still to this minute [00:25:47] * AaronSchulz is on enwiki [00:26:13] I was able to edit test2 - https://test2.wikipedia.org/w/index.php?title=Sandbox&diff=431348&oldid=431347 [00:26:25] DannyS712: XWD with mwdebug1002? [00:26:46] test2 wfm too [00:26:54] (edited my user page) [00:27:49] tried again with XWD with `mwdebug1002.eqiad.wmnet` and it worked [00:27:54] yeah it's over now [00:28:27] AaronSchulz: is lag-based readonly mode induced by default if it can't compute it due to busyValue? That feels like a dangerous default. [00:28:41] I'm gonna undo and re-apply in a minute to see if it happens again [00:29:25] I've pulled down wmf-config from before this patch on mwdebug1002 now [00:31:46] and applying again now [00:32:40] Again database being locked [00:32:54] continuously and consistently for at least a full minute [00:33:01] footer: Warning: Page may not contain recent updates. [00:33:05] The system administrator who locked it offered this explanation: The database is read-only until replication lag decreases. 
[00:33:10] can't do anything [00:34:05] I'll do it one more time with verbose logging enabled to see if anything stands out in the logs [00:35:38] I saw the lock notice for a bit, gone now [00:38:12] I've pulled down wmf-config from before this patch on mwdebug1002 (again) [00:42:31] and applying again [00:45:53] https://logstash.wikimedia.org/app/kibana#/dashboard/x-debug?_g=(time:(from:now-1h,mode:quick,to:now))&_a=(query:(query_string:(query:%27reqId:%224c083eba-f1dc-4ece-b691-ec06cd402159%22%27))) [00:47:07] I don't see anything about how it concluded it is lagged [00:47:12] but it does say "Lagged DB used; CDN cache TTL limited to 30 seconds" [00:50:28] another one via reqId:"db0cfc97-607b-4192-88fd-2e0857eea904" [00:50:35] first one is special preferences, the other action=edit [00:50:40] in case it has better logs [00:51:00] PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:51:08] fetchOrRegenerate(global:rdbms-server-states:1:db1120:0-1-2): miss, new value computed [00:51:14] I saw this on both of them [00:51:19] so that suggests it did compute a new value [00:51:28] I'm assuming this key relates to the read-only lagged stuff, right? [00:52:09] es [00:52:12] *yes [00:53:19] fetchOrRegenerate(global:rdbms-server-states:1:db1086:0-1-2-3-4-5-6): miss, new value computed [00:53:28] earlier on from the same request [00:54:36] 'lagTimes' => array_fill_keys( $serverIndexes, 0 ), [00:54:44] this suggests, that as expected, lagTimes defaults to 0 [00:54:49] which is busyValue [00:55:17] I don't know if it hit that case, but I'd expect it to behave as normal if it did, not as read-only, right? 
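A minimal Python sketch of the busy-value idea discussed above, assuming the usual meaning of the term: when a key is missing and another caller already holds the regeneration lock, the caller gets a placeholder back instead of waiting. This is not MediaWiki's WANObjectCache code. `TinyCache`, `get_with_set_callback` and `placeholder_states` are hypothetical names invented for illustration; only the `lagTimes` default of 0 per server index comes from the log.

```python
import threading


def placeholder_states(server_indexes):
    """Hypothetical busy value: every server is assumed to have zero lag."""
    return {"lagTimes": {index: 0 for index in server_indexes}}


class TinyCache:
    """Toy in-process cache with a per-key regeneration lock (illustration only)."""

    def __init__(self):
        self._values = {}
        self._locks = {}
        self._guard = threading.Lock()

    def lock_for(self, key):
        # One lock object per key, created on first use.
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get_with_set_callback(self, key, callback, busy_value=None):
        if key in self._values:
            return self._values[key]          # cache hit
        lock = self.lock_for(key)
        if not lock.acquire(blocking=False):
            # Another caller is regenerating this key right now.
            if busy_value is not None:
                return busy_value             # placeholder, not stored in the cache
            return callback()                 # no placeholder given: compute, do not store
        try:
            value = callback()                # miss, new value computed
            self._values[key] = value
            return value
        finally:
            lock.release()


if __name__ == "__main__":
    cache = TinyCache()
    states = cache.get_with_set_callback(
        "rdbms-server-states", lambda: {"lagTimes": {0: 0, 1: 0.4}},
        busy_value=placeholder_states([0, 1]))
    print(states)
```

Under these assumptions a caller that receives the placeholder sees zero lag on every server, which is why the busy value on its own would be expected to look like a healthy cluster rather than a read-only one.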
[00:55:23] I'd rather not incur a global 2-min readonly :) [00:57:23] fetchOrRegenerate(global:rdbms-server-readonly:db1123:test2wiki): hit with async refresh [00:57:28] getWithSetCallback(global:rdbms-server-readonly:db1123:test2wiki): process cache hit [00:57:29] getWithSetCallback(global:rdbms-server-readonly:db1123:test2wiki): process cache hit [00:57:44] that was on the action=edit, which was the second request [00:58:22] my special:pref request 3 seconds earlier got the same though: [00:58:23] fetchOrRegenerate(global:rdbms-server-readonly:db1123:test2wiki): hit with async refresh [00:58:32] so I guess someone else had it populated already [00:58:47] OK, I'll try one more time flip-flopping and trying to be one with verbose logging on. [00:58:58] maybe you two can flip that on as well if you're browsing around :) [01:03:22] okay I got the miss captured now [01:03:29] AaronSchulz: any theories so far? [01:10:55] Krinkle: I wonder if multiple requests (e.g. assets cause some fun with busyvalue/locktse) [01:15:41] AaronSchulz: https://etherpad.wikimedia.org/p/CZg-suME0Sx-nnCfKIEW [01:16:43] AaronSchulz: but where are the supposed lag values coming from? [01:16:45] probably more busyValue than lockTSE. The main Doc should be first-ish, though maybe some async locks trigger busyValue. It should cache in APC for 60 min and reuse if possible though (and staleness does not count as lag). Still odd. [01:20:41] * AaronSchulz checks https://test2.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb= [01:21:04] OK, I'll do it again [01:21:46] I've pulled back the change, waiting 1+min [01:25:25] and applied again [01:25:34] able to reproduce the locked preferences page [01:25:58] dbrepllag API consistently 0 for the master and < 0.9 for the others; nothing stands out [01:33:30] Krinkle: I guess we can try later? 
[01:33:40] Wikimedia\Rdbms\LoadMonitor::getServerStates: regenerated 'db1123' cluster status [01:33:52] I didn't realise before that "cluster status" was the detailed logging from these cases [01:34:07] so we can see it's not hitting busyValue etc.; it's doing a normal fresh computation during this request [01:34:14] AaronSchulz: Yeah, I'll roll back for now [01:34:23] getting late here :) [01:34:48] it's probably due to low traffic or something, but I don't really understand it atm [01:35:32] I'd expect less than a minute of oddness (maybe a few seconds of fleeting read-only) [01:35:49] (03PS1) 10Krinkle: Revert "Enable "coalesceKeys" for global keys for WANCache" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604211 [01:36:02] (03CR) 10Krinkle: [C: 03+2] Revert "Enable "coalesceKeys" for global keys for WANCache" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604211 (owner: 10Krinkle) [01:36:50] (03Merged) 10jenkins-bot: Revert "Enable "coalesceKeys" for global keys for WANCache" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604211 (owner: 10Krinkle) [01:37:33] AaronSchulz: we don't send first byte until server is completely done, the secondary requests started 1-2 seconds after the main one was finished. 
[01:37:44] and it seems to have all the clean cases for miss and perfect condition [01:37:51] so definitely something up [01:38:16] so in general for cdn-miss, there is no overlap between html response and asset requests, except post-send maybe [01:44:24] I added a bit more to the etherpad for the next GET request, which also hit the fresh state as it picked a different db [01:44:37] and that request also hit the same lagged state [01:44:43] okay, good night :) [02:05:46] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 96 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:00:10] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Vasanthi Hargyono - https://phabricator.wikimedia.org/T254961 (10vhargyono-WMF) [04:54:01] (03PS1) 10Marostegui: db2113: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/604240 (https://phabricator.wikimedia.org/T251570) [04:57:52] (03CR) 10Marostegui: [C: 03+2] db2113: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/604240 (https://phabricator.wikimedia.org/T251570) (owner: 10Marostegui) [05:10:54] !log Deploy schema change on s3 master with 2 minutes sleep between wikis - T206103 [05:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:59] T206103: recentchanges table indexes: tmp1, tmp2 and tmp3 - https://phabricator.wikimedia.org/T206103 [06:04:14] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Vasanthi Hargyono - https://phabricator.wikimedia.org/T254961 (10Aklapper) Hi @vhargyono-WMF! Do you already have shell access? [06:05:30] 10Operations, 10Analytics, 10Analytics-Kanban, 10observability, 10Patch-For-Review: systemd::syslog conf should use :programname equals instead of startswith - https://phabricator.wikimedia.org/T251606 (10elukey) @Ottomata I realized today that the issue pointed out by Marcel during standup (namely logs... 
[06:08:08] RECOVERY - Check systemd state on an-launcher1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:21:16] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 60, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:21:48] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:22:12] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:26:45] there is a Telia and a Zayo maintenance scheduled [06:27:52] ah no in this case it is all Telia [06:28:04] ulsfo <-> eqord [06:28:11] codfw <-> eqord [06:28:37] but all mentioned in the maintenance msg, so goood [06:34:35] (03CR) 10JMeybohm: [C: 03+1] prometheus: enable Thanos upload for k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602715 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [06:42:34] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (10tstarling) I'll take that as a reminder to check the MediaWiki version i... 
[06:53:00] !log trunk public vlan to ulsfo ganeti hosts - T254157 [06:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:04] T254157: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 [07:04:31] (03CR) 10Muehlenhoff: [C: 03+2] wmf_auto_reimage: Use systemd unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/604017 (owner: 10Muehlenhoff) [07:05:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2113 after on-site maintenance T251570', diff saved to https://phabricator.wikimedia.org/P11438 and previous config saved to /var/cache/conftool/dbconfig/20200610-070508-marostegui.json [07:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:12] T251570: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 [07:08:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1103 for reimage - T253217', diff saved to https://phabricator.wikimedia.org/P11439 and previous config saved to /var/cache/conftool/dbconfig/20200610-070822-marostegui.json [07:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:27] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 [07:10:30] (03PS1) 10Marostegui: mariadb: Move db1103 to x1 [puppet] - 10https://gerrit.wikimedia.org/r/604284 (https://phabricator.wikimedia.org/T253217) [07:12:02] (03CR) 1020after4: [C: 03+1] ATS/phabricator: directly talk wss:// to aphlict [puppet] - 10https://gerrit.wikimedia.org/r/569104 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [07:15:23] !log upgrade remaining API servers in eqiad to PHP 7.2.31 [07:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:20] !log trunk public vlan to eqsin ganeti hosts - T254157 [07:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:23] T254157: esams,ulsfo,eqsin: one VM 
request each for install_servers - https://phabricator.wikimedia.org/T254157 [07:26:50] !log trunk public vlan to esams ganeti hosts - T254157 [07:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:54] T254157: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 [07:28:35] 10Operations, 10vm-requests, 10Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (10ayounsi) a:05ayounsi→03akosiaris All yours! [07:30:41] (03CR) 10JMeybohm: [C: 03+2] eventstreams: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/602060 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [07:31:12] (03Merged) 10jenkins-bot: eventstreams: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/602060 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [07:31:54] !log upgrade mw1298-mw1309 (job runners) to PHP 7.2.31 [07:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:47] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [07:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:04] !log make asw2-ulsfo interfaces Homer like - T250429 [07:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:08] T250429: Homer: Netbox driven switch interfaces - https://phabricator.wikimedia.org/T250429 [07:36:13] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'production' . 
[07:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:04] (03CR) 10Ema: [C: 03+2] cache: remove director and other legacy directives [puppet] - 10https://gerrit.wikimedia.org/r/604048 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [07:43:01] 10Puppet, 10Wikimedia Meet, 10Patch-For-Review: Puppetize the meet account manager - https://phabricator.wikimedia.org/T251034 (10Dzahn) [07:49:32] (03PS1) 10Ema: cloud hieradata: specify etcd::autogen_pwd_seed [puppet] - 10https://gerrit.wikimedia.org/r/604294 [07:51:58] 10Operations: debian-installer: partman doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db1077.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-rei... [07:52:41] !log reimaging db1077 T252027 [07:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:38] T252027: debian-installer: partman doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 [07:57:46] (03PS1) 10Ema: cloud: add traffic-cache-atstext-buster hieradata [puppet] - 10https://gerrit.wikimedia.org/r/604297 [08:03:17] (03Restored) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/597998 (owner: 10Hashar) [08:03:22] (03CR) 10Hashar: "recheck" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/597998 (owner: 10Hashar) [08:03:24] (03CR) 10Dzahn: "@Papaul Ok, i will amend. 
I am doing it this way out of habit from the workflows when these were left at first, but i know in this case it" [dns] - 10https://gerrit.wikimedia.org/r/602013 (https://phabricator.wikimedia.org/T247018) (owner: 10Dzahn) [08:04:54] 10Operations: debian-installer: partman doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1077.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1077.eqiad.wmnet'] ` [08:04:54] (03PS1) 10Ema: cloud: add traffic-cache-atsupload-buster hieradata [puppet] - 10https://gerrit.wikimedia.org/r/604300 [08:05:27] (03PS2) 10Dzahn: remove mgmt IPs for mw2150 through mw2186 [dns] - 10https://gerrit.wikimedia.org/r/602013 (https://phabricator.wikimedia.org/T247018) [08:05:41] (03CR) 10Ema: [C: 03+2] cloud hieradata: specify etcd::autogen_pwd_seed [puppet] - 10https://gerrit.wikimedia.org/r/604294 (owner: 10Ema) [08:05:54] (03CR) 10Ema: [C: 03+2] cloud: add traffic-cache-atstext-buster hieradata [puppet] - 10https://gerrit.wikimedia.org/r/604297 (owner: 10Ema) [08:06:08] (03CR) 10Ema: [C: 03+2] cloud: add traffic-cache-atsupload-buster hieradata [puppet] - 10https://gerrit.wikimedia.org/r/604300 (owner: 10Ema) [08:06:41] (03CR) 10Dzahn: [C: 03+2] remove mgmt IPs for mw2150 through mw2186 [dns] - 10https://gerrit.wikimedia.org/r/602013 (https://phabricator.wikimedia.org/T247018) (owner: 10Dzahn) [08:06:45] (03PS3) 10Dzahn: remove mgmt IPs for mw2150 through mw2186 [dns] - 10https://gerrit.wikimedia.org/r/602013 (https://phabricator.wikimedia.org/T247018) [08:08:58] 10Operations, 10ops-codfw, 10decommission, 10serviceops, 10Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (10Dzahn) @Papaul done! Do you still have anything to do here on your side? 
[08:10:00] (03PS4) 10Dzahn: base/monitoring: allow setting different contactgroup for systemd [puppet] - 10https://gerrit.wikimedia.org/r/602052 [08:12:45] 10Operations: debian-installer: partman doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db1077.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-rei... [08:12:56] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/597998 (owner: 10Hashar) [08:13:14] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [08:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:21] (03Abandoned) 10Hashar: Update debian/changelog to point to Buster [debs/poolcounter-prometheus-exporter] - 10https://gerrit.wikimedia.org/r/553736 (owner: 10Hashar) [08:13:40] (03CR) 10Kormat: [C: 03+1] mariadb: Move db1103 to x1 [puppet] - 10https://gerrit.wikimedia.org/r/604284 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui) [08:14:00] (03Abandoned) 10Hashar: Initial debianization [software/keyholder] (debian) - 10https://gerrit.wikimedia.org/r/588055 (https://phabricator.wikimedia.org/T203003) (owner: 10Hashar) [08:14:56] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/23138/labstore1006.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/602052 (owner: 10Dzahn) [08:14:57] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [08:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:59] !log kormat@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [08:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:22] ಠ_ಠ wmf-auto-reimage --no-downtime failed on.. 
downtime [08:16:26] (03PS2) 10Ema: Switch backend for piwik.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/603366 (https://phabricator.wikimedia.org/T252740) (owner: 10Elukey) [08:17:20] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [08:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:40] 10Operations, 10Maps (Maps-data), 10Product-Infrastructure-Team-Backlog (Kanban): Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939 (10Gehel) a:05Gehel→03RKemper [08:17:43] kormat: yes, and it's not a contradiction [08:18:06] volans: this is surprising. explain :) [08:18:23] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch: Use SSL certificates with discovery entry for elasticsearch - https://phabricator.wikimedia.org/T162037 (10Gehel) a:05Gehel→03None [08:19:19] (03PS1) 10Vgutierrez: cloud: Add prioritize_chacha to traffic-cache-atsupload-buster [puppet] - 10https://gerrit.wikimedia.org/r/604303 [08:19:23] so, --no-downtime doesn't perform the first downtime, *before* the reimage. But once we are past d-i and start the first puppet run, we *always* downtime the whole host because during the first puppet run things will not yet be ready but the checks will be already exported to puppetdb and picked by icinga at a random time [08:19:43] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch: Make elasticsearch configuration more robust to loss of network connectivity - https://phabricator.wikimedia.org/T143552 (10Gehel) a:05Gehel→03None [08:19:54] so the script launches in background (with a small delay) a downtime cookbook with the option to force the puppet run on the icinga host to pick the new checks [08:19:56] hum. 
sounds like `--no-downtime` should be named `--no-initial-downtime` or something [08:19:58] and then performs the downtime [08:20:24] all the options and sequence of actions are documented in https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reimage [08:20:45] any idea why downtiming failed? https://phabricator.wikimedia.org/P11440 [08:21:53] the downtime script on icinga host exited with 2 [08:21:57] (03CR) 10Vgutierrez: [C: 03+2] cloud: Add prioritize_chacha to traffic-cache-atsupload-buster [puppet] - 10https://gerrit.wikimedia.org/r/604303 (owner: 10Vgutierrez) [08:22:11] 10Operations: debian-installer: partman doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1077.eqiad.wmnet'] ` and were **ALL** successful. [08:22:28] the usual culprits are: 1) no definition of the hosts in icinga, usually due to not matching anything in site.pp (that will also make puppet fail) [08:22:48] 2) unable to run puppet on the icinga host even with the 30 attempts, hence it times out [08:23:52] but let me check [08:25:22] i've updated the wiki page to mention the caveat about --no-downtime [08:25:35] kormat: in this case the puppet run failed [08:27:33] how can you tell? [08:27:36] line 28 of your paste [08:27:51] ah, i see [08:27:52] also the fact that the exception was right after "Running Puppet..." [08:28:13] the timestamp of the irc notification exactly matches this in /var/log/puppet.log: [08:28:24] `Jun 10 08:14:59 icinga1001 puppet-agent[106079]: Run of Puppet configuration client already in progress; skipping (/var/lib/puppet/state/agent_catalog_run.lock exists)` [08:28:45] there should be multiple of them [08:28:56] as the script should retry [08:29:05] the puppet runs on icinga are super slow [08:30:32] volans: should a single failure cause a traceback to be shown? 
[08:30:44] it's not a single failure [08:31:12] the cookbook runs 'run-puppet-agent --quiet --attempts 30' and that fialed [08:31:15] *failed [08:31:43] i get that it is supposed to try 30 times [08:31:51] but i'm trying to confirm that it actually did that [08:31:54] I'm checking if by any chance that behaviour is not what we expect anymore in the run-puppet-agent script [08:31:56] and logs are not being helpful [08:32:33] I know and that's due to another long story (will tell later) that should be solved very soon (with the buster migration of cumin hosts) [08:33:01] look at wait_for_puppet in ./modules/base/files/puppet/puppet-common.sh [08:35:13] the script doesn't actually try to run puppet, checks if it's running [08:35:27] (03PS1) 10Ema: ATS: handle frontend healthchecks in Lua [puppet] - 10https://gerrit.wikimedia.org/r/604305 [08:36:18] kormat: so that's conflicted with this run: https://puppetboard.wikimedia.org/report/icinga1001.wikimedia.org/02ed09fd0a48de1e61bbf93b64a39867a952bcb4 [08:36:44] although that didn't last 5 minutes... so wondering why it timed out [08:37:19] and there was another one 30s after [08:37:51] my theory: the normal puppet run started at the exact same time as the wmf-auto-reimage triggered run [08:37:59] to the second [08:38:04] and it's a race [08:38:24] you mean that the run-puppet-agent didn't actually wait but tried to run puppet and hence failed? 
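The race theory here (two runs starting in the same second, with a check that only looks at whether a lock file exists) can be illustrated with a short Python sketch. It is a generic check-then-act example, not the code of `run-puppet-agent` or `puppet-common.sh`; the function names and the lock path used in the demo are assumptions made up for illustration.

```python
import os
import tempfile


def is_running_racy(lock_path):
    """Check-then-act: only reports whether the lock file exists right now.

    Two callers that check at the same instant can both see False and both
    go on to start a run, which is the kind of race suspected in the log.
    """
    return os.path.exists(lock_path)


def try_acquire_atomic(lock_path):
    """Create the lock file atomically; exactly one caller can succeed."""
    try:
        # O_CREAT | O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))
    return True


def release(lock_path):
    """Remove the lock file once the run is finished."""
    os.remove(lock_path)


if __name__ == "__main__":
    lock = os.path.join(tempfile.mkdtemp(), "agent_catalog_run.lock")
    # Both "callers" check before either has created the lock.
    print("racy checks:", is_running_racy(lock), is_running_racy(lock))
    # With atomic creation only the first attempt wins.
    print("atomic attempts:", try_acquire_atomic(lock), try_acquire_atomic(lock))
    release(lock)
```

A waiter built on the atomic variant would retry `try_acquire_atomic` until it succeeds or a deadline passes, instead of polling for the file and then starting a run separately.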
[08:38:30] yep [08:39:23] puppet_is_running() is definitely racy [08:39:28] I guess it's possible indeed [08:39:48] surely not atomic [08:41:44] kormat: what's more worrying is the bios settings in your paste though ;) [08:42:23] (03PS1) 10Dzahn: Revert "base/monitoring: allow setting different contactgroup for systemd" [puppet] - 10https://gerrit.wikimedia.org/r/604308 [08:45:08] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm [08:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:25] we could improve the script and convert it to error when we are confident the PXE bit is still set [08:45:53] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [08:45:53] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [08:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:13] volans: that seems to be "normal" (as in i'm used to seeing that) [08:50:46] !log make asw1-eqsin interfaces Homer like - T250429 [08:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:49] T250429: Homer: Netbox driven switch interfaces - https://phabricator.wikimedia.org/T250429 [08:51:26] kormat: it says the force PXE bit was not reset [08:51:48] volans: the machine booted off disk v0v [08:52:00] i think the script's detection is wrong [08:52:38] that's the output of an IPMI command verbatim [08:53:04] and yes, bmc are bogus [08:53:13] XioNoX: the log message out of context is so amusing for somebody that grew up with the Simpsons [08:53:30] hahah :) [08:54:05] I might have written it with that in mind :) [08:56:01] (03PS1) 10Ema: cloud: fe_vcl_config details for traffic-cache [puppet] - 10https://gerrit.wikimedia.org/r/604311 [08:56:33] 10Operations: debian-installer: partman 
doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db1077.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-rei... [08:59:22] (03PS2) 10Ema: cloud: fe_vcl_config details for traffic-cache [puppet] - 10https://gerrit.wikimedia.org/r/604311 [09:00:19] (03PS3) 10Ema: cloud: fe_vcl_config details for traffic-cache [puppet] - 10https://gerrit.wikimedia.org/r/604311 [09:00:59] (03CR) 10Ema: [C: 03+2] cloud: fe_vcl_config details for traffic-cache [puppet] - 10https://gerrit.wikimedia.org/r/604311 (owner: 10Ema) [09:03:40] there is db2095 with a soft error [09:04:07] it went away [09:08:29] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [09:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:31] !log kormat@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:55] kormat: again? [09:10:18] `Skipping run of Puppet configuration client; administratively disabled (Reason: 'temp. dzahn applying a change');` [09:10:21] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Vasanthi Hargyono - https://phabricator.wikimedia.org/T254961 (10vhargyono-WMF) @Aklapper hi! I'm not sure, but I was asked to create a shell account name when I first created the Wikiteach account. The shell account name is vhargyono [09:10:42] eh... 
this one is harder to avoid :) [09:12:16] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1103 to x1 [puppet] - 10https://gerrit.wikimedia.org/r/604284 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui) [09:14:17] !log T254581 disabling puppet on all mw, api and jobrunner servers to move termbox envoy config to TLS [09:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:22] T254581: Move termbox to use TLS only - https://phabricator.wikimedia.org/T254581 [09:16:39] (03CR) 10JMeybohm: [C: 03+2] services_proxy: switch termbox to TLS [puppet] - 10https://gerrit.wikimedia.org/r/604062 (https://phabricator.wikimedia.org/T254581) (owner: 10JMeybohm) [09:17:26] (03PS40) 10Jbond: cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [09:17:52] 10Operations: debian-installer: partman doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1077.eqiad.wmnet'] ` and were **ALL** successful. 
[09:19:04] (03CR) 10Jbond: cookbooks sre.hosts.rotate-pdu-password: reset SNMP (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [09:19:39] kormat: i re-enabled it, if that helps [09:19:45] i gotta revert something [09:20:06] mutante: well apparently that puppet run didn't matter, things succeeded on my end :) [09:20:13] kormat: alright [09:20:58] (03PS2) 10Dzahn: Revert "base/monitoring: allow setting different contactgroup for systemd" [puppet] - 10https://gerrit.wikimedia.org/r/604308 [09:22:50] (03PS1) 10Filippo Giunchedi: Add v6 for thanos-be1* [dns] - 10https://gerrit.wikimedia.org/r/604314 (https://phabricator.wikimedia.org/T252186) [09:23:49] (03CR) 10Filippo Giunchedi: [C: 03+2] Add v6 for thanos-be1* [dns] - 10https://gerrit.wikimedia.org/r/604314 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [09:24:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1103:3312 and db1103:3314', diff saved to https://phabricator.wikimedia.org/P11441 and previous config saved to /var/cache/conftool/dbconfig/20200610-092406-marostegui.json [09:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:56] (03PS1) 10Kormat: install_server: Fix reuse-db.cfg recipe [puppet] - 10https://gerrit.wikimedia.org/r/604315 (https://phabricator.wikimedia.org/T252027) [09:25:10] 10Operations, 10Traffic, 10Patch-For-Review: Let ats-tls handle port 80 - https://phabricator.wikimedia.org/T254235 (10ema) [09:26:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1103 to dbctl, depooled T253217', diff saved to https://phabricator.wikimedia.org/P11442 and previous config saved to /var/cache/conftool/dbconfig/20200610-092603-marostegui.json [09:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:07] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 [09:26:54] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime [09:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:57] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:08] (03CR) 10Marostegui: [C: 03+1] install_server: Fix reuse-db.cfg recipe [puppet] - 10https://gerrit.wikimedia.org/r/604315 (https://phabricator.wikimedia.org/T252027) (owner: 10Kormat) [09:27:57] (03PS2) 10Ema: ATS: handle healthchecks in Lua [puppet] - 10https://gerrit.wikimedia.org/r/604305 [09:28:38] (03PS2) 10Kormat: install_server: Fix reuse-db.cfg recipe [puppet] - 10https://gerrit.wikimedia.org/r/604315 (https://phabricator.wikimedia.org/T252027) [09:30:08] (03CR) 10Ema: "pcc here https://puppet-compiler.wmflabs.org/compiler1001/23139/" [puppet] - 10https://gerrit.wikimedia.org/r/604305 (owner: 10Ema) [09:30:35] (03CR) 10Kormat: [C: 03+2] install_server: Fix reuse-db.cfg recipe [puppet] - 10https://gerrit.wikimedia.org/r/604315 (https://phabricator.wikimedia.org/T252027) (owner: 10Kormat) [09:31:06] !log configure thanos-be1* HDDs as raid0 - T252186 [09:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:09] T252186: Deploy Thanos (Prometheus long-term storage) stateful components - https://phabricator.wikimedia.org/T252186 [09:31:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [09:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:27] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [09:34:02] 
(03CR) 10Volans: [C: 03+1] "LGTM for this first version, I'm sure we'll have some follow up to improve it." [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [09:34:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1137 for cloning db1103 - T253217', diff saved to https://phabricator.wikimedia.org/P11443 and previous config saved to /var/cache/conftool/dbconfig/20200610-093440-marostegui.json [09:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:44] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 [09:34:53] 10Operations: reuse-parts.sh: provide feedback to user when something fails - https://phabricator.wikimedia.org/T254982 (10Kormat) [09:35:42] !log Stop mysql on db1127 to clone db1103 [09:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:47] !log imported 0.0.38-1+deb10u1 into buster-wikimedia APT - T245114 [09:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:51] T245114: Migrate Cumin hosts to Buster - https://phabricator.wikimedia.org/T245114 [09:37:15] (Cannot access the database: Cannot access the database: No working replica DB server: Unknown error (10.64.0.97)) [09:37:25] Doing any maintainence? [09:37:31] same messages in incubator [09:37:31] ShakespeareFan00: I depooled that host [09:37:39] ahha.. 
[09:37:39] Problems in eswiki also [09:37:46] Ah [09:37:47] gah [09:37:51] stopped the wrong one [09:37:52] fixing it now [09:37:59] (Accès à la base de données impossible : Cannot access the database: No working replica DB server: Unknown error (10.64.0.97)) on fr.wikipedia.org [09:38:03] 10Operations, 10Patch-For-Review: debian-installer: partman doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 (10Kormat) 05Open→03Resolved a:03Kormat I'm pronouncing this Resolved \o/ I've successfully reimaged sretest1002 and db1007 using reuse-parts. We're n... [09:38:07] that's db1127.eqiad.wmnet [09:38:10] Cannot access the database: No working replica DB server: Unknown error (10.64.0.97) [09:38:22] should be fixed now [09:38:33] 10Operations, 10Puppet, 10netbox, 10cloud-services-team (Kanban): Netbox missing physical device in PuppetDB when Puppet disabled for too long - https://phabricator.wikimedia.org/T254986 (10ayounsi) p:05Triage→03Low [09:38:34] fixed for me :) [09:38:43] it is confusing as we have db1127 and db1137 [09:38:44] confirmed fixed [09:38:47] I was looking up String Theory, I'm thinking this might be the problem :P [09:38:48] works now just as I was about to complain [09:38:50] I depooled one, but stopped the other :) [09:39:19] thanks everyone for the quick heads up :) [09:39:40] (03PS3) 10Dzahn: Revert "base/monitoring: allow setting different contactgroup for systemd" [puppet] - 10https://gerrit.wikimedia.org/r/604308 [09:40:00] (03CR) 10Dzahn: [C: 03+2] Revert "base/monitoring: allow setting different contactgroup for systemd" [puppet] - 10https://gerrit.wikimedia.org/r/604308 (owner: 10Dzahn) [09:43:27] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:43:33] <_joe_> uh [09:43:33] 10Operations, 10Analytics, 10Analytics-Kanban, 10vm-requests: Create archiva1002 as replacement of archiva1001 - https://phabricator.wikimedia.org/T254890 (10elukey) ` elukey@cumin1001:~$ sudo cookbook sre.ganeti.makevm eqiad_A archiva1002.wikimedia.org --vcpus 4 --memory 4 --disk 100 --network public STAR... [09:43:51] if it's for the db it's a bit late icinga... [09:44:40] (03PS1) 10Hnowlan: service::docker: Change volume parameter type [puppet] - 10https://gerrit.wikimedia.org/r/604316 (https://phabricator.wikimedia.org/T220399) [09:48:47] (03PS1) 10Elukey: Add archiva1002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/604317 (https://phabricator.wikimedia.org/T254890) [09:48:49] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: add SyslogIdentifier=%N to systemd services [puppet] - 10https://gerrit.wikimedia.org/r/604009 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [09:50:06] (03CR) 10Elukey: [C: 03+2] Add archiva1002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/604317 (https://phabricator.wikimedia.org/T254890) (owner: 10Elukey) [09:51:12] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001:9501 job=burrow partition={0,1} site=eqiad topic={udp_localhost-err,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-dataso [09:51:12] heus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [09:51:51] <_joe_> we're having another spike of mw errors [09:52:14] <_joe_> is anyone deploying anything? [09:52:32] <_joe_> jayme / akosiaris what's the status of termbox? 
[09:53:01] switched on mwdebug2002.codfw.wmnet,mw2244.codfw.wmnet,mw1311.eqiad.wmnet _joe_ [09:53:03] _joe_: I don't see anything on fatals [09:53:19] <_joe_> yeah it's grafana acting weird, I'm sorry [09:53:26] <_joe_> but at the same time [09:53:32] <_joe_> it might be logstash is lagging [09:53:35] <_joe_> see the alert up there [09:53:53] <_joe_> lemme go see on mwlog [09:55:42] <_joe_> I see a few "bad data in row XXX" [09:55:49] <_joe_> but overall nothing stands out [09:55:53] 10Operations, 10Domains, 10Traffic: wikibase.org should redirect to wikiba.se - https://phabricator.wikimedia.org/T254957 (10johanricher) [09:55:57] _joe_: I did upgrade eventstreams envoy earlier today. Would not expect that to be related, though [09:56:01] <_joe_> but look at grafana https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops [09:56:09] <_joe_> jayme: abs not [09:59:41] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:59:51] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [10:00:35] (03PS4) 10Filippo Giunchedi: prometheus: enable Thanos upload for k8s [puppet] - 10https://gerrit.wikimedia.org/r/602715 (https://phabricator.wikimedia.org/T252186) [10:00:37] (03PS4) 10Filippo Giunchedi: prometheus: enable Thanos upload for ops in esams [puppet] - 10https://gerrit.wikimedia.org/r/602717 (https://phabricator.wikimedia.org/T252186) [10:00:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1137 into x1', diff saved to https://phabricator.wikimedia.org/P11447 and previous config saved to /var/cache/conftool/dbconfig/20200610-100037-marostegui.json [10:00:39] (03CR) 10Filippo Giunchedi: prometheus: enable Thanos upload for ops in esams (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602717 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [10:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:47] 10Operations, 10Puppet, 10netbox, 10cloud-services-team (Kanban): Netbox missing physical device in PuppetDB when Puppet disabled for too long - https://phabricator.wikimedia.org/T254986 (10jbond) >As far as I've been told, after a certain time (14d I think) of Puppet being disabled on a host, the host is... 
[10:01:15] (03PS1) 10Jforrester: Undeploy CollaborationKit: I – Disable on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604319 (https://phabricator.wikimedia.org/T254036) [10:01:17] (03PS1) 10Jforrester: Undeploy CollaborationKit: II – Disable on Test Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604320 (https://phabricator.wikimedia.org/T254036) [10:01:19] (03PS1) 10Jforrester: Undeploy CollaborationKit: III – Drop ability to load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604321 (https://phabricator.wikimedia.org/T254036) [10:01:22] (03PS1) 10Jforrester: Undeploy CollaborationKit: IV – Drop flag to load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604322 (https://phabricator.wikimedia.org/T254036) [10:01:24] (03PS1) 10Jforrester: Undeploy CollaborationKit: V – Drop i18n load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604323 (https://phabricator.wikimedia.org/T254036) [10:01:49] jouncebot: next [10:01:49] In 0 hour(s) and 58 minute(s): European Mid-day backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200610T1100) [10:02:00] (03CR) 10Jforrester: [C: 03+2] Undeploy CollaborationKit: I – Disable on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604319 (https://phabricator.wikimedia.org/T254036) (owner: 10Jforrester) [10:02:07] (03CR) 10Jforrester: [C: 03+2] Undeploy CollaborationKit: II – Disable on Test Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604320 (https://phabricator.wikimedia.org/T254036) (owner: 10Jforrester) [10:02:16] (03PS1) 10Marostegui: mariadb: Enable notification on db1103 [puppet] - 10https://gerrit.wikimedia.org/r/604324 (https://phabricator.wikimedia.org/T253217) [10:02:54] (03Merged) 10jenkins-bot: Undeploy CollaborationKit: I – Disable on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604319 (https://phabricator.wikimedia.org/T254036) (owner: 10Jforrester) [10:03:02] 
(03Merged) 10jenkins-bot: Undeploy CollaborationKit: II – Disable on Test Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604320 (https://phabricator.wikimedia.org/T254036) (owner: 10Jforrester) [10:03:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1103 into x1', diff saved to https://phabricator.wikimedia.org/P11448 and previous config saved to /var/cache/conftool/dbconfig/20200610-100306-marostegui.json [10:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:23] (03CR) 10Marostegui: [C: 03+2] mariadb: Enable notification on db1103 [puppet] - 10https://gerrit.wikimedia.org/r/604324 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui) [10:03:50] !log cloning reviewdb into reviewdb-test at db1132 with replication enabled T254516 [10:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:54] T254516: Get a writable reviewdb clone to test Gerrit upgrade with - https://phabricator.wikimedia.org/T254516 [10:04:00] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Duh, sorry about that!" [puppet] - 10https://gerrit.wikimedia.org/r/604316 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [10:04:43] Krinkle: Argh. 8a5401cc5feb5e417419d4df622aa86cfd9728ea was merged but not pulled/deployed? 
("Revert "Enable "coalesceKeys" for global keys for WANCache"") [10:05:26] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, feel free to revert to the defaults to Thanos' memcache because it isn't in production yet" [puppet] - 10https://gerrit.wikimedia.org/r/603942 (https://phabricator.wikimedia.org/T252391) (owner: 10Elukey) [10:08:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1103,db1137 into x1', diff saved to https://phabricator.wikimedia.org/P11449 and previous config saved to /var/cache/conftool/dbconfig/20200610-100834-marostegui.json [10:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:08] OK, spot checks on some machines makes it look like the original patch was never synced. Tsk, but fine. [10:10:48] Ah, but you still have a global scap lock? :-P [10:11:21] (03PS1) 10Marostegui: mariadb: Move db1127 from x1 to s7 [puppet] - 10https://gerrit.wikimedia.org/r/604328 (https://phabricator.wikimedia.org/T253217) [10:12:21] !log upgrading remaining API servers in codfw to PHP 7.2.31 [10:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1103,db1137 into x1', diff saved to https://phabricator.wikimedia.org/P11450 and previous config saved to /var/cache/conftool/dbconfig/20200610-101407-marostegui.json [10:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:57] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T254036 Undeploy CollaborationKit: II – Disable on Test Wikipedia (duration: 01m 37s) [10:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:01] T254036: Undeploy the CollaborationKit extension from Wikipedia production - https://phabricator.wikimedia.org/T254036 [10:17:06] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM overall, thanks for working on this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/602751 (owner: 10Cwhite) [10:17:33] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:18:07] (03CR) 10Jforrester: [C: 03+2] Undeploy CollaborationKit: III – Drop ability to load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604321 (https://phabricator.wikimedia.org/T254036) (owner: 10Jforrester) [10:18:56] (03Merged) 10jenkins-bot: Undeploy CollaborationKit: III – Drop ability to load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604321 (https://phabricator.wikimedia.org/T254036) (owner: 10Jforrester) [10:19:42] 10Operations, 10Analytics, 10netops: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10elukey) @Ottomata first n00b question - I am trying to think about where to add the netflow schema to the secondary repository, and I have some doubts about the dir structure. Should i... 
[10:20:35] (03CR) 10Jforrester: [C: 03+2] Undeploy CollaborationKit: IV – Drop flag to load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604322 (https://phabricator.wikimedia.org/T254036) (owner: 10Jforrester) [10:20:36] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T254036 Undeploy CollaborationKit: III – Drop ability to load (duration: 01m 05s) [10:20:38] (03CR) 10Jforrester: [C: 03+2] Undeploy CollaborationKit: V – Drop i18n load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604323 (https://phabricator.wikimedia.org/T254036) (owner: 10Jforrester) [10:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:25] (03Merged) 10jenkins-bot: Undeploy CollaborationKit: IV – Drop flag to load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604322 (https://phabricator.wikimedia.org/T254036) (owner: 10Jforrester) [10:21:30] (03Merged) 10jenkins-bot: Undeploy CollaborationKit: V – Drop i18n load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604323 (https://phabricator.wikimedia.org/T254036) (owner: 10Jforrester) [10:22:45] (03PS3) 10Giuseppe Lavagetto: Add local service proxy to the tls terminator v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) [10:23:19] (03CR) 10Jcrespo: "Worrying about collations." 
(031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (owner: 10Kormat) [10:23:34] !log T254581 re-enabled puppet on all mw, api and jobrunner servers [10:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:38] T254581: Move termbox to use TLS only - https://phabricator.wikimedia.org/T254581 [10:24:57] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T254036 Undeploy CollaborationKit: IV – Drop flag to load (duration: 01m 05s) [10:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:01] T254036: Undeploy the CollaborationKit extension from Wikipedia production - https://phabricator.wikimedia.org/T254036 [10:25:38] (03CR) 10Elukey: [C: 04-1] "Settings this to -1 since we are not ready yet to merge :)" [puppet] - 10https://gerrit.wikimedia.org/r/603366 (https://phabricator.wikimedia.org/T252740) (owner: 10Elukey) [10:27:50] (03CR) 10Hnowlan: [C: 03+2] service::docker: Change volume parameter type [puppet] - 10https://gerrit.wikimedia.org/r/604316 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [10:28:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1103,db1137 into x1', diff saved to https://phabricator.wikimedia.org/P11451 and previous config saved to /var/cache/conftool/dbconfig/20200610-102805-marostegui.json [10:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:58] (03PS1) 10Kormat: install_server: Allow rapid prototyping of reuse-parts for d-i-test. [puppet] - 10https://gerrit.wikimedia.org/r/604332 (https://phabricator.wikimedia.org/T254982) [10:31:40] (03CR) 10Marostegui: [C: 03+1] install_server: Allow rapid prototyping of reuse-parts for d-i-test. 
[puppet] - 10https://gerrit.wikimedia.org/r/604332 (https://phabricator.wikimedia.org/T254982) (owner: 10Kormat) [10:33:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/603668 (https://phabricator.wikimedia.org/T254640) (owner: 10BryanDavis) [10:34:09] (03CR) 10Kormat: mariadb: Move db1127 from x1 to s7 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/604328 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui) [10:34:34] (03CR) 10Kormat: [C: 03+2] install_server: Allow rapid prototyping of reuse-parts for d-i-test. [puppet] - 10https://gerrit.wikimedia.org/r/604332 (https://phabricator.wikimedia.org/T254982) (owner: 10Kormat) [10:35:34] (03CR) 10Marostegui: mariadb: Move db1127 from x1 to s7 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/604328 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui) [10:36:17] (03CR) 10Dzahn: "This did what it was supposed to do but unfortunately also changed custom contactgroups for other things to just "admins". 
So i reverted i" [puppet] - 10https://gerrit.wikimedia.org/r/602052 (owner: 10Dzahn) [10:37:15] (03PS2) 10Marostegui: mariadb: Move db1127 from x1 to s7 [puppet] - 10https://gerrit.wikimedia.org/r/604328 (https://phabricator.wikimedia.org/T253217) [10:37:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1127 moving it to s7 T253217', diff saved to https://phabricator.wikimedia.org/P11452 and previous config saved to /var/cache/conftool/dbconfig/20200610-103742-marostegui.json [10:37:45] (03CR) 10Kormat: mariadb: Move db1127 from x1 to s7 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/604328 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui) [10:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:47] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 [10:38:21] (03CR) 10Marostegui: ">" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/604328 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui) [10:38:46] (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/603942 (https://phabricator.wikimedia.org/T252391) (owner: 10Elukey) [10:39:47] (03CR) 10Kormat: [C: 03+1] mariadb: Move db1127 from x1 to s7 [puppet] - 10https://gerrit.wikimedia.org/r/604328 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui) [10:40:17] 10Operations, 10Analytics, 10Analytics-Kanban, 10vm-requests, 10Patch-For-Review: Create archiva1002 as replacement of archiva1001 - https://phabricator.wikimedia.org/T254890 (10jbond) p:05Triage→03Medium [10:40:26] (03PS3) 10Marostegui: mariadb: Move db1127 from x1 to s7 [puppet] - 10https://gerrit.wikimedia.org/r/604328 (https://phabricator.wikimedia.org/T253217) [10:41:03] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1127 from x1 to s7 [puppet] - 10https://gerrit.wikimedia.org/r/604328 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui) [10:42:42] 10Operations, 
10LDAP-Access-Requests: LDAP access to the wmf group for Vasanthi Hargyono - https://phabricator.wikimedia.org/T254961 (10jbond) @vhargyono-WMF i have added you to the wmf ldap group please reopen if you are still unable to access the desired services [10:42:54] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Vasanthi Hargyono - https://phabricator.wikimedia.org/T254961 (10jbond) 05Open→03Resolved a:03jbond [10:43:08] 10Operations, 10Patch-For-Review: reuse-parts.sh: provide feedback to user when something fails - https://phabricator.wikimedia.org/T254982 (10jbond) p:05Triage→03Medium [10:44:59] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:26] 10Operations, 10SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (10jbond) [10:50:09] (03PS1) 10Dzahn: add management and production IPs for mw2335-mw2339 [dns] - 10https://gerrit.wikimedia.org/r/604339 (https://phabricator.wikimedia.org/T247021) [10:50:34] (03CR) 10jerkins-bot: [V: 04-1] add management and production IPs for mw2335-mw2339 [dns] - 10https://gerrit.wikimedia.org/r/604339 (https://phabricator.wikimedia.org/T247021) (owner: 10Dzahn) [10:51:42] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Dzahn) @Papaul I uploaded a new change to add mgmt and production IPs for mw2335-mw2339 (C3). Does it look good to you? 
[10:54:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [10:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:42] (03PS2) 10Dzahn: add management and production IPs for mw2335-mw2339 [dns] - 10https://gerrit.wikimedia.org/r/604339 (https://phabricator.wikimedia.org/T247021) [10:57:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:08] 10Operations, 10Patch-For-Review: reuse-parts.sh: provide feedback to user when something fails - https://phabricator.wikimedia.org/T254982 (10Kormat) a:03Kormat [10:59:17] 10Operations, 10SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (10jbond) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200610T1100). [11:00:21] 10Operations, 10SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (10jbond) @AndrewKuznetsov I have updated the ticket using the [[https://phabricator.wikimedia.org/maniphest/tas... 
[11:00:41] 10Operations, 10SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (10jbond) p:05Triage→03Medium [11:02:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1094 moving to clone db1127 T253217', diff saved to https://phabricator.wikimedia.org/P11453 and previous config saved to /var/cache/conftool/dbconfig/20200610-110204-marostegui.json [11:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:08] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 [11:02:48] !log Stop MySQL on db1094 to clone db1127 [11:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:26] (03PS1) 10Dzahn: gerrit: add parameter for db_name, let gerrit1002 use test db [puppet] - 10https://gerrit.wikimedia.org/r/604343 (https://phabricator.wikimedia.org/T254516) [11:09:40] (03CR) 10jerkins-bot: [V: 04-1] gerrit: add parameter for db_name, let gerrit1002 use test db [puppet] - 10https://gerrit.wikimedia.org/r/604343 (https://phabricator.wikimedia.org/T254516) (owner: 10Dzahn) [11:12:26] (03PS2) 10Dzahn: gerrit: add parameter for db_name, let gerrit1002 use test db [puppet] - 10https://gerrit.wikimedia.org/r/604343 (https://phabricator.wikimedia.org/T254516) [11:12:51] (03CR) 10Dzahn: "The password has to be changed in private/hieradata/hosts/gerrit1002.yaml:gerrit::server::db_pass along with this to switch gerrit1002 fro" [puppet] - 10https://gerrit.wikimedia.org/r/604343 (https://phabricator.wikimedia.org/T254516) (owner: 10Dzahn) [11:13:47] !log upgrading remaining job runners in codfw to PHP 7.2.31 [11:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:19] (03CR) 10Dzahn: [C: 03+1] "This was still needed but if we really don't need mysql at all anymore as soon as we are on 3.1... we can abandon this I guess." 
[software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/548549 (owner: 10Paladox) [11:17:07] (03CR) 10Dzahn: [C: 04-1] "Error: Function lookup() did not find a value for the name 'gerrit::server::db_name'" [puppet] - 10https://gerrit.wikimedia.org/r/604343 (https://phabricator.wikimedia.org/T254516) (owner: 10Dzahn) [11:20:08] (03PS3) 10Dzahn: gerrit: add parameter for db_name, let gerrit1002 use test db [puppet] - 10https://gerrit.wikimedia.org/r/604343 (https://phabricator.wikimedia.org/T254516) [11:22:57] (03PS1) 10Dzahn: mx: change officeit@ to its@ email recipient for exim alias mail [puppet] - 10https://gerrit.wikimedia.org/r/604350 [11:26:54] (03PS2) 10Urbanecm: Grant cswiki accountcreators tboverride-account and override-antispoof [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604112 (https://phabricator.wikimedia.org/T254927) [11:30:46] (03PS1) 10Hnowlan: service::docker: correct param name [puppet] - 10https://gerrit.wikimedia.org/r/604354 (https://phabricator.wikimedia.org/T220399) [11:31:13] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=thanos-compact site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:32:36] (03CR) 10Muehlenhoff: [C: 03+2] On buster install python3-tqdm from the spicerack component [puppet] - 10https://gerrit.wikimedia.org/r/604023 (https://phabricator.wikimedia.org/T245114) (owner: 10Muehlenhoff) [11:34:07] * Urbanecm goes to deploy a config patch [11:34:12] (03CR) 10Urbanecm: [C: 03+2] Grant cswiki accountcreators tboverride-account and override-antispoof [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604112 (https://phabricator.wikimedia.org/T254927) (owner: 10Urbanecm) [11:35:08] (03Merged) 10jenkins-bot: Grant cswiki accountcreators tboverride-account and override-antispoof [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604112 
(https://phabricator.wikimedia.org/T254927) (owner: 10Urbanecm) [11:37:38] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 52091b8: Grant cswiki accountcreators tboverride-account and override-antispoof (T254927) (duration: 01m 06s) [11:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:42] T254927: Allow cswiki account creators to bypass title blacklist and antispoof blacklist - https://phabricator.wikimedia.org/T254927 [11:38:06] * Urbanecm done [11:38:06] !log Deploy schema change on testcommonswiki T255003 [11:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:10] T255003: Review & apply schema changes for T250748 - https://phabricator.wikimedia.org/T255003 [11:41:11] !log upgrading remaining app servers in codfw to PHP 7.2.31 [11:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:27] !log Deploy schema change on commonswiki codfw T255003 [11:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:32] T255003: Review & apply schema changes for T250748 - https://phabricator.wikimedia.org/T255003 [11:53:01] (03PS3) 10Ema: ATS: handle healthchecks in Lua [puppet] - 10https://gerrit.wikimedia.org/r/604305 [12:01:05] (03PS1) 10Ssingh: dnsdist: set default provider (TLS library) for DoT [puppet] - 10https://gerrit.wikimedia.org/r/604356 (https://phabricator.wikimedia.org/T252132) [12:02:35] (03PS1) 10Esanders: Labs: Add visualeditor-realtime.wmflabs.org to CSP's approved domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604358 [12:03:05] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/23142/" [puppet] - 10https://gerrit.wikimedia.org/r/604356 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [12:06:57] (03CR) 10Alexandros Kosiaris: [C: 03+1] "How come this never bit us still..." 
[puppet] - 10https://gerrit.wikimedia.org/r/604354 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [12:07:27] (03CR) 10Hnowlan: [C: 03+2] service::docker: correct param name [puppet] - 10https://gerrit.wikimedia.org/r/604354 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [12:12:36] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: name=thumbor2001.codfw.wmnet [12:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:40] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: name=thumbor2002.codfw.wmnet [12:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:06] !log pool thumbor2002, thumbor2001. T251570 [12:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:10] T251570: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 [12:15:54] 10Operations, 10Traffic: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 (10ema) [12:16:02] 10Operations, 10Traffic: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 (10ema) p:05Triage→03Medium [12:18:20] (03PS4) 10Ema: ATS: handle healthchecks in Lua [puppet] - 10https://gerrit.wikimedia.org/r/604305 (https://phabricator.wikimedia.org/T255015) [12:22:00] (03PS1) 10Ema: varnish: narrow down healthchecks definition [puppet] - 10https://gerrit.wikimedia.org/r/604364 (https://phabricator.wikimedia.org/T255015) [12:25:09] (03PS6) 10Ayounsi: Add sre.network.prepare-upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/596444 [12:25:30] (03CR) 10Ayounsi: [C: 03+2] Add sre.network.prepare-upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/596444 (owner: 10Ayounsi) [12:32:37] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [12:32:37] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.prepare-upgrade (exit_code=99) [12:32:39] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:22] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [12:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:39] ^ that's just a test [12:34:56] (03CR) 10Dzahn: [C: 04-1] "unfortunately not working because https://gerrit.wikimedia.org/r/c/operations/puppet/+/587233/ had to be reverted after it took down Phabr" [puppet] - 10https://gerrit.wikimedia.org/r/569104 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [12:44:31] (03CR) 10Dzahn: memcached: allow more tunables to avoid implicit settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/603942 (https://phabricator.wikimedia.org/T252391) (owner: 10Elukey) [12:45:59] elukey: if so many places change the default for memcached parameters and the actual default is specific to MediaWiki, shouldn't we change the defaults and explicitly set different values for MW instead of the other way around? or did i get that wrong? [12:46:46] mutante: the change is meant to be a no-op, we are actually using slab sizes that were hard-wired into the memcached classes but that are mediawiki objcache specific [12:47:20] the values are now explicitly stated to avoid any change from the actual status quo, then the "Review" part is meant to either use defaults or do something else [12:47:51] if you see in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/603942/6/modules/profile/manifests/memcached/instance.pp defaults have changed [12:48:36] and also here https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/603942/6/modules/memcached/manifests/init.pp [12:49:05] (min_slab_size was 5 in the systemd unit before, now 48, as memcached defaults - same thing for growth_factor) [12:49:33] elukey: ooh. they _did_ change.. ok, ok. 
then i get it [12:49:43] (03PS1) 10Alexandros Kosiaris: docker-service-shim: Fix ERB syntax [puppet] - 10https://gerrit.wikimedia.org/r/604373 (https://phabricator.wikimedia.org/T220399) [12:50:21] (03CR) 10Hnowlan: [C: 03+1] docker-service-shim: Fix ERB syntax [puppet] - 10https://gerrit.wikimedia.org/r/604373 (https://phabricator.wikimedia.org/T220399) (owner: 10Alexandros Kosiaris) [12:50:47] (03CR) 10Hnowlan: [C: 03+2] docker-service-shim: Fix ERB syntax [puppet] - 10https://gerrit.wikimedia.org/r/604373 (https://phabricator.wikimedia.org/T220399) (owner: 10Alexandros Kosiaris) [12:51:04] elukey: for the file simplelamp2, what i wanted is defaults simply because it is meant to be a generic role to be used by different things in cloud VPS. not specific to a project. [12:52:13] i guess since so many places use these numbers they are fine and the TODO can be removed [12:52:44] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Hardware): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Jclark-ctr) @ wiki_willy Dell will not do anything further. unless we renew/upgrade warranty to pro-support. because t... [12:53:00] mutante: yeah I added you to the code change to know what was the use case, I wanted to go for a full no-op to avoid any change if merged. After this anybody can amend their classes with defaults (removing the current class parameters) or tune the values [12:53:14] my fear is that a lot of systems use mw's defaults without knowing it [12:54:20] so I would let service owners/experts to remove that TODO [12:54:35] elukey: use-case is "replaced simplelamp". a generic role for cloud VPS users to slap on a VM if they want "just a LAMP stack" and memcached used to be in it before as well.. so continued it. 
i want the non-mediawiki defaults [12:54:45] (03CR) 10Dzahn: [C: 03+1] memcached: allow more tunables to avoid implicit settings [puppet] - 10https://gerrit.wikimedia.org/r/603942 (https://phabricator.wikimedia.org/T252391) (owner: 10Elukey) [12:54:49] elukey: ack, +1 [12:55:12] thanks for the review! [13:00:04] longma and liw: Time to snap out of that daydream and deploy Mediawiki train - American+European Version (secondary timeslot). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200610T1300). [13:00:09] (03PS2) 10Muehlenhoff: profile::analytics::database::meta: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/599696 [13:02:04] (03PS1) 10Lars Wirzenius: group1 wikis to 1.35.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604376 [13:02:06] (03CR) 10Lars Wirzenius: [C: 03+2] group1 wikis to 1.35.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604376 (owner: 10Lars Wirzenius) [13:03:09] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604376 (owner: 10Lars Wirzenius) [13:04:13] (03CR) 10Dzahn: dnsdist: set default provider (TLS library) for DoT (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/604356 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [13:05:12] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.36 [13:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:35] (03CR) 10Elukey: [C: 03+1] "Shame on Luca for the analytics nodes :D" [puppet] - 10https://gerrit.wikimedia.org/r/602751 (owner: 10Cwhite) [13:06:17] !log liw@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.36 (duration: 01m 04s) [13:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:21] (03CR) 10Ssingh: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/604356 
(https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [13:09:16] (03CR) 10Dzahn: [C: 03+2] mx: change officeit@ to its@ email recipient for exim alias mail [puppet] - 10https://gerrit.wikimedia.org/r/604350 (owner: 10Dzahn) [13:11:29] 10Operations, 10Gujarati-Sites, 10User-Urbanecm, 10Wiki-Setup (Create): Create Gujarati Wikisource - https://phabricator.wikimedia.org/T37138 (10CptViraj) [13:12:22] (03CR) 10Dzahn: [C: 03+1] dnsdist: set default provider (TLS library) for DoT [puppet] - 10https://gerrit.wikimedia.org/r/604356 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [13:14:14] (03PS1) 10Marostegui: check_mariadb.py: Add check for the event_scheduler [puppet] - 10https://gerrit.wikimedia.org/r/604379 (https://phabricator.wikimedia.org/T254738) [13:14:15] (03CR) 10Ssingh: [C: 03+2] dnsdist: set default provider (TLS library) for DoT [puppet] - 10https://gerrit.wikimedia.org/r/604356 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [13:15:04] (03CR) 10jerkins-bot: [V: 04-1] check_mariadb.py: Add check for the event_scheduler [puppet] - 10https://gerrit.wikimedia.org/r/604379 (https://phabricator.wikimedia.org/T254738) (owner: 10Marostegui) [13:17:03] (03PS1) 10Jbond: facter: cpu_details add flags, model, bugs and family [puppet] - 10https://gerrit.wikimedia.org/r/604381 [13:17:38] (03PS4) 10Huji: Set wgCheckUserLogLogins to true by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599492 (https://phabricator.wikimedia.org/T253802) [13:17:39] (03CR) 10jerkins-bot: [V: 04-1] facter: cpu_details add flags, model, bugs and family [puppet] - 10https://gerrit.wikimedia.org/r/604381 (owner: 10Jbond) [13:18:55] (03PS1) 10Filippo Giunchedi: site: add thanos-be1* to thanos::backend [puppet] - 10https://gerrit.wikimedia.org/r/604383 (https://phabricator.wikimedia.org/T252186) [13:20:14] (03PS2) 10Jbond: facter: cpu_details add flags, model, bugs and family [puppet] - 10https://gerrit.wikimedia.org/r/604381 [13:20:14] (03CR) 
10jerkins-bot: [V: 04-1] facter: cpu_details add flags, model, bugs and family [puppet] - 10https://gerrit.wikimedia.org/r/604381 (owner: 10Jbond) [13:20:34] (03PS2) 10Marostegui: check_mariadb.py: Add check for the event_scheduler [puppet] - 10https://gerrit.wikimedia.org/r/604379 (https://phabricator.wikimedia.org/T254738) [13:22:11] (03CR) 10Marostegui: "root@db2102:/home/marostegui# ./check_mariadb.py" [puppet] - 10https://gerrit.wikimedia.org/r/604379 (https://phabricator.wikimedia.org/T254738) (owner: 10Marostegui) [13:22:11] (03CR) 10Filippo Giunchedi: [C: 03+2] site: add thanos-be1* to thanos::backend [puppet] - 10https://gerrit.wikimedia.org/r/604383 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [13:22:21] (03PS2) 10Filippo Giunchedi: site: add thanos-be1* to thanos::backend [puppet] - 10https://gerrit.wikimedia.org/r/604383 (https://phabricator.wikimedia.org/T252186) [13:24:22] (03PS3) 10Jbond: facter: cpu_details add flags, model, bugs and family [puppet] - 10https://gerrit.wikimedia.org/r/604381 [13:25:45] (03CR) 10CDanis: [C: 03+1] prometheus: enable Thanos upload for ops in esams [puppet] - 10https://gerrit.wikimedia.org/r/602717 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [13:28:42] (03CR) 10Vgutierrez: [C: 03+1] ATS: handle healthchecks in Lua [puppet] - 10https://gerrit.wikimedia.org/r/604305 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [13:30:03] (03CR) 10Vgutierrez: [C: 03+1] "So this is gonna ruin my infamous /from/vgutierrez tests, +1 :)" [puppet] - 10https://gerrit.wikimedia.org/r/604364 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [13:31:16] (03CR) 10CDanis: [C: 03+1] ATS: handle healthchecks in Lua [puppet] - 10https://gerrit.wikimedia.org/r/604305 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [13:31:31] (03CR) 10CDanis: [C: 03+1] varnish: narrow down healthchecks definition [puppet] - 10https://gerrit.wikimedia.org/r/604364 
(https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [13:31:37] (03PS5) 10Ema: ATS: handle healthchecks in Lua [puppet] - 10https://gerrit.wikimedia.org/r/604305 (https://phabricator.wikimedia.org/T255015) [13:32:34] 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Replicated ticket registry - https://phabricator.wikimedia.org/T233933 (10elukey) Ah ok thanks for the explanation, now it is more clear. We have to keep in mind that mcrouter can have tkos and stop sending keys to a particular shard if not respons... [13:35:32] (03CR) 10Ema: [C: 03+2] ATS: handle healthchecks in Lua [puppet] - 10https://gerrit.wikimedia.org/r/604305 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [13:36:20] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.prepare-upgrade (exit_code=99) [13:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:06] !log cp3050: ats-backend-restart to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604305/ T255015 [13:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:10] T255015: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 [13:40:21] (03PS3) 10Dzahn: add management and production IPs for mw2335-mw2339 [dns] - 10https://gerrit.wikimedia.org/r/604339 (https://phabricator.wikimedia.org/T247021) [13:41:20] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [13:44:42] 10Operations, 10ops-eqiad, 10DC-Ops: (NEED BY: 2020-06-11) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (10ops-monitoring-bot) Script wmf-auto-reimage was 
launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-fe1002.eqiad.wmnet ` The log can be fou... [13:45:32] PROBLEM - ps1-a4-eqiad-infeed-load-tower-A-phase-Z on ps1-a4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:46:20] Uh.. I cannot see RC on meta (Internal error) [13:46:24] RECOVERY - ps1-a4-eqiad-infeed-load-tower-A-phase-Z on ps1-a4-eqiad is OK: SNMP OK - ps1-a4-eqiad-infeed-load-tower-A-phase-Z 439 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:46:35] ^^ this is me testing [13:46:37] (03CR) 10Papaul: [C: 03+1] site: add new appservers mw2335 through mw2339 [puppet] - 10https://gerrit.wikimedia.org/r/599749 (https://phabricator.wikimedia.org/T241852) (owner: 10Dzahn) [13:46:46] ahha [13:47:08] I think jbond42 means the PDU alerts, not the RecentChanges issue (which I see as well) [13:47:35] cdanis: Sotiale: yes sorry i did mean the pdu alerts [13:47:37] 10Operations, 10Cloud-VPS, 10DNS, 10Maps, and 2 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256 (10Andrew) [13:47:51] InvalidArgumentException from line 142 of /srv/mediawiki/php-1.35.0-wmf.36/includes/title/NamespaceInfo.php: NamespaceInfo::isTalk called with non-integer (string) namespace '-1' [13:48:41] seems new as of about 13:00, did we roll a train this morning? 
[13:49:33] liw: ^^^ [13:50:01] cdanis: it went through at 13:06 UTC [13:50:13] based on logmsgbot [13:50:23] yeah just found it in sal [13:50:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1094 into s7', diff saved to https://phabricator.wikimedia.org/P11457 and previous config saved to /var/cache/conftool/dbconfig/20200610-135039-marostegui.json [13:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:44] !log cp3050: ats-tls-restart to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604305/ T255015 [13:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:47] T255015: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 [13:51:06] that is approx the timestamp that these errors begin on ... both meta and enwikiquote [13:51:14] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-fe1001.eqiad.wmnet ` The log can be found in `/var... [13:51:23] and all for wmf.36 [13:52:19] cdanis: it's not every .36 wiki [13:52:33] yep, just meta and enwikiquote afaict [13:52:38] afwikiquote and mediawikiwiki are fine [13:52:45] must be something about their configs [13:53:22] Still a blocker [13:53:23] 10Operations, 10Analytics, 10Analytics-Kanban, 10observability, 10Patch-For-Review: systemd::syslog conf should use :programname equals instead of startswith - https://phabricator.wikimedia.org/T251606 (10Ottomata) Let's! [13:53:50] https://phabricator.wikimedia.org/T255016 [13:54:36] James_F: ^ should that not block the train [13:54:45] it is a train blocker [13:54:49] There's a patch waiting [13:54:51] oh, no, sorry [13:55:42] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/604386 ? 
[13:56:16] cdanis: thats got the same bug id on of the task james merged it into [13:56:28] yeah [13:57:02] I suspect the patch just needs to be merged and backported [13:57:20] https://phabricator.wikimedia.org/T253098#6211056 [13:57:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1094 into s7', diff saved to https://phabricator.wikimedia.org/P11458 and previous config saved to /var/cache/conftool/dbconfig/20200610-135753-marostegui.json [13:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:08] (03PS1) 10Papaul: DHCP: Add MAC address for mw233[5-9] [puppet] - 10https://gerrit.wikimedia.org/r/604393 (https://phabricator.wikimedia.org/T241852) [13:58:22] RhinosF1: I don't think it should block the train, but it'd be nice if someone merged my patch so I could backport. [13:58:39] (03PS5) 10JMeybohm: eventgate: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/602061 (https://phabricator.wikimedia.org/T253396) [13:58:40] (03CR) 10Jbond: [C: 03+2] cookbooks sre.hosts.rotate-pdu-password: refactor [cookbooks] - 10https://gerrit.wikimedia.org/r/593476 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [13:58:42] (03PS4) 10Dzahn: add management and production IPs for mw2335-mw2339 [dns] - 10https://gerrit.wikimedia.org/r/604339 (https://phabricator.wikimedia.org/T247021) [13:58:44] (03CR) 10Jbond: [C: 03+2] cookbook sre.hosts.rotate-pdu-password: rename [cookbooks] - 10https://gerrit.wikimedia.org/r/598020 (owner: 10Jbond) [13:58:51] (03CR) 10Jbond: [C: 03+2] sre.pdus.rotate-password: split generic functions out to __init__.py (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/598021 (owner: 10Jbond) [13:58:55] (03CR) 10Jbond: [C: 03+2] cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [13:58:55] 10Operations, 10Analytics, 10netops: Move netflow data 
to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10Ottomata) Yeah we don't have a great convention for namepacing. For analytics/instrumentation schemas, we decided to keep things simple and keep the hierarchy mostly flat, e.g. analyt... [13:59:11] (03PS11) 10Jbond: cookbooks sre.hosts.rotate-pdu-password: refactor [cookbooks] - 10https://gerrit.wikimedia.org/r/593476 (https://phabricator.wikimedia.org/T246890) [13:59:14] James_F: it's breaking RC though on at least 2 wikis [13:59:24] (03PS3) 10Jbond: cookbook sre.hosts.rotate-pdu-password: rename [cookbooks] - 10https://gerrit.wikimedia.org/r/598020 [13:59:29] (03CR) 10Jbond: [V: 03+2 C: 03+2] sre.pdus.rotate-password: split generic functions out to __init__.py [cookbooks] - 10https://gerrit.wikimedia.org/r/598021 (owner: 10Jbond) [13:59:34] (03PS9) 10Jbond: sre.pdus.rotate-password: split generic functions out to __init__.py [cookbooks] - 10https://gerrit.wikimedia.org/r/598021 [13:59:37] it didn't break enwikiquote for me [13:59:38] (03CR) 10jerkins-bot: [V: 04-1] sre.pdus.rotate-password: split generic functions out to __init__.py [cookbooks] - 10https://gerrit.wikimedia.org/r/598021 (owner: 10Jbond) [13:59:39] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [13:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:42] (03CR) 10jerkins-bot: [V: 04-1] cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [13:59:43] I see some errors in the logs, odd that it is inconsistent [13:59:44] (03PS41) 10Jbond: cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [13:59:46] (03CR) 10JMeybohm: [C: 03+2] eventgate: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/602061 (https://phabricator.wikimedia.org/T253396) (owner: 
10JMeybohm) [14:00:10] both meta and enwikiquote are broken for me [14:00:10] James_F: patch is +1 from me ;) [14:00:14] * jbond42 almost wants to do one more commit just to get to PS42 [14:00:22] cdanis: It's bad user input. [14:00:30] Thanks. [14:00:33] (03Merged) 10jenkins-bot: eventgate: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/602061 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [14:01:26] Majavah: Try https://meta.wikimedia.org/wiki/Special:RecentChanges?hidebots=1&translations=filter&hidecategorization=1&limit=50&days=7&enhanced=1&urlversion=2 rather than whatever your saved preference is trying for. [14:01:30] (03CR) 10jerkins-bot: [V: 04-1] cookbook sre.hosts.rotate-pdu-password: rename [cookbooks] - 10https://gerrit.wikimedia.org/r/598020 (owner: 10Jbond) [14:01:38] (03PS5) 10Ottomata: systemd::timer::job - add ability to syslog match based on programname equality [puppet] - 10https://gerrit.wikimedia.org/r/595648 (https://phabricator.wikimedia.org/T251606) [14:02:12] James_F: WFM [14:02:16] James_F: that works, but default settings for meta are broken [14:02:19] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:05] James_F, re https://phabricator.wikimedia.org/T253098 -- do you think it's a train blocker now? or fixed soon? [14:03:19] (03CR) 10Dzahn: [C: 03+2] DHCP: Add MAC address for mw233[5-9] [puppet] - 10https://gerrit.wikimedia.org/r/604393 (https://phabricator.wikimedia.org/T241852) (owner: 10Papaul) [14:03:38] liw: Fixed soon. Not a train blocker. [14:03:54] Majavah: Default settings WFM in an incognito window. [14:04:08] James_F, ta [14:04:16] James_F: hmh, now WFM for me too [14:04:25] Majavah: Your /saved/ "default" view might well not work, however. 
[14:04:36] (03CR) 10Ottomata: systemd::timer::job - add ability to syslog match based on programname equality (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595648 (https://phabricator.wikimedia.org/T251606) (owner: 10Ottomata) [14:05:00] Incog works fine [14:05:04] (03PS6) 10Ottomata: systemd::timer::job - add ability to syslog match based on programname equality [puppet] - 10https://gerrit.wikimedia.org/r/595648 (https://phabricator.wikimedia.org/T251606) [14:05:16] I thought my check was in incognito, might have accidentally opened just a regular window :/ [14:06:11] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [14:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:26] 10Operations, 10ops-eqiad, 10DC-Ops: (NEED BY: 2020-06-11) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thanos-fe1002.eqiad.wmnet'] ` and were **ALL** successful. [14:06:29] If we'd broken incognito view I'd think it was a train blocker, yes. 
[14:07:21] 10Operations, 10Traffic, 10Patch-For-Review: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 (10CDanis) [14:08:05] (03Abandoned) 10CDanis: ats: per-instance named healthcheck URL [puppet] - 10https://gerrit.wikimedia.org/r/604148 (owner: 10CDanis) [14:08:44] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:08] (03PS5) 10Dzahn: add management and production IPs for mw2335-mw2339 [dns] - 10https://gerrit.wikimedia.org/r/604339 (https://phabricator.wikimedia.org/T247021) [14:09:52] !log A:cp rolling ats-be/ats-tls restarts to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604305/ T255015 [14:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:56] T255015: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 [14:11:02] (03CR) 10Paladox: [C: 03+1] "need to add the param to gerrit-prod-1001 too, otherwise +1" [puppet] - 10https://gerrit.wikimedia.org/r/604343 (https://phabricator.wikimedia.org/T254516) (owner: 10Dzahn) [14:11:16] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:12:21] uh... 
[14:12:22] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:12:38] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [14:13:01] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thanos-fe1001.eqiad.wmnet'] ` and were **ALL** successful. [14:13:03] 10Operations, 10Analytics, 10netops: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10elukey) @ayounsi any suggestion? `netflow/flow/something` ? 
[14:13:05] (03CR) 10Paladox: [C: 04-1] "Missed a spot, need to add to jetty.pp and gerrit.config" [puppet] - 10https://gerrit.wikimedia.org/r/604343 (https://phabricator.wikimedia.org/T254516) (owner: 10Dzahn) [14:14:06] 10Operations, 10ops-eqiad, 10DC-Ops: (NEED BY: 2020-06-11) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (10Cmjohnson) [14:14:24] PROBLEM - Ensure traffic_server is running for instance backend on cp3056 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:14:38] (03PS6) 10Dzahn: add management and production IPs for mw2335-mw2339 [dns] - 10https://gerrit.wikimedia.org/r/604339 (https://phabricator.wikimedia.org/T247021) [14:14:47] 10Operations, 10ops-eqiad, 10DC-Ops: (NEED BY: 2020-06-11) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (10Cmjohnson) 05Open→03Resolved @fgiunchedi These are moved and installed. Resolving this task. [14:14:52] 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Replicated ticket registry - https://phabricator.wikimedia.org/T233933 (10jbond) > We have to keep in mind that mcrouter can have tkos and stop sending keys to a particular shard if not responsive, so even in this case the replication could end up... 
[14:15:34] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3056 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 3.219 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:15:40] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:16:16] PROBLEM - Ensure traffic_server is running for instance backend on cp3052 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:17:57] (03PS4) 10Dzahn: gerrit: add parameter for db_name, let gerrit1002 use test db [puppet] - 10https://gerrit.wikimedia.org/r/604343 (https://phabricator.wikimedia.org/T254516) [14:18:10] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3062 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 3.204 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:18:16] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3052 is OK: HTTP OK: HTTP/1.1 200 Ok - 35210 bytes in 0.268 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:19:08] (03CR) 10jerkins-bot: [V: 04-1] gerrit: add parameter for db_name, let gerrit1002 use test db [puppet] - 10https://gerrit.wikimedia.org/r/604343 (https://phabricator.wikimedia.org/T254516) (owner: 10Dzahn) [14:19:40] (03CR) 10Ottomata: [C: 03+2] "Looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/595648 (https://phabricator.wikimedia.org/T251606) (owner: 10Ottomata) [14:19:45] 10Operations, 10Commons, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, and 5 others: Mediawiki maintenance job "generate-fancycaptcha" - fatal error when trying to copy new 
captchas to storage - https://phabricator.wikimedia.org/T230245 (10Reedy) The jump to 7.2.31 didn't help ` reedy@deployment-depl... [14:21:57] !log cp3056: ats-backend-restart [14:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:32] (03CR) 10Jcrespo: "Looks good, but let me test it before production deploy." [puppet] - 10https://gerrit.wikimedia.org/r/604379 (https://phabricator.wikimedia.org/T254738) (owner: 10Marostegui) [14:23:38] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1083 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 3.020 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:23:54] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:24:08] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp5010 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 3.499 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:24:38] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:24:52] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 25985 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:24:58] (03CR) 10Papaul: [C: 03+1] add management and production IPs for mw2335-mw2339 [dns] - 10https://gerrit.wikimedia.org/r/604339 (https://phabricator.wikimedia.org/T247021) (owner: 10Dzahn) [14:25:04] RECOVERY - Ensure traffic_server is running for instance backend on cp3056 is OK: PROCS OK: 1 process with 
args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:26:06] PROBLEM - Ensure traffic_server is running for instance backend on cp3058 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:26:28] PROBLEM - Ensure traffic_server is running for instance backend on cp5010 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:26:30] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3058 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 3.214 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:26:48] (03CR) 10Andrew Bogott: [C: 03+1] "assuming the pcc approves, this seems just fine to me. We could also just check to see if the vars were non-empty and make them empty by " [puppet] - 10https://gerrit.wikimedia.org/r/604075 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [14:27:12] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:27:36] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3060 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 3.194 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:27:58] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp4030 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 3.180 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:28:06] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP 
requests on cp4030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:28:18] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp5008 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 3.490 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:28:19] !log systemctl restart trafficserver for instances critical in icinga [14:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:56] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1085 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 3.380 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:29:02] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:29:24] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:29:50] PROBLEM - Ensure traffic_server is running for instance backend on cp3060 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:30:16] PROBLEM - Ensure traffic_server is running for instance backend on cp4030 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:30:32] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3054 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 5.728 second response time 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:30:38] PROBLEM - Ensure traffic_server is running for instance backend on cp5008 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:30:58] PROBLEM - Ensure traffic_server is running for instance backend on cp5011 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:31:06] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.1 200 Ok - 35174 bytes in 0.289 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:31:08] PROBLEM - Ensure traffic_server is running for instance backend on cp1085 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:31:14] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3060 is OK: HTTP OK: HTTP/1.1 200 Ok - 34862 bytes in 6.056 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:31:14] PROBLEM - Ensure traffic_server is running for instance backend on cp3062 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:31:16] RECOVERY - Ensure traffic_server is running for instance backend on cp3060 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:31:32] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1083 is OK: HTTP OK: HTTP/1.0 200 OK - 25974 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:31:34] RECOVERY - Ensure traffic_server is running for 
instance backend on cp3058 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:31:42] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3060 is OK: HTTP OK: HTTP/1.0 200 OK - 25980 bytes in 0.259 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:31:52] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:31:56] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.0 200 OK - 26002 bytes in 0.276 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:32:50] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.1 200 Ok - 35227 bytes in 0.268 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:33:18] PROBLEM - Ensure traffic_server is running for instance backend on cp3054 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:33:30] RECOVERY - Ensure traffic_server is running for instance backend on cp5010 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:33:36] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Discovery-Search (Current work): decommission elastic10[18-31].eqiad.wmnet - https://phabricator.wikimedia.org/T239821 (10Cmjohnson) 05Open→03Resolved All of these servers have been removed from the racks, networks switch and moved to offline in... 
[14:33:38] RECOVERY - Ensure traffic_server is running for instance backend on cp5008 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:33:38] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp5010 is OK: HTTP OK: HTTP/1.0 200 OK - 25841 bytes in 0.819 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:33:38] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5008 is OK: HTTP OK: HTTP/1.1 200 Ok - 35112 bytes in 0.846 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:34:00] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp5008 is OK: HTTP OK: HTTP/1.0 200 OK - 25946 bytes in 0.926 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:34:28] PROBLEM - Ensure traffic_server is running for instance backend on cp1089 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:34:39] 10Operations, 10ops-eqiad, 10DC-Ops: decomission oresrdb100[12] - https://phabricator.wikimedia.org/T254238 (10Cmjohnson) @akosiaris Have any of the initial steps been completed with this decom task? 
[14:34:48] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5010 is OK: HTTP OK: HTTP/1.1 200 Ok - 35170 bytes in 0.780 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:34:51] RECOVERY - Ensure traffic_server is running for instance backend on cp4030 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:35:12] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp4030 is OK: HTTP OK: HTTP/1.0 200 OK - 25914 bytes in 0.246 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:35:12] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4030 is OK: HTTP OK: HTTP/1.1 200 Ok - 34960 bytes in 0.259 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:35:24] PROBLEM - Ensure traffic_server is running for instance backend on cp1077 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:35:58] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1077 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 3.050 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:36:06] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:36:20] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1085 is OK: HTTP OK: HTTP/1.0 200 OK - 25863 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:36:21] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp1085 is OK: 
HTTP OK: HTTP/1.1 200 Ok - 34809 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:37:22] RECOVERY - Ensure traffic_server is running for instance backend on cp1085 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:37:44] RECOVERY - Ensure traffic_server is running for instance backend on cp1089 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:38:29] (03CR) 10QChris: [C: 04-1] gerrit: add parameter for db_name, let gerrit1002 use test db (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/604343 (https://phabricator.wikimedia.org/T254516) (owner: 10Dzahn) [14:38:30] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:38:30] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:39:22] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1087 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 8.742 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:39:28] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3064 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 3.211 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:39:44] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server 
[14:39:46] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp4028 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 3.182 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:39:59] 10Operations, 10ops-eqiad, 10DC-Ops: decomission oresrdb100[12] - https://phabricator.wikimedia.org/T254238 (10MoritzMuehlenhoff) [14:40:39] 10Operations, 10ops-eqiad, 10DC-Ops: decomission oresrdb100[12] - https://phabricator.wikimedia.org/T254238 (10MoritzMuehlenhoff) @Cmjohnson No, these are still running. I've updated the task to use the new decom checklist to better reflect the status quo. [14:41:32] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1081 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 6.722 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:41:48] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp2035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:41:48] PROBLEM - Ensure traffic_server is running for instance backend on cp4028 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:41:59] (03CR) 10Muehlenhoff: [C: 03+2] Create debmonitor-client system user using systemd-sysusers [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/599049 (owner: 10Muehlenhoff) [14:42:03] 10Operations, 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban: Create a profile to standardize the deployment of JVM packages and configurations - https://phabricator.wikimedia.org/T253553 (10elukey) [14:42:10] (03CR) 10QChris: [C: 04-1] "Forgot to say: Thanks for jumping on this! You're awesome!" 
[puppet] - 10https://gerrit.wikimedia.org/r/604343 (https://phabricator.wikimedia.org/T254516) (owner: 10Dzahn) [14:42:16] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10elukey) [14:42:43] 10Operations, 10Analytics, 10Analytics-Cluster, 10netops: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10elukey) [14:42:50] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2035 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 3.111 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:42:58] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:43:16] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2039 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 3.101 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:43:28] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp2039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:43:51] !log A:cp rolling systemctl restart trafficserver [14:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:22] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:44:34] PROBLEM - Ensure traffic_server is running for instance backend on cp1081 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:44:44] PROBLEM - Ensure traffic_server is running for instance backend on cp2035 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:44:46] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp4031 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 3.202 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:44:50] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2037 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 3.111 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:45:08] PROBLEM - Ensure traffic_server is running for instance backend on cp3064 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:45:12] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp2037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:45:54] PROBLEM - Ensure traffic_server is running for instance backend on cp2039 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:45:58] RECOVERY - Ensure traffic_server is running for instance backend on cp3052 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:46:09] (03CR) 10Dzahn: [C: 03+2] add management and production IPs for mw2335-mw2339 [dns] - 10https://gerrit.wikimedia.org/r/604339 (https://phabricator.wikimedia.org/T247021) (owner: 10Dzahn) [14:46:20] RECOVERY - Ensure 
traffic_exporter binds on port 9122 and responds to HTTP requests on cp2035 is OK: HTTP OK: HTTP/1.0 200 OK - 25873 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:46:20] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3062 is OK: HTTP OK: HTTP/1.0 200 OK - 25881 bytes in 5.678 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:46:26] PROBLEM - Ensure traffic_server is running for instance backend on cp4031 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:46:28] RECOVERY - Ensure traffic_server is running for instance backend on cp2035 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:46:28] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp2031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:46:30] PROBLEM - Ensure traffic_server is running for instance backend on cp2037 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:46:36] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4028 is OK: HTTP OK: HTTP/1.1 200 Ok - 34681 bytes in 6.776 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:46:38] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp4028 is OK: HTTP OK: HTTP/1.0 200 OK - 25912 bytes in 0.245 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:46:48] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2031 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric 
output - 665 bytes in 3.094 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:46:52] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp2035 is OK: HTTP OK: HTTP/1.1 200 Ok - 34872 bytes in 0.116 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:46:52] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3062 is OK: HTTP OK: HTTP/1.1 200 Ok - 35192 bytes in 0.268 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:47:00] RECOVERY - Ensure traffic_server is running for instance backend on cp4028 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:47:24] 10Operations, 10User-MoritzMuehlenhoff: planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1) - https://phabricator.wikimedia.org/T253824 (10MoritzMuehlenhoff) Reported to Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=962535 [14:47:26] RECOVERY - Ensure traffic_server is running for instance backend on cp5011 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:47:44] 10Operations, 10Analytics, 10Analytics-Cluster, 10netops: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10Ottomata) Ya if possible, the schema should be named and modeled after what the event represents. In this case it sounds like it is something like a 'network co... 
[14:47:46] RECOVERY - Ensure traffic_server is running for instance backend on cp3062 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:48:04] RECOVERY - Ensure traffic_server is running for instance backend on cp1081 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:48:06] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp1081 is OK: HTTP OK: HTTP/1.1 200 Ok - 35115 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:48:11] RECOVERY - Ensure traffic_server is running for instance backend on cp4031 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:48:18] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp4031 is OK: HTTP OK: HTTP/1.0 200 OK - 25847 bytes in 3.700 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:48:18] RECOVERY - Ensure traffic_server is running for instance backend on cp2037 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:48:18] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2037 is OK: HTTP OK: HTTP/1.0 200 OK - 25868 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:48:22] (03CR) 10Dzahn: [C: 03+2] site: add new appservers mw2335 through mw2339 [puppet] - 10https://gerrit.wikimedia.org/r/599749 (https://phabricator.wikimedia.org/T241852) (owner: 10Dzahn) [14:48:24] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3054 is OK: HTTP OK: HTTP/1.1 200 Ok - 35138 bytes in 0.289 second response time 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:48:24] RECOVERY - Ensure traffic_server is running for instance backend on cp3054 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:48:28] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1081 is OK: HTTP OK: HTTP/1.0 200 OK - 25950 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:48:34] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp2037 is OK: HTTP OK: HTTP/1.1 200 Ok - 34878 bytes in 0.116 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:48:36] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3054 is OK: HTTP OK: HTTP/1.0 200 OK - 25988 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:48:38] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2029 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 3.122 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:48:40] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp1077 is OK: HTTP OK: HTTP/1.1 200 Ok - 35128 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:48:44] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp2029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:48:58] RECOVERY - Ensure traffic_server is running for instance backend on cp1077 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:49:30] RECOVERY - Ensure traffic_manager binds on 3128 
and responds to HTTP requests on cp4031 is OK: HTTP OK: HTTP/1.1 200 Ok - 34956 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:49:30] PROBLEM - Ensure traffic_server is running for instance backend on cp2031 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:49:40] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1077 is OK: HTTP OK: HTTP/1.0 200 OK - 25983 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:49:43] (03PS1) 10Majavah: Add wsgi-file for meet-accountmanager [puppet] - 10https://gerrit.wikimedia.org/r/604409 (https://phabricator.wikimedia.org/T251034) [14:49:52] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp2031 is OK: HTTP OK: HTTP/1.1 200 Ok - 34516 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:50:22] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2031 is OK: HTTP OK: HTTP/1.0 200 OK - 25873 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:50:28] PROBLEM - Ensure traffic_server is running for instance backend on cp2029 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:50:30] (03PS1) 10Jbond: cookbooks sre.pdus: add uptime cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/604411 (https://phabricator.wikimedia.org/T246890) [14:50:30] RECOVERY - Ensure traffic_server is running for instance backend on cp3064 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:50:31] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP 
requests on cp1087 is OK: HTTP OK: HTTP/1.1 200 Ok - 34887 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:51:14] 10Operations, 10Analytics-Cluster, 10Analytics-Radar, 10observability: Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (10elukey) [14:51:16] RECOVERY - Ensure traffic_server is running for instance backend on cp2031 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:51:28] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1087 is OK: HTTP OK: HTTP/1.0 200 OK - 25952 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:51:28] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3064 is OK: HTTP OK: HTTP/1.1 200 Ok - 35221 bytes in 0.288 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:51:40] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3064 is OK: HTTP OK: HTTP/1.0 200 OK - 26007 bytes in 0.275 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:52:08] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2039 is OK: HTTP OK: HTTP/1.0 200 OK - 25879 bytes in 0.142 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:52:14] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp2039 is OK: HTTP OK: HTTP/1.1 200 Ok - 34868 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:52:32] (03CR) 10Dzahn: [C: 03+2] Add wsgi-file for meet-accountmanager [puppet] - 10https://gerrit.wikimedia.org/r/604409 (https://phabricator.wikimedia.org/T251034) (owner: 10Majavah) [14:53:00] RECOVERY - Ensure traffic_server is running for instance backend 
on cp2039 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:53:26] (03CR) 10jerkins-bot: [V: 04-1] cookbooks sre.pdus: add uptime cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/604411 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [14:54:02] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2029 is OK: HTTP OK: HTTP/1.0 200 OK - 25922 bytes in 0.116 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:54:02] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp2029 is OK: HTTP OK: HTTP/1.1 200 Ok - 34937 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:54:03] (03PS2) 10Jbond: cookbooks sre.pdus: add uptime cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/604411 (https://phabricator.wikimedia.org/T246890) [14:54:04] RECOVERY - Ensure traffic_server is running for instance backend on cp2029 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:54:09] (03CR) 10Ssingh: "Thanks for the review! 
Additionally, here is the pcc output: https://puppet-compiler.wmflabs.org/compiler1003/23145/ (one of the DNS boxes" [puppet] - 10https://gerrit.wikimedia.org/r/604075 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [14:55:45] (03PS3) 10Dzahn: site: add new appservers mw2335 through mw2339 [puppet] - 10https://gerrit.wikimedia.org/r/599749 (https://phabricator.wikimedia.org/T241852) [14:56:51] (03PS3) 10Jbond: cookbooks sre.pdus: add uptime cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/604411 (https://phabricator.wikimedia.org/T246890) [14:57:42] (03PS1) 10Kormat: install_server: Better error reporting for reuse-parts [puppet] - 10https://gerrit.wikimedia.org/r/604413 (https://phabricator.wikimedia.org/T254982) [14:57:57] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Papaul) ` [edit interfaces interface-range vlan-private1-c-codfw] member xe-7/0/3 { ... } + member ge-3/0/3; + member ge-3/0/4; +... 
[14:58:44] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/23146/ [cloudservices1003.wikimedia.org]" [puppet] - 10https://gerrit.wikimedia.org/r/604075 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [14:58:57] (03CR) 10jerkins-bot: [V: 04-1] cookbooks sre.pdus: add uptime cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/604411 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [14:59:11] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Papaul) [14:59:30] (03PS1) 10Filippo Giunchedi: swift: allow x-static-large-object header [puppet] - 10https://gerrit.wikimedia.org/r/604415 (https://phabricator.wikimedia.org/T254852) [14:59:40] (03PS4) 10Jbond: cookbooks sre.pdus: add uptime cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/604411 (https://phabricator.wikimedia.org/T246890) [15:01:20] (03PS1) 10Muehlenhoff: Update changelog [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/604416 [15:02:52] !log A:cp-ulsfo: rolling ats-tls-restart to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604305/ T255015 [15:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:56] T255015: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 [15:04:01] (03PS4) 10Dzahn: site: add new appservers mw2335 through mw2339 [puppet] - 10https://gerrit.wikimedia.org/r/599749 (https://phabricator.wikimedia.org/T241852) [15:04:17] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2335.codfw.wmnet ` The log can be... 
[15:06:07] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install rdb200[78] - https://phabricator.wikimedia.org/T251626 (10Papaul) rdb2007 is having some HW issues and I opened a case with Dell. A call is set for tomorrow at 10am CT. See below for case information Hello Papaul Your case # is SR 65924... [15:08:01] !log roll-restart prometheus k8s to enable thanos upload [15:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:38] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: enable Thanos upload for k8s [puppet] - 10https://gerrit.wikimedia.org/r/602715 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [15:08:52] (03PS5) 10Dzahn: site: add new appservers mw2335 through mw2339 [puppet] - 10https://gerrit.wikimedia.org/r/599749 (https://phabricator.wikimedia.org/T241852) [15:09:37] (03CR) 10jerkins-bot: [V: 04-1] cookbooks sre.pdus: add uptime cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/604411 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [15:10:42] (03PS6) 10Dzahn: site: add new appservers mw2335 through mw2339 [puppet] - 10https://gerrit.wikimedia.org/r/599749 (https://phabricator.wikimedia.org/T241852) [15:11:25] (03CR) 10Dzahn: [C: 03+2] site: add new appservers mw2335 through mw2339 [puppet] - 10https://gerrit.wikimedia.org/r/599749 (https://phabricator.wikimedia.org/T241852) (owner: 10Dzahn) [15:11:45] there will be a bunch of alerts re: prometheus restarting, expected [15:11:57] (03CR) 10Jcrespo: [C: 03+2] setup.py: Add RemoteExecution module to setup.py [software/transferpy] - 10https://gerrit.wikimedia.org/r/602879 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [15:12:47] (03CR) 10Cwhite: [C: 03+1] swift: allow x-static-large-object header [puppet] - 10https://gerrit.wikimedia.org/r/604415 (https://phabricator.wikimedia.org/T254852) (owner: 10Filippo Giunchedi) [15:14:24] 10Puppet, 10Wikimedia Meet, 10Patch-For-Review: Puppetize the meet 
account manager - https://phabricator.wikimedia.org/T251034 (10Dzahn) The role meet::accountmanager has been applied to the instance meet-auth. Puppet ran and created the user/group "meet-auth". Then I moved the /srv/meet-auth directory out... [15:17:19] * Pchelolo and _joe_ are gonna deploy some MW config changes [15:18:03] (03PS5) 10Ppchelko: [No-op]: Add precautions for kafka-purges before transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603649 (https://phabricator.wikimedia.org/T250781) [15:18:03] * _joe_ waves [15:18:14] (03CR) 10Ppchelko: [C: 03+2] [No-op]: Add precautions for kafka-purges before transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603649 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [15:19:12] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: allow x-static-large-object header [puppet] - 10https://gerrit.wikimedia.org/r/604415 (https://phabricator.wikimedia.org/T254852) (owner: 10Filippo Giunchedi) [15:19:13] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [15:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:38] (03Merged) 10jenkins-bot: [No-op]: Add precautions for kafka-purges before transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603649 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [15:19:48] PROBLEM - Prometheus prometheus1003/k8s restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9906 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s [15:19:56] PROBLEM - Prometheus prometheus2004/k8s restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1:9906 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted 
https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s [15:20:11] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2336.codfw.wmnet ` The log can be... [15:20:12] (03PS5) 10Jbond: cookbooks sre.pdus: add uptime cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/604411 (https://phabricator.wikimedia.org/T246890) [15:21:04] (03CR) 10Volans: [C: 03+1] "LGTM, one optional thing inline" (031 comment) [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/604416 (owner: 10Muehlenhoff) [15:21:19] (03PS6) 10Jbond: cookbooks sre.pdus: add uptime cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/604411 (https://phabricator.wikimedia.org/T246890) [15:21:40] PROBLEM - Prometheus prometheus2003/k8s restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1:9906 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s [15:21:45] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:38] 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Investigate how automated tasks can authenticate against CAS - https://phabricator.wikimedia.org/T239323 (10MoritzMuehlenhoff) >>! In T239323#6165214, @jbond wrote: > I had a look at this on the CAS side and i think it would be doable to add some l... 
[15:25:30] PROBLEM - Prometheus prometheus1004/k8s restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9906 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s [15:25:38] RECOVERY - Thanos compact has disappeared from Prometheus discovery on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview [15:26:05] ACKNOWLEDGEMENT - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Telia 01174105 - Our vendor is hoping to have the issue resolved within the next 3 hours. - The acknowledgement expires at: 2020-06-10 18:25:18. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:26:05] ACKNOWLEDGEMENT - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 60, down: 2, dormant: 0, excluded: 0, unused: 0: Ayounsi Telia 01174105 - Our vendor is hoping to have the issue resolved within the next 3 hours. - The acknowledgement expires at: 2020-06-10 18:25:18. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:26:05] ACKNOWLEDGEMENT - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Telia 01174105 - Our vendor is hoping to have the issue resolved within the next 3 hours. - The acknowledgement expires at: 2020-06-10 18:25:18. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:26:50] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2335.codfw.wmnet'] ` and were **ALL** successful. [15:26:53] (03CR) 10Muehlenhoff: cas-icinga: Add an entry point for the external monitoring script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598742 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond) [15:26:56] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:27:39] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Make kafka purges config more robust, gerrit:603649, IS.php (duration: 01m 08s) [15:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:58] (03PS2) 10Ppchelko: Enable kafka purges everywhere. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603654 (https://phabricator.wikimedia.org/T250781) [15:29:03] (03PS1) 10Hnowlan: changeprop-jobqueue: add beta configuration skeleton [deployment-charts] - 10https://gerrit.wikimedia.org/r/604425 (https://phabricator.wikimedia.org/T220399) [15:29:07] (03CR) 10Jbond: "Ready for review" [cookbooks] - 10https://gerrit.wikimedia.org/r/604411 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [15:29:11] !log ppchelko@deploy1001 Synchronized wmf-config/CommonSettings.php: Make kafka purges config more robust, gerrit:603649, CS.php (duration: 01m 05s) [15:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:54] (03CR) 10Ppchelko: [C: 03+2] Enable kafka purges everywhere. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/603654 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [15:30:09] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2337.codfw.wmnet ` The log can be... [15:30:21] (03CR) 10Dzahn: gerrit: add parameter for db_name, let gerrit1002 use test db (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/604343 (https://phabricator.wikimedia.org/T254516) (owner: 10Dzahn) [15:30:41] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/604416 (owner: 10Muehlenhoff) [15:30:50] (03Merged) 10jenkins-bot: Enable kafka purges everywhere. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603654 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [15:32:37] !log remaining-cp (non-ulsfo): rolling ats-tls-restart to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604305/ T255015 [15:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:41] T255015: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 [15:34:18] (03PS5) 10Dzahn: gerrit: add parameter for db_name, let gerrit1002 use test db [puppet] - 10https://gerrit.wikimedia.org/r/604343 (https://phabricator.wikimedia.org/T254516) [15:35:07] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [15:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:38] (03CR) 10Gilles: "recheck" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/598416 (https://phabricator.wikimedia.org/T253375) (owner: 10Gilles) [15:36:08] 10Operations, 10Analytics, 10Analytics-Cluster, 10netops: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 
(10elukey) `netflow/flow/record` or `netflow/flow/observe` could be ok? [15:36:20] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Send kafka purges everywhere, gerrit:603654 (duration: 01m 05s) [15:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:43] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:37:44] <_joe_> it is trending up [15:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:28] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [15:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:23] RECOVERY - Prometheus prometheus1003/k8s restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s [15:42:29] RECOVERY - Prometheus prometheus2003/k8s restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s [15:43:52] (03PS3) 10Privacybatm: transferpy: Remove wmfmariadbpy package [software/transferpy] - 10https://gerrit.wikimedia.org/r/602618 (https://phabricator.wikimedia.org/T248256) [15:44:01] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2336.codfw.wmnet'] ` and were **ALL** successful. 
[15:45:06] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [15:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:22] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2338.codfw.wmnet ` The log can be... [15:45:39] (03PS4) 10Privacybatm: transferpy: Remove wmfmariadbpy package [software/transferpy] - 10https://gerrit.wikimedia.org/r/602618 (https://phabricator.wikimedia.org/T248256) [15:45:41] (03PS8) 10Privacybatm: Write documentation using Sphinx [software/transferpy] - 10https://gerrit.wikimedia.org/r/602719 (https://phabricator.wikimedia.org/T253219) [15:46:01] (03PS2) 10Gilles: Fix Python 3 compatibility and flake8 errors [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/598416 (https://phabricator.wikimedia.org/T253375) [15:46:44] (03PS3) 10Ppchelko: Disable HTCP purges where kafka purges are enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603655 (https://phabricator.wikimedia.org/T250781) [15:49:22] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:11] (03CR) 10Gilles: [C: 03+2] Fix Python 3 compatibility and flake8 errors [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/598416 (https://phabricator.wikimedia.org/T253375) (owner: 10Gilles) [15:51:01] (03Merged) 10jenkins-bot: Fix Python 3 compatibility and flake8 errors [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/598416 (https://phabricator.wikimedia.org/T253375) (owner: 10Gilles) [15:51:41] 10Operations, 10Analytics, 10Analytics-Cluster, 10netops: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10Ottomata) I'd guess I'd still ask 
what is a "netflow"? Or a "netflow/flow"? I guess if you could define 'netflow' as a noun in the description of the schema,... [15:52:54] (03PS1) 10Dzahn: site/conftool: add mw2335-mw2339 as appservers [puppet] - 10https://gerrit.wikimedia.org/r/604429 (https://phabricator.wikimedia.org/T241852) [15:54:12] (03PS5) 10Gilles: Set expiry headers on thumbnails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) [15:55:11] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review: Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (10elukey) Side note: if not already done, I'd double check how the WarmUp route behaves when the local memca... [15:55:33] (03CR) 10Volans: [C: 03+1] "I like the new structure, it's much easier to follow and makes sense to me (although I don't have much knowledge of junos configs). One qu" (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/547584 (https://phabricator.wikimedia.org/T250429) (owner: 10Ayounsi) [15:55:34] RECOVERY - Prometheus prometheus2004/k8s restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s [15:55:40] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2337.codfw.wmnet'] ` and were **ALL** successful.
[15:56:24] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2339.codfw.wmnet ` The log can be... [15:57:42] RECOVERY - Prometheus prometheus1004/k8s restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s [15:58:15] (03CR) 10Volans: cas-icinga: Add an entry point for the external monitoring script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598742 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond) [15:58:31] 10Operations, 10Analytics, 10Analytics-Cluster, 10netops: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10Ottomata) observe could be ok too, sorry didn't mean to make observation sound better or worse. netflow/observe event sounds a little weird but does seem consis... [15:59:21] 10Operations, 10Analytics, 10Analytics-Cluster, 10netops: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10Ottomata) ALSO these are just ideas and thoughts! 
Schemas in secondary repo SHOULD require less bikeshedding than those in primary :) [15:59:23] (03PS1) 10Ema: cache: make upload consume purges from kafka [puppet] - 10https://gerrit.wikimedia.org/r/604430 (https://phabricator.wikimedia.org/T133821) [16:00:15] (03PS2) 10Ema: cache: make upload consume purges from kafka [puppet] - 10https://gerrit.wikimedia.org/r/604430 (https://phabricator.wikimedia.org/T133821) [16:00:21] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [16:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:24] RECOVERY - Check systemd state on thanos-fe2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:00:56] (03CR) 10Muehlenhoff: [C: 03+2] Update changelog [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/604416 (owner: 10Muehlenhoff) [16:01:07] (03CR) 10Ppchelko: [C: 03+1] cache: make upload consume purges from kafka [puppet] - 10https://gerrit.wikimedia.org/r/604430 (https://phabricator.wikimedia.org/T133821) (owner: 10Ema) [16:01:14] 10Operations, 10Analytics, 10Analytics-Cluster, 10netops: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10elukey) Netflow in theory is the name of the technology/protocol (see https://tools.ietf.org/html/rfc3954), and IIUC it defines a "flow" as the bytes/packets exc... 
[16:01:40] PROBLEM - Thanos compact has not run on icinga1001 is CRITICAL: 4.422e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [16:01:45] (03CR) 10Muehlenhoff: [C: 03+2] Update changelog (031 comment) [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/604416 (owner: 10Muehlenhoff) [16:01:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] cache: make upload consume purges from kafka [puppet] - 10https://gerrit.wikimedia.org/r/604430 (https://phabricator.wikimedia.org/T133821) (owner: 10Ema) [16:02:54] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:57] (03CR) 10Andrew Bogott: [C: 03+1] "looks good, thanks for the second pcc run" [puppet] - 10https://gerrit.wikimedia.org/r/604075 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:03:02] (03CR) 10Ema: [C: 03+2] cache: make upload consume purges from kafka [puppet] - 10https://gerrit.wikimedia.org/r/604430 (https://phabricator.wikimedia.org/T133821) (owner: 10Ema) [16:03:19] (03PS2) 10Dzahn: phabricator: change sender address of community_metrics mail [puppet] - 10https://gerrit.wikimedia.org/r/603445 [16:06:06] !log cp3051: restart purged to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604430/ T250781 T133821 [16:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:12] T133821: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 [16:06:12] T250781: Move all purge traffic to kafka - https://phabricator.wikimedia.org/T250781 [16:08:51] 10Operations, 10ops-codfw, 10decommission, 10serviceops: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (10Papaul) @Dzahn yes i have to setup all the decom servers to offline [16:08:59] (03PS6) 10Gilles: 
Set expiry headers on thumbnails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) [16:09:12] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2338.codfw.wmnet'] ` and were **ALL** successful. [16:09:42] (03CR) 10Aklapper: [C: 03+1] "Shrooog :)" [puppet] - 10https://gerrit.wikimedia.org/r/603445 (owner: 10Dzahn) [16:09:46] (03Abandoned) 10Bstorm: dumps-distribution: don't monitor systemd directly for paging [puppet] - 10https://gerrit.wikimedia.org/r/601374 (owner: 10Bstorm) [16:09:52] (03PS5) 10Privacybatm: transferpy: Remove wmfmariadbpy package [software/transferpy] - 10https://gerrit.wikimedia.org/r/602618 (https://phabricator.wikimedia.org/T248256) [16:09:54] (03PS3) 10Privacybatm: transferpy: Package transferpy [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) [16:10:19] (03CR) 10Gilles: Set expiry headers on thumbnails (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles) [16:10:50] (03PS4) 10Privacybatm: [WIP] transferpy: Package transferpy [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) [16:11:17] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [16:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:59] (03PS2) 10Elukey: Add analytics-product system user [puppet] - 10https://gerrit.wikimedia.org/r/595540 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [16:12:07] !log restart purged on all cache hosts to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604430/ T250781 T133821 [16:12:08] 10Operations, 10Traffic, 10Services (watching), 
10Sustainability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820 (10Krinkle) [16:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:13] T133821: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 [16:12:13] T250781: Move all purge traffic to kafka - https://phabricator.wikimedia.org/T250781 [16:12:24] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [16:12:24] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [16:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:33] (03PS1) 10Filippo Giunchedi: site: add thanos-fe1* to frontends [puppet] - 10https://gerrit.wikimedia.org/r/604433 (https://phabricator.wikimedia.org/T233956) [16:12:50] (03CR) 10jerkins-bot: [V: 04-1] Add analytics-product system user [puppet] - 10https://gerrit.wikimedia.org/r/595540 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [16:12:58] 10Operations, 10Traffic, 10Services (watching), 10Sustainability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820 (10Krinkle) I've updated the task description to include: * the grandfathering of action=rollback (as agreed a few years ago)... [16:13:00] !log correction: restart purged on all *cache_upload* hosts to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604430/ T250781 T133821 [16:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:42] 10Operations, 10Analytics, 10Analytics-Cluster, 10netops: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10Ottomata) Ah, makes more sense! Great.
If pmacct is aggregating, perhaps summary is good in the name? [16:13:58] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:22] (03CR) 10Filippo Giunchedi: [C: 03+2] site: add thanos-fe1* to frontends [puppet] - 10https://gerrit.wikimedia.org/r/604433 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [16:15:30] (03PS2) 10Filippo Giunchedi: site: add thanos-fe1* to frontends [puppet] - 10https://gerrit.wikimedia.org/r/604433 (https://phabricator.wikimedia.org/T233956) [16:16:38] (03CR) 10Ssingh: [C: 03+2] dnsrecursor: make forward-zones and edns-subnet-whitelist optional [puppet] - 10https://gerrit.wikimedia.org/r/604075 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:16:54] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Product-Analytics (Kanban): Creation of a new POSIX group and system user for the Product Analytics team - https://phabricator.wikimedia.org/T255039 (10elukey) [16:17:13] sukhe: merging your patch too [16:17:23] godog: thank you! [16:18:01] np! {{done}} [16:18:32] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [16:18:32] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [16:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:07] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2339.codfw.wmnet'] ` and were **ALL** successful. 
[16:19:46] (03PS4) 10Rush: peek: Reenable cron with correct params [puppet] - 10https://gerrit.wikimedia.org/r/601173 (https://phabricator.wikimedia.org/T254127) (owner: 10Jcrespo) [16:21:06] (03PS5) 10Rush: peek: Reenable cron with correct params [puppet] - 10https://gerrit.wikimedia.org/r/601173 (https://phabricator.wikimedia.org/T254127) (owner: 10Jcrespo) [16:21:45] (03PS1) 10Ladsgroup: meet: Change the account manager socket [puppet] - 10https://gerrit.wikimedia.org/r/604434 (https://phabricator.wikimedia.org/T251034) [16:22:13] (03CR) 10Rush: [C: 03+2] peek: Reenable cron with correct params [puppet] - 10https://gerrit.wikimedia.org/r/601173 (https://phabricator.wikimedia.org/T254127) (owner: 10Jcrespo) [16:22:39] (03CR) 10Rush: [C: 03+2] "thanks filippo, ariel, and jaime :)" [puppet] - 10https://gerrit.wikimedia.org/r/601173 (https://phabricator.wikimedia.org/T254127) (owner: 10Jcrespo) [16:23:35] (03PS2) 10Bstorm: cloudstore: make systemd paging email only for cloudstore1008/9 [puppet] - 10https://gerrit.wikimedia.org/r/601756 [16:25:01] (03CR) 10Bstorm: [C: 03+2] cloudstore: make systemd paging email only for cloudstore1008/9 [puppet] - 10https://gerrit.wikimedia.org/r/601756 (owner: 10Bstorm) [16:26:01] (03CR) 10Ema: [C: 03+1] Disable HTCP purges where kafka purges are enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603655 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [16:26:19] (03CR) 10Hnowlan: "Tested a version of this generated configuration on deployment-docker-cpjobqueue01.deployment-prep.eqiad.wmflabs, starts successfully." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/604425 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [16:26:48] * Pchelolo is taking over MW deploy for one more thing [16:27:12] (03CR) 10Ppchelko: [C: 03+2] Disable HTCP purges where kafka purges are enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603655 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [16:27:52] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [16:27:52] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [16:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:03] (03Merged) 10jenkins-bot: Disable HTCP purges where kafka purges are enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603655 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [16:28:36] 10Operations, 10SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (10MNovotny_WMF) I authorize this request! thank you! 
[16:28:53] (03CR) 10Rush: [C: 03+2] "seems cool to me, thanks dzahn" [puppet] - 10https://gerrit.wikimedia.org/r/602602 (owner: 10Dzahn) [16:29:03] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Papaul) [16:30:25] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Papaul) 05Open→03Resolved @Dzahn the 5 servers in C3 are ready for services [16:31:12] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Disable HTCP purges everywhere, gerrit:603655 (duration: 01m 05s) [16:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:31] 10Operations, 10Peek, 10Phabricator, 10Security-Team, 10Patch-For-Review: peek is incorrectly configured to run every minute every 1st of the month, creating large amounts of cronspam - https://phabricator.wikimedia.org/T254127 (10chasemp) `sudo -u peek crontab -l # HEADER: This file was autogenerated at... 
[16:31:37] 10Operations, 10Peek, 10Phabricator, 10Security-Team, 10Patch-For-Review: peek is incorrectly configured to run every minute every 1st of the month, creating large amounts of cronspam - https://phabricator.wikimedia.org/T254127 (10chasemp) 05Open→03Resolved [16:31:45] (03CR) 10Dzahn: [C: 03+2] meet: Change the account manager socket [puppet] - 10https://gerrit.wikimedia.org/r/604434 (https://phabricator.wikimedia.org/T251034) (owner: 10Ladsgroup) [16:32:12] (03PS2) 10Hnowlan: changeprop-jobqueue: add beta configuration skeleton [deployment-charts] - 10https://gerrit.wikimedia.org/r/604425 (https://phabricator.wikimedia.org/T220399) [16:36:59] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: enable Thanos upload for ops in esams [puppet] - 10https://gerrit.wikimedia.org/r/602717 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [16:38:06] 10Operations, 10Traffic, 10Patch-For-Review: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 (10ema) [16:38:38] (03PS3) 10Cwhite: hiera: install mtail 3.0.0~rc35 from component in esams and eqiad [puppet] - 10https://gerrit.wikimedia.org/r/599474 (https://phabricator.wikimedia.org/T251466) [16:38:57] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10Services, 10Service-deployment-requests: New Service Request: Wikimedia push notification service - https://phabricator.wikimedia.org/T250452 (10Mholloway) [16:39:15] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10Services, 10Service-deployment-requests: New Service Request: Wikimedia push notification service - https://phabricator.wikimedia.org/T250452 (10Mholloway) [16:39:45] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10Services, 10Service-deployment-requests: New Service Request: Wikimedia push notification service - https://phabricator.wikimedia.org/T250452 (10Mholloway) [16:39:50] !log restart 
prometheus@ops in eqiad [16:39:57] !log EDIT: in esams [16:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:49] 10Operations, 10Traffic, 10Core Platform Team Workboards (Clinic Duty Team), 10MW-1.35-notes (1.35.0-wmf.37; 2020-06-16): Move all purge traffic to kafka - https://phabricator.wikimedia.org/T250781 (10Pchelolo) Big wikis are now done, but there's still a bit of long tail work left - Convert beta cluster to... [16:43:37] (03PS1) 10Andrew Bogott: keystone: add service user toolsbeta-dns-manager [puppet] - 10https://gerrit.wikimedia.org/r/604440 (https://phabricator.wikimedia.org/T252762) [16:44:08] (03CR) 10Bstorm: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/602052 (owner: 10Dzahn) [16:46:26] (03Restored) 10Bstorm: dumps-distribution: don't monitor systemd directly for paging [puppet] - 10https://gerrit.wikimedia.org/r/601374 (owner: 10Bstorm) [16:46:32] English Wikisource just started getting reports of some visual glitches that, I'm guessing, are the result of an interaction between local site javascript and today's deploy. [16:47:22] (03CR) 10Andrew Bogott: [C: 03+2] keystone: add service user toolsbeta-dns-manager [puppet] - 10https://gerrit.wikimedia.org/r/604440 (https://phabricator.wikimedia.org/T252762) (owner: 10Andrew Bogott) [16:48:04] (03CR) 10Bstorm: "Restoring! The other patch was reverted because it causes more issues. 
This is the server set that actually triggered the WMCS desire for " [puppet] - 10https://gerrit.wikimedia.org/r/601374 (owner: 10Bstorm) [16:48:36] PROBLEM - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [16:48:43] (03PS1) 10Ladsgroup: meet: Use python3 in uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/604444 (https://phabricator.wikimedia.org/T251034) [16:49:08] PROBLEM - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [16:49:25] (03PS1) 10Bstorm: cloudstore: disable systemd monitoring [puppet] - 10https://gerrit.wikimedia.org/r/604445 [16:49:29] PROBLEM - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [16:49:56] PROBLEM - Prometheus bast3004/ops restarted: beware possible monitoring artifacts on bast3004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops [16:49:57] Sigh. So, no release notes for 1.35/wmf.36? 
[16:50:42] (03CR) 10Bstorm: "I had edited things to just email us because of I0c28fd2ce5c3620532f, but that patch was reverted. So I'm back to this." [puppet] - 10https://gerrit.wikimedia.org/r/604445 (owner: 10Bstorm) [16:51:06] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:51:27] xover: https://www.mediawiki.org/wiki/MediaWiki_1.35/wmf.36/Changelog [16:51:41] are you looking for something specific? [16:52:12] Majavah: Thanks. It's just the link from the last tech news that leads to an empty page. [16:52:24] (03PS4) 10Bstorm: dumps-distribution: don't monitor systemd directly for paging [puppet] - 10https://gerrit.wikimedia.org/r/601374 [16:52:51] I'm looking for the likely cause of some visual glitches that just hit English Wikisource. [16:53:27] I'm guessing local javascript interacting badly with markup changes in wmf.36. [16:53:50] hmh, there's an issue with watchlist/recent changes that's known [16:54:16] PROBLEM - Thanos swift https on thanos-fe1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.007 second response time https://wikitech.wikimedia.org/wiki/Thanos [16:55:04] Not likely that one. [16:55:10] (03CR) 10Dzahn: [C: 03+2] meet: Use python3 in uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/604444 (https://phabricator.wikimedia.org/T251034) (owner: 10Ladsgroup) [16:55:29] See the doubled up header here: https://en.wikisource.org/wiki/Ponsonby,_Henry_(DNB00) [16:56:07] The content comes from a template, but it's being moved around the DOM by code called via Common.js. [16:56:29] (long story, it kinda makes sense to do it that way. kinda.) 
[16:57:10] hmh i see [16:57:52] only happens in vector, so that's where I'd start looking first [16:58:41] Not seeing anything obvious in the changelog. [16:58:44] PROBLEM - Memcached on thanos-fe1001 is CRITICAL: connect to address 10.64.0.136 and port 11211: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:59:03] But tech news had a bunch of changes related to self-closing tags. [16:59:16] andrewbogott: merged your change [16:59:24] thanks [17:00:01] the thanos-fe1* alerts is me, known [17:01:16] PROBLEM - Thanos swift https on thanos-fe1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.007 second response time https://wikitech.wikimedia.org/wiki/Thanos [17:01:20] xover: hmh I'm unfortunately out of clues, might be worth asking in #wikimedia-tech thru [17:02:40] (03PS1) 10Herron: icinga: include notification type in host alert email subjects [puppet] - 10https://gerrit.wikimedia.org/r/604448 [17:03:26] Hmm. I'm seeing two #mw-content-text in the output so something is borked right good and proper. [17:04:26] What's another English wiki that has wmf.36 currently that I can compare against? [17:04:40] (enwp is tomorrow, right?) [17:05:44] xover: wp is tomorrow, everything else should is on .36 [17:05:58] (03PS5) 10Privacybatm: transferpy: Package transferpy [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) [17:06:18] s/should is/is [17:07:39] 10Operations, 10netops: Homer: manage transit BGP sessions - https://phabricator.wikimedia.org/T250136 (10ayounsi) 05Open→03Resolved All done! [17:08:26] Hmm. Not seeing the same duplicate id on en.wikinews, so it's a symptom and not a trigger; which means the culprit is almost certainly our local scripts. 
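Note on the duplicate `#mw-content-text` check discussed above: one way to confirm that a rendered page contains the same `id` more than once is to count `id` attributes in the saved HTML. The following is only a minimal sketch using the Python standard library; `IdCollector` and `duplicate_ids` are hypothetical helper names, not part of MediaWiki or any tool mentioned in this log, and the sample markup is invented for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Counts every non-empty id attribute seen on an opening tag."""

    def __init__(self):
        super().__init__()
        self.ids = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "id" and value:
                self.ids[value] += 1


def duplicate_ids(html):
    """Return {id: count} for every id that occurs more than once."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return {key: n for key, n in collector.ids.items() if n > 1}


# Invented sample resembling the symptom: the content wrapper appears twice.
SAMPLE = (
    '<div id="content"><div id="mw-content-text">'
    '<div id="mw-content-text">text</div></div></div>'
)
print(duplicate_ids(SAMPLE))
```

Running the same function over the HTML of a page from a wiki without the glitch would be expected to return an empty dict, which is the comparison made against en.wikinews above.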
[17:08:28] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:08:40] Thanks for the help, Majavah! [17:09:08] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:10:02] (03CR) 10Privacybatm: "This is in good shape for review." [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm) [17:10:56] RECOVERY - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [17:11:28] RECOVERY - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [17:11:42] RECOVERY - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [17:12:20] RECOVERY - Prometheus bast3004/ops restarted: beware possible monitoring artifacts on bast3004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops [17:12:47] (03PS1) 10Ssingh: wikidough: set up the pdns-recursor [puppet] - 10https://gerrit.wikimedia.org/r/604452 (https://phabricator.wikimedia.org/T252132) [17:19:10] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:21:26] 10Operations, 10Analytics, 10Analytics-Cluster, 10netops: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10Nuria) analytics/network/netflow/flowset? analytics/netflow/flowset? Flowset is how an "event" is called on the protocol linked by @elukey above [17:23:34] (03CR) 10Herron: "PCC https://puppet-compiler.wmflabs.org/compiler1001/23148/" [puppet] - 10https://gerrit.wikimedia.org/r/604448 (owner: 10Herron) [17:23:49] (03PS2) 10Ssingh: wikidough: set up the pdns-recursor [puppet] - 10https://gerrit.wikimedia.org/r/604452 (https://phabricator.wikimedia.org/T252132) [17:24:46] 10Operations, 10Analytics, 10Analytics-Cluster, 10netops: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10Ottomata) I don't think this needs to go in 'analytics', but flowset sounds nice if it is accurate. 
[17:25:38] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/23150/" [puppet] - 10https://gerrit.wikimedia.org/r/604452 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [17:26:36] ottomata: I know that we shouldn't bikeshed too much, just to understand - something like network/netflow/flowset is still not in the naming scheme that we want right? [17:27:10] 10Operations, 10Analytics, 10Analytics-Cluster, 10netops: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10Nuria) network/netflow/flowset? [17:28:37] elukey i do'nt mind you can drop the network/ part [17:31:39] of course wrong chan sorry, got the notification of the task in multiple places :D [17:32:34] (03CR) 10Ppchelko: "can you post the resulting config somewhere?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/604425 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [17:37:22] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:52:41] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/604448 (owner: 10Herron) [17:58:29] jouncebot: next [17:58:29] In 0 hour(s) and 1 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200610T1800) [17:58:29] In 0 hour(s) and 1 minute(s): Morning backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200610T1800) [18:00:04] longma and liw: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200610T1800). [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. 
Rise for Morning backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200610T1800). [18:00:05] MatmaRex: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:33] should i be concerned that there is something else scheduled at the same time as my backport window? :D [18:01:01] MatmaRex: that's triaging window rather than actual change :) [18:01:08] I can SWAT today! [18:01:15] (am I supposed to say that nowadays? :D) [18:01:54] thanks [18:01:55] actually...not merged yet MatmaRex [18:02:31] I'm uncomfortable with pushing code that's not in master out [18:03:09] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption: Method call executed on unrelated object (also: Call to undefined method) - https://phabricator.wikimedia.org/T245183 (10Krinkle) [18:03:20] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption: Method call executed on unrelated object (also: Call to undefined method) - https://phabricator.wikimedia.org/T245183 (10Krinkle) [18:03:41] Urbanecm: hmm, edsanders reviewed it, but we wanted to test it on mwdebug before merging to master [18:04:02] Urbanecm: we can probably just merge it if that makes your life easier. 
assuming ed is here [18:05:10] if it's impossible to test outside of prod and if the +1 has a +2-like meaning, I can do that [18:06:09] well i tested outside of prod, but we want to test in prod too :D because VE config settings are just terrible [18:06:20] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 52 probes of 575 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:07:18] okay, so let's go forward then [18:07:30] thanks [18:07:40] Urbanecm: also, the patch should only affect enwiki and eswiki, so let's test wmf.35 first [18:07:48] okay, ack [18:07:54] (03CR) 10Urbanecm: [C: 03+2] "backport" [extensions/VisualEditor] (wmf/1.35.0-wmf.35) - 10https://gerrit.wikimedia.org/r/604138 (https://phabricator.wikimedia.org/T253941) (owner: 10Bartosz Dziewoński) [18:08:03] merging wmf.35 [18:11:26] (03PS1) 10Ppchelko: Use EventRelayerNull for wikitech kafka purges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604469 (https://phabricator.wikimedia.org/T250781) [18:11:54] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:12:06] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 575 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:12:28] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:15:34] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption: Method call 
executed on unrelated object (also: Call to undefined method) - https://phabricator.wikimedia.org/T245183 (10Krinkle) > `name=Error message > Fatal error: > Cannot declare class Wikimedia\MWConfig\XWikimediaDebug, because... [18:24:14] (03Merged) 10jenkins-bot: Make VisualEditorDisableForAnons only hide the tabs, not disable the editor [extensions/VisualEditor] (wmf/1.35.0-wmf.35) - 10https://gerrit.wikimedia.org/r/604138 (https://phabricator.wikimedia.org/T253941) (owner: 10Bartosz Dziewoński) [18:26:40] MatmaRex: available at mwdebug1001, wmf.35 only [18:27:23] thanks, looking [18:28:07] (03PS4) 10Cwhite: hiera: install mtail 3.0.0~rc35 from component in esams and eqiad [puppet] - 10https://gerrit.wikimedia.org/r/599474 (https://phabricator.wikimedia.org/T251466) [18:29:26] Urbanecm: everything seems good [18:30:20] okay, syncing [18:31:13] (03CR) 10Urbanecm: [C: 03+2] "backport" [extensions/VisualEditor] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604139 (https://phabricator.wikimedia.org/T253941) (owner: 10Bartosz Dziewoński) [18:32:12] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.35/extensions/VisualEditor/: 5f4c609: Make VisualEditorDisableForAnons only hide the tabs, not disable the editor (T253941) (duration: 01m 14s) [18:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:16] T253941: [Regression] VisualEditor not available in WikiEditor's toolbar for logged out users at en.wiki - https://phabricator.wikimedia.org/T253941 [18:32:20] MatmaRex: .35 is synced [18:33:32] (03CR) 10Krinkle: [C: 03+1] Use EventRelayerNull for wikitech kafka purges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604469 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [18:34:04] ok [18:39:14] Urbanecm: can you ping me when done deploying? I wanna sneak one more into this window [18:39:27] Pchelolo: certainly. I'm now waiting for CI [18:47:43] hello, what's up? 
[18:48:26] (03Merged) 10jenkins-bot: Make VisualEditorDisableForAnons only hide the tabs, not disable the editor [extensions/VisualEditor] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604139 (https://phabricator.wikimedia.org/T253941) (owner: 10Bartosz Dziewoński) [18:50:28] edsanders: nothing any more, Urbanecm was concerned that the VE patch ^ i wanted backported wasn't merged into master, but we figured it out [18:50:50] k [18:51:28] MatmaRex: wmf.36 is at mwdebug1001 too [18:52:23] Urbanecm: it looks good, although i can't test much since we don't use the affected setting on any wiki [18:52:31] gotcha, syncing then [18:53:19] (03PS3) 10Dave Pifke: Use PDO for XHGui storage if configured [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603546 (https://phabricator.wikimedia.org/T180761) [18:53:52] (03PS2) 10Herron: icinga: include notification type in host alert email subjects [puppet] - 10https://gerrit.wikimedia.org/r/604448 [18:54:10] (03CR) 10jerkins-bot: [V: 04-1] Use PDO for XHGui storage if configured [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603546 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [18:54:12] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.36/extensions/VisualEditor/: 8958860: Make VisualEditorDisableForAnons only hide the tabs, not disable the editor (T253941) (duration: 01m 07s) [18:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:16] T253941: [Regression] VisualEditor not available in WikiEditor's toolbar for logged out users at en.wiki - https://phabricator.wikimedia.org/T253941 [18:55:22] MatmaRex: synced [18:55:25] Pchelolo: floor is yours [18:55:32] thank you! [18:55:32] thanks Urbanecm! [18:55:38] happy to help! 
[18:56:42] (03CR) 10Ppchelko: [C: 03+2] Use EventRelayerNull for wikitech kafka purges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604469 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [18:57:47] (03Merged) 10jenkins-bot: Use EventRelayerNull for wikitech kafka purges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604469 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [18:59:13] (03PS4) 10Dave Pifke: Use PDO for XHGui storage if configured [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603546 (https://phabricator.wikimedia.org/T180761) [19:00:04] longma and liw: (Dis)respected human, time to deploy Mediawiki train - American+European Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200610T1900). Please do the needful. [19:01:09] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Use EventRelayerNull for wikitech, gerrit:604469 (duration: 01m 05s) [19:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:46] (03PS5) 10Dave Pifke: Use PDO for XHGui storage if configured [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603546 (https://phabricator.wikimedia.org/T180761) [19:03:41] (03CR) 10Krinkle: [C: 03+1] Use EventRelayerNull for wikitech kafka purges (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604469 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [19:04:05] Pchelolo: need a follow-up there as it is currently risky and might fail again in the future. core requires the key to exist. [19:05:37] (03CR) 10Krinkle: "Also add stubs for these in PrivateSettings.example.php." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/603546 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [19:05:56] Krinkle: after next train we are going to move wikitech to kafka and clean up all the configs [19:06:06] waiting for some code change to land [19:06:47] Pchelolo: ok, but I mean, if for some reason we need to disable this in the future, it should be safe to remove 'cdn-urls-purge' from the config. right now doing that would cause an exception, because it overrides the core config. [19:07:23] and e.g. we might introduce for a third-party some other relayer feature and it would work until we see in production [19:07:57] ok. will add 'EventRelayerNull' as default for everything [19:10:09] (03PS6) 10Dave Pifke: Use PDO for XHGui storage if configured [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603546 (https://phabricator.wikimedia.org/T180761) [19:14:14] (03CR) 10Krinkle: [WIP] webperf: Remove XHGui dependency on MongoDB (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [19:15:27] (03CR) 10Bstorm: "Looks good. If you are going to do the testing, Bryan, don't wait on me." 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/603668 (https://phabricator.wikimedia.org/T254640) (owner: 10BryanDavis) [19:26:16] (03PS3) 10Dave Pifke: [WIP] webperf: Remove XHGui dependency on MongoDB [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) [19:41:02] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [19:45:41] 10Operations, 10ops-codfw, 10decommission, 10serviceops: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (10Papaul) 05Open→03Resolved Complete [19:46:37] !log bouncing elasticsearch on logstash1011 [19:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:37] Oh, no, I take it back. Commons also has duplicate #mw-content-text nodes. It's now looking more like something broken in Vector in wmf.36. [19:49:36] (and if that's the case it potentially breaks lots of tools, gadgets, user scripts, etc. 
when it hits enwp tomorrow) [19:53:11] (03PS2) 10Ppchelko: Beta: Switch from HTCP purging to kafka purging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603530 (https://phabricator.wikimedia.org/T250781) [19:54:05] (03CR) 10Ppchelko: [C: 04-2] "Blocked until purged is ready in beta to listen to these" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603530 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [19:54:08] (03CR) 10jerkins-bot: [V: 04-1] Beta: Switch from HTCP purging to kafka purging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603530 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [19:55:00] 10Operations, 10ops-codfw, 10DC-Ops: (Need By:TBD) rack/setup/install icinga2002 - https://phabricator.wikimedia.org/T255070 (10RobH) [19:55:04] (03PS3) 10Ppchelko: Beta: Switch from HTCP purging to kafka purging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603530 (https://phabricator.wikimedia.org/T250781) [19:55:14] 10Operations, 10ops-codfw, 10DC-Ops: (Need By:TBD) rack/setup/install icinga2002 - https://phabricator.wikimedia.org/T255070 (10RobH) [19:56:51] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10Jclark-ctr) @elukey I am on site every tuesday and thursday. usually arrive at 9:00am est message me on irc to workout a schedule that... [19:56:54] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install icinga1002 - https://phabricator.wikimedia.org/T255072 (10RobH) [19:56:59] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install icinga1002 - https://phabricator.wikimedia.org/T255072 (10RobH) [19:57:11] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install icinga1002 - https://phabricator.wikimedia.org/T255072 (10RobH) [19:57:33] xover: i filed https://phabricator.wikimedia.org/T255073 [19:58:03] MatmaRex: Thank you! 
[20:00:04] halfak and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200610T2000). [20:03:25] (03PS1) 10Dave Pifke: Add passwords for labs XHGui database [labs/private] - 10https://gerrit.wikimedia.org/r/604498 (https://phabricator.wikimedia.org/T180761) [20:04:18] !log mbsantos@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [20:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:17] (03CR) 10Cwhite: [C: 03+1] icinga: include notification type in host alert email subjects [puppet] - 10https://gerrit.wikimedia.org/r/604448 (owner: 10Herron) [20:06:34] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [20:12:06] (03CR) 10Jbond: cas-icinga: Add an entry point for the external monitoring script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598742 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond) [20:15:16] !log mbsantos@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . 
[20:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:00] (03PS2) 10BryanDavis: Add passwords for Cloud VPS XHGui database [labs/private] - 10https://gerrit.wikimedia.org/r/604498 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [20:23:32] Trying to deploy mobileapps node service to k8s but helm apply is hanging and never completes [20:24:01] Looking at helm status, the pod has status `CrashLoopBackOff` [20:25:49] mateusbs17: hmm, maybe best to stop for now and try again when akosiaris is around [20:26:03] (which i'm guessing isn't the case right now, since it's rather late) [20:27:21] mateusbs17: I can try and help you determine the cause if that's helpful [20:27:39] longma: that would be awesome [20:28:22] It ultimately failed with: [20:28:26] https://www.irccloud.com/pastebin/btG33Ux1/ [20:29:08] I am going to try to deploy again. Did you use helmfile to deploy? [20:29:48] I was following the same steps we use for wikifeeds, but for mobileapps instead, see https://www.mediawiki.org/wiki/Wikifeeds/Deployment_Process [20:30:12] Yes, I'm using helmfile [20:30:51] thanks [20:33:20] !log jhuneidi@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . 
[20:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:57] I think it might be missing some cert config secrets, but I'm not sure if they are needed [20:42:09] longma: about config secrets https://phabricator.wikimedia.org/T225680#6210352 [20:43:04] (03PS2) 10MSantos: charts for push-notification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) [20:43:17] longma: it's missing the certificates for the TLS enabled endpoint [20:43:51] there was an error from the pod about NetworkPlugin cni failed to set up pod [20:44:20] let me see why but it's indeed the mobileapps-staging-tls-proxy container that has Reason: CrashLoopBackOff [20:44:29] the CNI error is transient, it can be ignored [20:45:05] ah okay [20:45:33] doesn't look like, I know :(. I am hoping our upgrade of calico to a newer version will fix that [20:45:48] !log mbsantos@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'proton' for release 'production' . [20:45:49] it says a container name must be specified for the pod [20:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:13] `Error from server (BadRequest): a container name must be specified for pod mobileapps-staging-7fcc678577-f6vk6, choose one of: [mobileapps-staging staging-metrics-exporter mobileapps-staging-tls-proxy]` [20:46:49] longma: that's probably a kubectl logs command [20:47:04] oh right [20:47:05] lol [20:47:08] and it needs to know which container to fetch the logs for [20:47:14] my bad [20:47:26] not used to having multiple containers [20:49:37] I 'll create certs for both proton and mobileapps right now [20:49:57] I 'll need 10mins mateusbs17, mdholloway [20:50:20] shall I go ahead and delete this deployment? 
[20:50:38] it should rollback on its own [20:50:48] akosiaris: sure no problem, I was just confirming the error was only happening in mobileapps [20:50:51] we now have atomic: true [20:50:58] awesome [21:03:23] (03CR) 10Volans: "reply inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598742 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond) [21:06:22] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [21:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:27] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [21:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:11] LAST DEPLOYED: Wed Jun 10 21:08:28 2020 [21:09:11] NAMESPACE: mobileapps [21:09:11] STATUS: DEPLOYED [21:09:20] looks like they are running now 👍 [21:09:21] mdholloway: mateusbs17 ^ [21:09:47] akosiaris: awesome, thanks! [21:09:51] \o/ [21:10:09] akosiaris: can I run proton deploy now? [21:10:16] mateusbs17: yes, please do :-) [21:11:09] !log mbsantos@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'proton' for release 'production' . [21:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:39] 26m Warning FailedCreate ReplicaSet Error creating: pods "chromium-render-production-595c74dc45-8gdqr" is forbidden: [maximum memory usage per Container is 2Gi, but limit is 3Gi., maximum cpu usage per Container is 4, but limit is 8., maximum cpu usage per Pod is 4, but limit is 8600m., maximum memory usage per Pod is 3Gi, but limit is 3850371072.] 
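Note on the FailedCreate warning above: the pod was rejected because each requested limit is larger than the corresponding namespace maximum. The comparison can be sketched as below, assuming only the quantity suffixes that appear in the message (`Gi`, `m`, and plain numbers, plus `Ki`/`Mi` for completeness). `parse_quantity` and `exceeded` are hypothetical helpers for illustration, not Kubernetes APIs, and the dictionary keys are labels invented here; the values are copied from the warning.

```python
from fractions import Fraction

# Subset of Kubernetes quantity suffixes; anything else is treated as a plain number.
_SUFFIXES = {
    "Ki": 2**10,
    "Mi": 2**20,
    "Gi": 2**30,
    "m": Fraction(1, 1000),  # milli-units, e.g. 8600m CPU = 8.6 cores
}


def parse_quantity(text):
    """Convert a quantity string such as '3Gi' or '8600m' to an exact number."""
    for suffix, factor in _SUFFIXES.items():
        if text.endswith(suffix):
            return Fraction(text[: -len(suffix)]) * factor
    return Fraction(text)


def exceeded(limits, maximums):
    """Return the names whose requested limit is above the allowed maximum."""
    return [
        name
        for name, value in limits.items()
        if parse_quantity(value) > parse_quantity(maximums[name])
    ]


# Values taken from the warning in the log.
MAXIMUMS = {
    "container memory": "2Gi",
    "container cpu": "4",
    "pod cpu": "4",
    "pod memory": "3Gi",
}
REQUESTED = {
    "container memory": "3Gi",
    "container cpu": "8",
    "pod cpu": "8600m",
    "pod memory": "3850371072",
}

for name in exceeded(REQUESTED, MAXIMUMS):
    print(f"{name}: limit {REQUESTED[name]} exceeds maximum {MAXIMUMS[name]}")
```

All four entries are reported, since 3Gi is 3221225472 bytes (below the 3850371072 requested per pod) and 8600m is 8.6 CPUs (above the maximum of 4). This matches the fix that follows in the log, which raises the pod/container limits rather than shrinking the chart's requests.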
[21:12:47] sigh, /me fixing [21:18:10] (03PS3) 10Alexandros Kosiaris: Bump chart versions for netpol bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/602706 [21:18:12] (03PS1) 10Alexandros Kosiaris: admin: Increase pod/container limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/604511 [21:18:45] (03PS2) 10Alexandros Kosiaris: admin: Increase pod/container limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/604511 [21:18:53] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] admin: Increase pod/container limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/604511 (owner: 10Alexandros Kosiaris) [21:23:11] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'proton' for release 'production' . [21:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:55] !log increase memory/cpu limits for proton [21:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:13] LAST DEPLOYED: Wed Jun 10 21:23:12 2020 [21:25:13] NAMESPACE: proton [21:25:13] STATUS: DEPLOYED [21:25:19] mateusbs17: ^ [21:25:33] akosiaris: thank you so much! [21:26:09] yw. let me know if you need anything else. The other 2 clusters (eqiad, codfw) have also been updated, you should be ok to deploy as well [21:32:13] (03PS1) 10Alexandros Kosiaris: Renumber 3 kubernetes etcd nodes [dns] - 10https://gerrit.wikimedia.org/r/604512 [21:32:15] (03PS1) 10Alexandros Kosiaris: Cleanup some old leftovers [dns] - 10https://gerrit.wikimedia.org/r/604513 [21:32:40] (03CR) 10jerkins-bot: [V: 04-1] Renumber 3 kubernetes etcd nodes [dns] - 10https://gerrit.wikimedia.org/r/604512 (owner: 10Alexandros Kosiaris) [21:32:42] (03CR) 10jerkins-bot: [V: 04-1] Cleanup some old leftovers [dns] - 10https://gerrit.wikimedia.org/r/604513 (owner: 10Alexandros Kosiaris) [21:34:01] (03CR) 10Volans: "The refactor looks good to me, thanks for addressing it. 
From a quick test the generated files are identical to those in production right " (7 comments) [puppet] - https://gerrit.wikimedia.org/r/549222 (owner: Cwhite) [21:34:33] (PS2) Alexandros Kosiaris: Renumber 3 kubernetes etcd nodes [dns] - https://gerrit.wikimedia.org/r/604512 [21:34:35] (PS2) Alexandros Kosiaris: Cleanup some old leftovers [dns] - https://gerrit.wikimedia.org/r/604513 [21:36:12] RECOVERY - Thanos compact has not run on icinga1001 is OK: (C)24 ge (W)12 ge 0.0269 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [21:37:23] Operations, SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (KFrancis) @MNovotny_WMF Do you have the fully executed internship agreement? If so, the fully executed inter... [21:38:43] (CR) Volans: [C: +1] "LGTM, minor docstring nits inline." (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/604411 (https://phabricator.wikimedia.org/T246890) (owner: Jbond) [21:53:25] (PS1) RhinosF1: Add NamespaceAliases for kowikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/604515 [22:03:15] (CR) Volans: "Did a first pass, few comments inline. The generic structure is ok and is well commented to follow the logic." 
(15 comments) [software/homer/deploy] - https://gerrit.wikimedia.org/r/589406 (https://phabricator.wikimedia.org/T250429) (owner: Ayounsi) [22:06:33] Operations, SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (AndrewKuznetsov) [22:06:42] (PS2) RhinosF1: Add NamespaceAliases for kowikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/604515 (https://phabricator.wikimedia.org/T255031) [22:08:11] Operations, SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (AndrewKuznetsov) [22:09:14] Operations, SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (AndrewKuznetsov) I've updated the task description with the newer template that was added. The only thing not... 
[22:09:34] Operations, SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (AndrewKuznetsov) [22:10:35] (PS2) BryanDavis: Remove validation of Kubernetes self-signed API cert [software/tools-webservice] - https://gerrit.wikimedia.org/r/598109 (https://phabricator.wikimedia.org/T253412) [22:24:28] (PS1) Jdlrobson: Don't modify OutputPage value [core] (wmf/1.35.0-wmf.36) - https://gerrit.wikimedia.org/r/604520 (https://phabricator.wikimedia.org/T255073) [22:34:39] (PS7) Cwhite: puppetmaster,icinga: naggen2 cleanup and update to python3 [puppet] - https://gerrit.wikimedia.org/r/549222 [22:34:50] jouncebot: now [22:34:50] No deployments scheduled for the next 0 hour(s) and 25 minute(s) [22:34:53] jouncebot: next [22:34:54] In 0 hour(s) and 25 minute(s): Evening backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200610T2300) [22:34:58] James_F: o/ [22:35:13] (PS1) Addshore: FP: Improve EntityLinkTargetEntityIdLookup exception message [extensions/Wikibase] (wmf/1.35.0-wmf.36) - https://gerrit.wikimedia.org/r/604524 (https://phabricator.wikimedia.org/T255078) [22:38:49] If anyone fancies backporting that one to make debugging a train blocker easier in the morning that would be grand! bu I'm off the bed now! 
[22:38:52] (PS8) Cwhite: puppetmaster,icinga: naggen2 cleanup and update to python3 [puppet] - https://gerrit.wikimedia.org/r/549222 [22:39:25] (CR) jerkins-bot: [V: -1] puppetmaster,icinga: naggen2 cleanup and update to python3 [puppet] - https://gerrit.wikimedia.org/r/549222 (owner: Cwhite) [22:40:56] (CR) BryanDavis: [C: -2] Disable tool name alias in lighttpd config with --canonical [software/tools-webservice] - https://gerrit.wikimedia.org/r/603668 (https://phabricator.wikimedia.org/T254640) (owner: BryanDavis) [22:43:53] (PS9) Cwhite: puppetmaster,icinga: naggen2 cleanup and update to python3 [puppet] - https://gerrit.wikimedia.org/r/549222 [22:44:51] (CR) DannyS712: [C: +1] "LGTM" [extensions/Wikibase] (wmf/1.35.0-wmf.36) - https://gerrit.wikimedia.org/r/604524 (https://phabricator.wikimedia.org/T255078) (owner: Addshore) [22:46:35] (CR) Cwhite: puppetmaster,icinga: naggen2 cleanup and update to python3 (7 comments) [puppet] - https://gerrit.wikimedia.org/r/549222 (owner: Cwhite) [22:51:39] (PS3) BryanDavis: Disable tool name alias in lighttpd config with --canonical [software/tools-webservice] - https://gerrit.wikimedia.org/r/603668 (https://phabricator.wikimedia.org/T254640) [22:53:24] (PS4) BryanDavis: Disable tool name alias in lighttpd config with --canonical [software/tools-webservice] - https://gerrit.wikimedia.org/r/603668 (https://phabricator.wikimedia.org/T254640) [22:54:08] (CR) BryanDavis: Disable tool name alias in lighttpd config with --canonical [software/tools-webservice] - https://gerrit.wikimedia.org/r/603668 (https://phabricator.wikimedia.org/T254640) (owner: BryanDavis) [22:56:41] (CR) Bstorm: "At least one puppetmaster that does not have puppetdb passes the compiler with this: https://puppet-compiler.wmflabs.org/compiler1001/2315" [puppet] - https://gerrit.wikimedia.org/r/549222 (owner: Cwhite) [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why 
did functions stop calling each other? A:They had arguments. Rise for Evening backport window(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200610T2300). [23:00:05] Jdlrobson: A patch you scheduled for Evening backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:01:38] o/ [23:01:58] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:02:18] (CR) Bstorm: "> Patch Set 9:" [puppet] - https://gerrit.wikimedia.org/r/549222 (owner: Cwhite) [23:03:46] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:04:21] RoanKattouw: around? [23:04:26] Yes sorry [23:06:21] (CR) BryanDavis: "Tested by manually calling webservice-runner. When it gets --canonical then the alias is commented out. Testing "live" in toolsbeta is tri" [software/tools-webservice] - https://gerrit.wikimedia.org/r/603668 (https://phabricator.wikimedia.org/T254640) (owner: BryanDavis) [23:06:42] (CR) Catrope: [C: +2] Don't modify OutputPage value [core] (wmf/1.35.0-wmf.36) - https://gerrit.wikimedia.org/r/604520 (https://phabricator.wikimedia.org/T255073) (owner: Jdlrobson) [23:12:01] RoanKattouw: thanks [23:16:38] (CR) Bstorm: [C: +1] "naggen2 *does* get installed to cloud servers, and this will replace requests with the python3 version on them, but that seems ok." 
[puppet] - https://gerrit.wikimedia.org/r/549222 (owner: Cwhite) [23:19:36] (CR) Bstorm: [C: +2] cloudstore: disable systemd monitoring [puppet] - https://gerrit.wikimedia.org/r/604445 (owner: Bstorm) [23:22:49] (Merged) jenkins-bot: Don't modify OutputPage value [core] (wmf/1.35.0-wmf.36) - https://gerrit.wikimedia.org/r/604520 (https://phabricator.wikimedia.org/T255073) (owner: Jdlrobson) [23:31:27] RoanKattouw: debug1 or 2? [23:32:35] Oh sorry, got distracted [23:41:49] .. still distracted? :P [23:41:57] (CR) Bstorm: "My only question about this is re: the commit msg. It says it makes all subclasses aware of the canonical arg, but it seems to leave out j" [software/tools-webservice] - https://gerrit.wikimedia.org/r/603668 (https://phabricator.wikimedia.org/T254640) (owner: BryanDavis) [23:48:09] gulp [23:49:09] (CR) Bstorm: [C: +2] toolsdb: add more temporary filters for replication [puppet] - https://gerrit.wikimedia.org/r/602433 (https://phabricator.wikimedia.org/T253738) (owner: BryanDavis) [23:50:04] (CR) Bstorm: [C: +1] Remove validation of Kubernetes self-signed API cert [software/tools-webservice] - https://gerrit.wikimedia.org/r/598109 (https://phabricator.wikimedia.org/T253412) (owner: BryanDavis) [23:53:43] RoanKattouw: around? [23:53:48] i need to go soon [23:53:59] Jdlrobson: Ugh sorry. It's on mwdebug1002 but I forgot to tell you [23:54:23] works after a purge [23:54:26] cool sync away [23:55:35] So sorry about that Jdlrobson , CI takes forever on these patches and doesn't notify me when it's done [23:55:39] syncing now [23:55:46] no worries RoanKattouw completely understand [23:55:54] !log catrope@deploy1001 Synchronized php-1.35.0-wmf.36/includes/skins/SkinTemplate.php: T255073 (duration: 01m 07s) [23:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:58] T255073: Duplicate
in Vector HTML output - https://phabricator.wikimedia.org/T255073 [23:58:35] thanks RoanKattouw !