[00:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200124T0000).
[00:00:05] AndyRussG and arlolra: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:08:22] (CR) Paladox: "recheck" [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/562363 (owner: Paladox)
[00:11:35] (CR) Paladox: "recheck" [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/562363 (owner: Paladox)
[00:11:39] (CR) Paladox: "recheck" [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/566890 (owner: Paladox)
[00:15:14] I await my sticker
[00:15:14] o/
[00:15:17] sorry late
[00:15:32] RoanKattouw: Niharika: Urbanecm: anyone swatting?
[00:15:36] (PS1) Dzahn: ssl: update TLS cert for etherpad, added etherpad1002 [puppet] - https://gerrit.wikimedia.org/r/566899 (https://phabricator.wikimedia.org/T224580)
[00:17:05] (CR) Paladox: "recheck" [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/566890 (owner: Paladox)
[00:18:01] arlolra: did i miss the fun?
[00:18:28] nope
[00:18:32] (i feel like i should have a collection of stickers, but I'm not sure i want to claim them.)
[00:18:53] arlolra: no one is SWATting?
[00:19:10] (i mean, breaking the wikis just a little bit for a little while for some users should count for a little sticker, right?)
[00:19:21] no one has answered your call, no
[00:19:36] catrope, Niharika, Urbanecm are listed on the Deployments page
[00:19:39] temporary tattoo
[00:19:48] i'm just repeating what AndyRussG said of course
[00:19:55] but maybe if we say their names three times they'll appear
[00:20:02] heheh
[00:20:21] I do have deploy rights on the cluster, and it'd be nice to push out the CN thing this evening
[00:20:34] does jouncebot saying their names count? i never know how the occult interacts with AI
[00:20:41] however it's been a while and I'm not sure I feel comfortable deploying other stuff
[00:21:32] brennen twentyafterfour did the train roll today?
[00:22:42] AndyRussG: it was reverted
[00:22:54] because after that deploy CPU usage went up quite a bit on all appservers
[00:23:05] or that's the last i saw
[00:23:16] i pinged RoanKattouw on gchat
[00:23:34] So sorry for missing pings, I broke a glass in the kitchen and was very distracted for a while
[00:23:40] mutante: ah hmmmm
[00:23:45] I'll be at my computer and ready to run the SWAT in a few minutes
[00:23:46] i've settled my question!
[00:23:48] RoanKattouw: oh hey no worries that can have tragic consequences
[00:23:56] chanting someone's name three times works, but jouncebot doesn't count.
[00:24:16] i guess that definitively proves jouncebot doesn't have a soul?
[00:24:16] I'm terrified of breaking a glass or ceramic thing with food in it while my dog is in the kitchen
[00:24:23] !rain_dance is http://catb.org/jargon/html/R/rain-dance.html
[00:24:23] Key was added
[00:24:44] RoanKattouw: thanks much!! actually I was super late too
[00:25:15] i was also 17 minutes late. i think arlolra was on time, though. maybe he could get a sticker.
[00:25:26] hopefully not an "i'm about the break the wikis" sticker
[00:25:31] *about to
[00:25:53] i'm about to break the wikis but it's ok because cscott is here to clean it up?
[00:27:02] enwiki is indeed on wmf.15 still
[00:27:13] do we each get half of the sticker then?
[00:27:57] one can get the scratch half and the other can get the sniff half (scratch 'n' sniff stickers)
[00:28:29] (PS3) Catrope: Bump Parsoid/PHP cluster memory_limit again [mediawiki-config] - https://gerrit.wikimedia.org/r/564805 (https://phabricator.wikimedia.org/T239806) (owner: Arlolra)
[00:28:40] (CR) Catrope: [C: +2] Bump Parsoid/PHP cluster memory_limit again [mediawiki-config] - https://gerrit.wikimedia.org/r/564805 (https://phabricator.wikimedia.org/T239806) (owner: Arlolra)
[00:28:50] Ah thanks RoanKattouw... I was about to ask if I should make some branch changes in Gerrit
[00:29:00] (CR) Dzahn: [C: +2] ssl: update TLS cert for etherpad, added etherpad1002 [puppet] - https://gerrit.wikimedia.org/r/566899 (https://phabricator.wikimedia.org/T224580) (owner: Dzahn)
[00:29:11] (PS2) Dzahn: ssl: update TLS cert for etherpad, added etherpad1002 [puppet] - https://gerrit.wikimedia.org/r/566899 (https://phabricator.wikimedia.org/T224580)
[00:29:23] RoanKattouw: sadly I think those are gonna fail CI
[00:29:26] that was my other question
[00:29:34] Oh hm
[00:29:43] the patch referenced on the Deployments page is just the functional change we want to push out
[00:29:45] It's a simple JS change so I can just force-merge it
[00:29:53] if you're good with that, yes!
[00:30:03] there's just a slew of other CI related changes that were all cherry-picked
[00:30:03] Yeah it's failing
[00:30:07] (Merged) jenkins-bot: Bump Parsoid/PHP cluster memory_limit again [mediawiki-config] - https://gerrit.wikimedia.org/r/564805 (https://phabricator.wikimedia.org/T239806) (owner: Arlolra)
[00:31:13] PROBLEM - Check systemd state on cp4032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:14] the CI and CI-fixing changes were added precisely so that CI wouldn't break for the SWAT change, but yes the one you have there is just a wee JS one
[00:31:33] RoanKattouw: also we need wmf.15 pls, since, as I just learned, the train was reverted
[00:31:42] OK will do
[00:31:53] thx!!!!
[00:32:15] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Bump Parsoid/PHP cluster memory_limit again (T239806, T236833) (duration: 01m 05s)
[00:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:20] T239806: Parsoid/PHP errors - https://phabricator.wikimedia.org/T239806
[00:32:20] T236833: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833
[00:33:08] so... how do we check that the memory_limit is actually higher now
[00:33:23] ;-p
[00:33:35] !log cp4032 - starting varnishmtail.service which was failed
[00:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:01] RECOVERY - Check systemd state on cp4032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:48] !log catrope@deploy1001 Synchronized php-1.35.0-wmf.15/extensions/CentralNotice/resources/ext.centralNotice.display/hide.js: T240802 (duration: 01m 07s)
[00:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:51] T240802: Campaign fallen back to gets incorrect status code and may be incorrectly hidden - https://phabricator.wikimedia.org/T240802
[00:35:53] !log catrope@deploy1001 Synchronized php-1.35.0-wmf.16/extensions/CentralNotice/resources/ext.centralNotice.display/hide.js: T240802 (duration: 01m 05s)
[00:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:09] cscott: after removing the filtering I see 1468006400 instead of 1073741824
[00:36:21] RoanKattouw: thanks
[00:36:55] oh,
the OOM crash shows total memory at crash time. good, that's convenient.
[00:37:03] so mission accomplished i guess. :)
[00:38:32] RoanKattouw: thanks!
[00:38:44] no death or destruction so far
[00:40:32] AndyRussG: so it didn't work?
[00:41:29] arlolra: it did work
[00:41:43] I mean, the new code is out and nothing appears to have died
[00:41:43] and yes, no death or destruction
[00:41:54] s/yes/yet/
[00:41:56] arlolra: I mean, the new CentralNotice code
[00:41:58] :)
[00:42:00] I don't know about anything else
[00:42:41] sorry, I'm not being helpful, I was suggesting death and destruction were the positive outcome ... my humour failed
[00:43:23] arlolra: hehe I got it, all good
[00:43:25] :)
[00:43:50] (PS3) Dzahn: ssl: update TLS cert for etherpad, add etherpad1002, etherpad-new [puppet] - https://gerrit.wikimedia.org/r/566899 (https://phabricator.wikimedia.org/T224580)
[00:43:50] :D
[00:45:07] PROBLEM - Check systemd state on cp4032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:45:30] RoanKattouw: thanks again! hope the remaining glasses stay intact eh
[00:45:39] haha thanks
[00:46:13] It was a glass in the office kitchen, so on the plus side, it's not my glass; on the downside, the glass on the floor affects everyone
[00:46:22] (It was vacuumed up but then Moriel found more pieces)
[00:46:50] !log cp4032 - starting varnishmtail.service
[00:46:51] RECOVERY - Check systemd state on cp4032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:42] RoanKattouw it is your glass... it's everyone's glass......
[00:47:51] (CR) Dzahn: [C: +2] ssl: update TLS cert for etherpad, add etherpad1002, etherpad-new [puppet] - https://gerrit.wikimedia.org/r/566899 (https://phabricator.wikimedia.org/T224580) (owner: Dzahn)
[00:47:52] jk 8p
[00:48:00] (PS4) Dzahn: ssl: update TLS cert for etherpad, add etherpad1002, etherpad-new [puppet] - https://gerrit.wikimedia.org/r/566899 (https://phabricator.wikimedia.org/T224580)
[00:52:41] (PS1) Dzahn: add etherpad.discovery.wmnet, point to etherpad1001 [dns] - https://gerrit.wikimedia.org/r/566905
[00:52:43] (PS1) Dzahn: switch discovery record for etherpad from 1001 to 1002 [dns] - https://gerrit.wikimedia.org/r/566906 (https://phabricator.wikimedia.org/T224580)
[00:53:46] (CR) Dzahn: "I noticed the envoy cert has the discovery.wmnet name in it but we are not using it. Unless it was removed again for some reason? Adding i" [dns] - https://gerrit.wikimedia.org/r/566905 (owner: Dzahn)
[00:55:49] (PS3) Dzahn: trafficserver/cache: add etherpad-new -> etherpad1002 [puppet] - https://gerrit.wikimedia.org/r/566636 (https://phabricator.wikimedia.org/T224580)
[01:06:48] (PS1) Dzahn: trafficserver/varnish: use discovery record for etherpad, drop etherpad1001 director [puppet] - https://gerrit.wikimedia.org/r/566907
[01:08:04] (CR) jerkins-bot: [V: -1] trafficserver/varnish: use discovery record for etherpad, drop etherpad1001 director [puppet] - https://gerrit.wikimedia.org/r/566907 (owner: Dzahn)
[01:09:47] (PS2) Dzahn: trafficserver/varnish: use discovery record for etherpad, drop etherpad1001 director [puppet] - https://gerrit.wikimedia.org/r/566907
[01:10:43] (CR) jerkins-bot: [V: -1] trafficserver/varnish: use discovery record for etherpad, drop etherpad1001 director [puppet] - https://gerrit.wikimedia.org/r/566907 (owner: Dzahn)
[01:15:04] (PS3) Dzahn: trafficserver/varnish: use discovery for etherpad, drop etherpad1001 director [puppet] -
https://gerrit.wikimedia.org/r/566907
[01:17:48] (CR) Dzahn: [C: +2] trafficserver/cache: add etherpad-new -> etherpad1002 [puppet] - https://gerrit.wikimedia.org/r/566636 (https://phabricator.wikimedia.org/T224580) (owner: Dzahn)
[01:24:57] !log running puppet on cp-text_ulsfo
[01:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:26:38] Operations, Wikimedia-Etherpad, serviceops, Patch-For-Review: Migrate etherpad1001 to Buster - https://phabricator.wikimedia.org/T224580 (Dzahn) @akosiaris @Muehlenhoff Here it is on buster as "etherpad-new". https://etherpad-new.wikimedia.org/p/aXjrQTK8PD6bjj9TqK4Q
[01:27:40] Operations, Wikimedia-Etherpad, serviceops, Patch-For-Review: Migrate etherpad1001 to Buster - https://phabricator.wikimedia.org/T224580 (Dzahn) a:Dzahn
[01:42:27] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 43 probes of 517 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:45:57] (CR) Paladox: "recheck" [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/566890 (owner: Paladox)
[01:48:15] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 29 probes of 517 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:54:46] (CR) Paladox: "recheck" [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/566890 (owner: Paladox)
[01:58:13] (PS1) Paladox: Gerrit: Upgrade to python3 (bazel requires it) [software/gerrit] - https://gerrit.wikimedia.org/r/566914
[01:58:24] (Abandoned) Paladox: Gerrit: Upgrade to python3 (bazel requires it) [software/gerrit] - https://gerrit.wikimedia.org/r/566914 (owner:
Paladox)
[02:08:41] PROBLEM - Check systemd state on cp4032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:14:07] RECOVERY - Check systemd state on cp4032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:02:07] Operations, ops-codfw: codfw: rack/setup/install wdqs200[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (Papaul)
[03:20:52] (PS1) Papaul: DNS: Add mgmt and production DNS for wdqs200[7-8] [dns] - https://gerrit.wikimedia.org/r/566920
[04:07:42] (CR) BryanDavis: [C: +2] Switch to pytest [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/561793 (owner: Legoktm)
[04:07:48] (CR) BryanDavis: [C: +2] Simplify tox configuration by using tox-wikimedia [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/561794 (owner: Legoktm)
[04:12:18] (Merged) jenkins-bot: Switch to pytest [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/561793 (owner: Legoktm)
[04:12:20] (Merged) jenkins-bot: Simplify tox configuration by using tox-wikimedia [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/561794 (owner: Legoktm)
[04:20:54] (CR) BryanDavis: [C: +2] Rewrite webservice-python-bootstrap in Python [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/561795 (owner: Legoktm)
[04:20:59] (CR) BryanDavis: [C: +2] Add --fresh to webservice-python-bootstrap [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/561802 (owner: Legoktm)
[04:21:11] PROBLEM - Check systemd state on cp4032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:21:25] (Merged) jenkins-bot: Rewrite webservice-python-bootstrap in Python [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/561795 (owner: Legoktm)
[04:21:31] (Merged) jenkins-bot: Add --fresh to webservice-python-bootstrap [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/561802 (owner: Legoktm)
[04:32:28] (PS1) Ammarpad: Add assigment of 'mover' group to bureaucrats on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/566925 (https://phabricator.wikimedia.org/T243503)
[04:42:59] RECOVERY - Check systemd state on cp4032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:45:42] (CR) BryanDavis: [C: +1] "Untested, but the code reads well." [software/tools-webservice] - https://gerrit.wikimedia.org/r/565575 (https://phabricator.wikimedia.org/T234617) (owner: Arturo Borrero Gonzalez)
[05:04:31] (PS7) BryanDavis: Report error messages on stderr [software/tools-webservice] - https://gerrit.wikimedia.org/r/496565
[05:04:33] (PS7) BryanDavis: Remove lighttpd-precise handling [software/tools-webservice] - https://gerrit.wikimedia.org/r/496566
[05:04:35] (PS7) BryanDavis: Improve support for extra_args [software/tools-webservice] - https://gerrit.wikimedia.org/r/496567
[05:04:37] (PS6) BryanDavis: Rename internal "toollabs" package to "toolforge" [software/tools-webservice] - https://gerrit.wikimedia.org/r/563605
[05:04:39] (PS3) BryanDavis: Deprecate Jessie based Kubernetes types [software/tools-webservice] - https://gerrit.wikimedia.org/r/565807
[05:04:41] (PS8) BryanDavis: kubernetes: Set php7.3 as the default type [software/tools-webservice] - https://gerrit.wikimedia.org/r/496564
[05:04:43] (PS14) BryanDavis: Make Kubernetes the default backend and warn when guessing [software/tools-webservice] -
https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) (owner: Nehajha)
[05:05:55] (CR) BryanDavis: [C: -2] "Hold on this until we are done with https://wikitech.wikimedia.org/wiki/News/2020_Kubernetes_cluster_migration" [software/tools-webservice] - https://gerrit.wikimedia.org/r/496564 (owner: BryanDavis)
[05:06:29] (CR) BryanDavis: [C: -2] "Hold until we are done with https://wikitech.wikimedia.org/wiki/News/2020_Kubernetes_cluster_migration" [software/tools-webservice] - https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) (owner: Nehajha)
[05:26:19] PROBLEM - Host wdqs1005 is DOWN: PING CRITICAL - Packet loss = 100%
[05:27:07] RECOVERY - Host wdqs1005 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[06:11:39] Operations, ops-codfw, DBA: db2085 crashed - memory issues - https://phabricator.wikimedia.org/T243148 (Marostegui) For the record: @Papaul replaced DIMM A3 with a new one sent by Dell
[06:12:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2085 after memory replacement T243148', diff saved to https://phabricator.wikimedia.org/P10256 and previous config saved to /var/cache/conftool/dbconfig/20200124-061228-marostegui.json
[06:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:34] T243148: db2085 crashed - memory issues - https://phabricator.wikimedia.org/T243148
[06:12:48] Operations, ops-codfw, DBA: db2085 crashed - memory issues - https://phabricator.wikimedia.org/T243148 (Marostegui) Open→Resolved a:Marostegui→Papaul
[06:37:23] !log Stop replication on db1107
[06:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:25] PROBLEM - Check systemd state on cp4032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:39:32] (PS1) Marostegui: install_server: Do not reimage es4/es5 codfw hosts [puppet] - https://gerrit.wikimedia.org/r/566930 (https://phabricator.wikimedia.org/T243052)
[06:42:21] Operations, Wikimedia-Mailing-lists: Forget Password for Usjp-wikiclub mailing list - https://phabricator.wikimedia.org/T243578 (Galib)
[06:43:28] (CR) Marostegui: [C: +2] install_server: Do not reimage es4/es5 codfw hosts [puppet] - https://gerrit.wikimedia.org/r/566930 (https://phabricator.wikimedia.org/T243052) (owner: Marostegui)
[06:44:51] RECOVERY - Check systemd state on cp4032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:10:13] PROBLEM - Check systemd state on cp4032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:13:13] Operations, Wikimedia-Mailing-lists: Reset password for Usjp-wikiclub mailing list - https://phabricator.wikimedia.org/T243578 (Ammarpad)
[07:13:51] RECOVERY - Check systemd state on cp4032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:19:35] <_joe_> !log force run puppet on all esams cache nodes, for mitigation of T243313
[07:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:39] T243313: Fatal WMFTimeoutException for ApiComparePages requests - https://phabricator.wikimedia.org/T243313
[07:24:47] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:26:39] PROBLEM - Check systemd state on cp4032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:29:59] (PS1) Marostegui: db1097: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/566992 (https://phabricator.wikimedia.org/T239453)
[07:30:50] (CR) Marostegui: [C: +2] db1097: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/566992 (https://phabricator.wikimedia.org/T239453) (owner: Marostegui)
[07:44:11] (CR) Alexandros Kosiaris: [C: +2] build: Run commit-message-validator under Python 3 [puppet] - https://gerrit.wikimedia.org/r/563116 (owner: Legoktm)
[07:44:44] RECOVERY - Check systemd state on cp4032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:49:23] Operations, Wikimedia-Etherpad, serviceops, Patch-For-Review: Migrate etherpad1001 to Buster - https://phabricator.wikimedia.org/T224580 (akosiaris) >>! In T224580#5828562, @Dzahn wrote: > @akosiaris @Muehlenhoff Here it is on buster as "etherpad-new". https://etherpad-new.wikimedia.org/p/aXjr...
[08:01:55] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[08:17:09] !log installing python-apt security updates
[08:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:41] (CR) Ema: [C: +1] Replace /var/run with /run [debs/trafficserver] - https://gerrit.wikimedia.org/r/566814 (https://phabricator.wikimedia.org/T242093) (owner: Vgutierrez)
[08:42:24] !log Remove wikiadmin2 user from pc2XXX codfw hosts T243512
[08:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:28] T243512: Clean up wikiadmin2 user from core hosts - https://phabricator.wikimedia.org/T243512
[08:51:32] (PS3) Joal: Update labstore mediawiki-history readme file [puppet] - https://gerrit.wikimedia.org/r/566822 (https://phabricator.wikimedia.org/T243426)
[08:51:34] (CR) Joal: Update labstore mediawiki-history readme file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/566822 (https://phabricator.wikimedia.org/T243426) (owner: Joal)
[09:01:30] (PS2) Alexandros Kosiaris: admin: DRY environments by using a common one [deployment-charts] - https://gerrit.wikimedia.org/r/566816
[09:01:32] (PS1) Alexandros Kosiaris: rbac: Move under common/ directory [deployment-charts] - https://gerrit.wikimedia.org/r/566998
[09:01:34] (PS1) Alexandros Kosiaris: admin: DRY podsecuritypolicies [deployment-charts] - https://gerrit.wikimedia.org/r/566999
[09:01:36] (PS1) Alexandros Kosiaris: admin: Align staging symlink with production clusters [deployment-charts] - https://gerrit.wikimedia.org/r/567000
[09:01:38] (PS1) Alexandros Kosiaris: admin: get rid of apply-calico-policy.sh [deployment-charts] - https://gerrit.wikimedia.org/r/567001
[09:01:40] (PS1) Alexandros Kosiaris: admin: Realign calico policies between clusters [deployment-charts] - https://gerrit.wikimedia.org/r/567002
[09:03:33] RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No
changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[09:13:19] (CR) Arturo Borrero Gonzalez: [C: +2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/20557/" [puppet] - https://gerrit.wikimedia.org/r/566735 (https://phabricator.wikimedia.org/T242607) (owner: Arturo Borrero Gonzalez)
[09:15:45] (CR) Arturo Borrero Gonzalez: [C: +2] "PCC https://puppet-compiler.wmflabs.org/compiler1003/20558/" [puppet] - https://gerrit.wikimedia.org/r/566736 (https://phabricator.wikimedia.org/T242607) (owner: Arturo Borrero Gonzalez)
[09:20:15] (PS7) Arturo Borrero Gonzalez: kubernetes: add support for domain-based routing in the new kubernetes cluster [software/tools-webservice] - https://gerrit.wikimedia.org/r/565575 (https://phabricator.wikimedia.org/T234617)
[09:22:06] (CR) Arturo Borrero Gonzalez: [C: +2] kubernetes: add support for domain-based routing in the new kubernetes cluster [software/tools-webservice] - https://gerrit.wikimedia.org/r/565575 (https://phabricator.wikimedia.org/T234617) (owner: Arturo Borrero Gonzalez)
[09:23:13] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: refactor elastic role/profile into modern layout [puppet] - https://gerrit.wikimedia.org/r/566704 (https://phabricator.wikimedia.org/T236606) (owner: Arturo Borrero Gonzalez)
[09:25:44] (CR) Alexandros Kosiaris: [C: -2] "etherpad can't really support >1 instances running. See https://github.com/ether/etherpad-lite/issues/2826 for more information.
Adding a " [dns] - https://gerrit.wikimedia.org/r/566905 (owner: Dzahn)
[09:26:15] (CR) Alexandros Kosiaris: [C: -2] "Same comment as for https://gerrit.wikimedia.org/r/#/c/operations/dns/+/566905/1" [dns] - https://gerrit.wikimedia.org/r/566906 (https://phabricator.wikimedia.org/T224580) (owner: Dzahn)
[09:26:36] (PS2) Alexandros Kosiaris: remove etherpad-new.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/566638 (https://phabricator.wikimedia.org/T224580) (owner: Dzahn)
[09:26:40] (CR) Alexandros Kosiaris: [C: +2] remove etherpad-new.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/566638 (https://phabricator.wikimedia.org/T224580) (owner: Dzahn)
[09:27:17] (CR) Arturo Borrero Gonzalez: [C: +1] Rename internal "toollabs" package to "toolforge" [software/tools-webservice] - https://gerrit.wikimedia.org/r/563605 (owner: BryanDavis)
[09:27:19] (CR) Alexandros Kosiaris: [C: +2] "Removing the record to avoid any unwanted corruption issues as pointed out in https://phabricator.wikimedia.org/T224580#5828883" [dns] - https://gerrit.wikimedia.org/r/566638 (https://phabricator.wikimedia.org/T224580) (owner: Dzahn)
[09:29:43] !log disable and mask etherpad-lite on etherpad1002 to avoid corruption issues. T224580
[09:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:47] T224580: Migrate etherpad1001 to Buster - https://phabricator.wikimedia.org/T224580
[09:31:46] Operations, Wikimedia-Etherpad, serviceops, Patch-For-Review: Migrate etherpad1001 to Buster - https://phabricator.wikimedia.org/T224580 (akosiaris) I 've removed the DNS and stopped and masked the service for now on etherpad1002. Since we proved it works, let's just move over to etherpad1002.eqi...
[09:33:07] PROBLEM - etherpad_up reduced availability on icinga1001 is CRITICAL: 0.5 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:33:27] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1002 is CRITICAL: connect to address 10.64.32.178 and port 9001: Connection refused https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
[09:33:35] PROBLEM - etherpad_lite_process_running on etherpad1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
[09:38:16] (CR) Alexandros Kosiaris: [C: +2] trafficserver/cache: switch backend for etherpad to etherpad1002 [puppet] - https://gerrit.wikimedia.org/r/566637 (https://phabricator.wikimedia.org/T224580) (owner: Dzahn)
[09:38:21] (PS2) Alexandros Kosiaris: trafficserver/cache: switch backend for etherpad to etherpad1002 [puppet] - https://gerrit.wikimedia.org/r/566637 (https://phabricator.wikimedia.org/T224580) (owner: Dzahn)
[09:39:53] PROBLEM - Check systemd state on cp4032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:40:44] Operations, Wikimedia-Etherpad, serviceops, Patch-For-Review: Migrate etherpad1001 to Buster - https://phabricator.wikimedia.org/T224580 (akosiaris) Pad that per logs have been accessed on https://etherpad-new.wikimedia.org ` 90D1o-quuUNWqCrt0CIV WMCS-2019-06-25 WMCS-2020-01-22 WMCS-2020-02-04 W...
[09:43:31] RECOVERY - Check systemd state on cp4032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:49:41] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1002 is OK: HTTP OK: HTTP/1.1 200 OK - 8963 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
[09:49:51] RECOVERY - etherpad_lite_process_running on etherpad1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
[09:50:11] (CR) Alexandros Kosiaris: [C: +2] site: remove etherpad1001 [puppet] - https://gerrit.wikimedia.org/r/566635 (https://phabricator.wikimedia.org/T224580) (owner: Dzahn)
[09:50:19] (PS2) Alexandros Kosiaris: site: remove etherpad1001 [puppet] - https://gerrit.wikimedia.org/r/566635 (https://phabricator.wikimedia.org/T224580) (owner: Dzahn)
[09:51:13] RECOVERY - etherpad_up reduced availability on icinga1001 is OK: (C)0.8 le (W)0.9 le 1 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:58:52] Operations, Wikimedia-Etherpad, serviceops, Patch-For-Review: Migrate etherpad1001 to Buster - https://phabricator.wikimedia.org/T224580 (akosiaris) @Dzahn, I 've merged the required remaining changes to get the migration done. Now etherpad.wikimedia.org uses etherpad1002. Checked a couple of pad...
[09:59:15] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: connect to address 10.64.32.177 and port 9001: Connection refused https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
[10:00:17] PROBLEM - etherpad_up reduced availability on icinga1001 is CRITICAL: 0.5 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:00:23] PROBLEM - etherpad_lite_process_running on etherpad1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
[10:02:19] (CR) Alexandros Kosiaris: "recheck" [puppet] - https://gerrit.wikimedia.org/r/566858 (owner: Ori.livneh)
[10:04:39] (CR) Alexandros Kosiaris: [C: +2] Update flamegraph.pl to brendangregg/Flamegraph@1a0dc6985a [puppet] - https://gerrit.wikimedia.org/r/566858 (owner: Ori.livneh)
[10:07:29] RECOVERY - etherpad_up reduced availability on icinga1001 is OK: (C)0.8 le (W)0.9 le 1 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:10:53] (CR) Alexandros Kosiaris: [C: +2] admin: DRY environments by using a common one [deployment-charts] - https://gerrit.wikimedia.org/r/566816 (owner: Alexandros Kosiaris)
[10:13:20] (PS2) Vgutierrez: Release 8.0.5-1wm14 [debs/trafficserver] - https://gerrit.wikimedia.org/r/566814 (https://phabricator.wikimedia.org/T242093)
[10:13:31] (CR) jerkins-bot: [V: -1] Release 8.0.5-1wm14 [debs/trafficserver] - https://gerrit.wikimedia.org/r/566814 (https://phabricator.wikimedia.org/T242093) (owner: Vgutierrez)
[11:04:45] !log restart php-fpm on mw1238-mw1239
[11:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:34] !log purged stale
grafana package from grafana1001, caused systemd unit failure [11:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:25] (03PS1) 10Vgutierrez: ATS: Add missing PIDFile for non-default instances [puppet] - 10https://gerrit.wikimedia.org/r/567009 [11:35:23] 10Operations, 10Citoid: Request took down both zotero and citoid (exceeding memory) - https://phabricator.wikimedia.org/T243444 (10Mvolz) [11:37:36] PROBLEM - Check systemd state on cp4032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:02] uh [11:43:04] * vgutierrez checking [11:44:48] RECOVERY - Check systemd state on cp4032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:03] duh [11:48:36] 10Operations, 10Traffic: varnishmtail panics on buster - https://phabricator.wikimedia.org/T243591 (10Vgutierrez) [11:50:56] 10Operations, 10Traffic: varnishmtail panics on buster - https://phabricator.wikimedia.org/T243591 (10Vgutierrez) p:05Triage→03Normal [11:57:35] 10Operations, 10Traffic: varnishmtail panics on buster - https://phabricator.wikimedia.org/T243591 (10Vgutierrez) that panic comes from mtail itself, after all varnishmtail is running: `/usr/bin/varnishncsa -n frontend -c -b -F "${FMT}" | mtail -progs "${PROGS}" -logs /dev/stdin` on the same host, atsmtail l... 
[12:03:40] (03PS1) 10Muehlenhoff: Extend access for jsamra [puppet] - 10https://gerrit.wikimedia.org/r/567013 [12:06:48] (03PS2) 10Hnowlan: mediawiki: check mw versions match those on the deploy server [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) [12:08:03] (03CR) 10Reedy: mediawiki: check mw versions match those on the deploy server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) (owner: 10Hnowlan) [12:10:35] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for jsamra [puppet] - 10https://gerrit.wikimedia.org/r/567013 (owner: 10Muehlenhoff) [12:13:22] (03PS3) 10Hnowlan: mediawiki: check mw versions match those on the deploy server [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) [12:15:30] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: check mw versions match those on the deploy server [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) (owner: 10Hnowlan) [12:17:56] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:17:56] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:18:54] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas 
[12:19:46] (03CR) 10Hnowlan: mediawiki: check mw versions match those on the deploy server (0314 comments) [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) (owner: 10Hnowlan) [12:19:54] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:20:08] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:21:12] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:21:58] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:22:10] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:28:52] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 595 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:29:08] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 33 probes of 517 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:29:38] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 28 probes of 517 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:30:44] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 3 probes of 595 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:30:50] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 27 probes of 517 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:32:00] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 28 probes of 517 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:32:35] (03PS4) 10Hnowlan: mediawiki: check mw versions match those on the deploy server [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) [12:36:12] (03CR) 10Reedy: mediawiki: check mw versions match those on the deploy server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) (owner: 10Hnowlan) [12:36:53] (03PS1) 10Arturo Borrero Gonzalez: toolforge: elasticsearch: nginx: fix 
dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/567022 (https://phabricator.wikimedia.org/T236606) [12:38:06] (03CR) 10Hnowlan: mediawiki: check mw versions match those on the deploy server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) (owner: 10Hnowlan) [12:39:04] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:39:04] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:39:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: elasticsearch: nginx: fix dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/567022 (https://phabricator.wikimedia.org/T236606) (owner: 10Arturo Borrero Gonzalez) [12:40:02] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:41:00] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:41:07] (03PS1) 10Mvolz: Remove config for xisbn [deployment-charts] - 10https://gerrit.wikimedia.org/r/567023 [12:41:16] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: 
CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:41:34] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:42:20] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:46:56] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 2 probes of 591 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:49:50] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 34 probes of 517 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:49:54] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 595 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:50:18] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 3 probes of 595 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:50:28] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 3 probes of 595 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:50:46] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 28 probes of 517 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:51:52] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 3 probes of 595 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:52:00] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 27 probes of 517 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:52:02] (03PS1) 10Arturo Borrero Gonzalez: wmcs: eqiad1: repool cloudvirt1013 [puppet] - 10https://gerrit.wikimedia.org/r/567024 (https://phabricator.wikimedia.org/T241313) [12:53:06] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 28 probes of 517 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:53:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: eqiad1: repool cloudvirt1013 [puppet] - 10https://gerrit.wikimedia.org/r/567024 (https://phabricator.wikimedia.org/T241313) (owner: 10Arturo Borrero Gonzalez) [12:57:22] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1013 
- https://phabricator.wikimedia.org/T242472 (10aborrero) [12:57:24] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install/deploy labvirt1012 labvirt1013 labvirt1014 nodes (cloudvirt1012 cloudvirt1013 cloudvirt1014) - https://phabricator.wikimedia.org/T138509 (10aborrero) [12:59:17] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1014 - https://phabricator.wikimedia.org/T241494 (10aborrero) [13:00:52] PROBLEM - Check systemd state on cp4032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:04] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1014 - https://phabricator.wikimedia.org/T241494 (10aborrero) @Jclark-ctr I believe this server may need the BBU checked/replaced, but I may be wrong. [13:01:53] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1014 - https://phabricator.wikimedia.org/T241494 (10aborrero) BTW this server has active workloads (pooled) at the moment. Please @Jclark-ctr coordinate with WMCS before shutting server down. [13:02:07] (03PS2) 10Ema: cache: remove/update cache_text backend VTC [puppet] - 10https://gerrit.wikimedia.org/r/566797 (https://phabricator.wikimedia.org/T241239) [13:06:30] (03Abandoned) 10Gehel: [WIP] build with maven instead of bazel [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/500733 (owner: 10Gehel) [13:06:43] (03Abandoned) 10Gehel: Cleanup a few warnings. 
[software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/500737 (owner: 10Gehel) [13:08:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Improve support for extra_args [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/496567 (owner: 10BryanDavis) [13:09:23] (03Abandoned) 10Gehel: WIP - Tune thread for osm2pgsql / postgres max connections for Maps [puppet] - 10https://gerrit.wikimedia.org/r/293320 (https://phabricator.wikimedia.org/T137229) (owner: 10Gehel) [13:09:27] (03Abandoned) 10Gehel: Move es-tool to a proper python package [puppet] - 10https://gerrit.wikimedia.org/r/290765 (owner: 10Gehel) [13:09:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Remove lighttpd-precise handling [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/496566 (owner: 10BryanDavis) [13:09:33] (03Abandoned) 10Gehel: WIP - postgresql initialization script [puppet] - 10https://gerrit.wikimedia.org/r/304453 (owner: 10Gehel) [13:09:44] (03Abandoned) 10Gehel: Adding Icinga checks for Maps [puppet] - 10https://gerrit.wikimedia.org/r/291023 (https://phabricator.wikimedia.org/T135647) (owner: 10Gehel) [13:12:22] (03PS2) 10Alexandros Kosiaris: rbac: Move under common/ directory [deployment-charts] - 10https://gerrit.wikimedia.org/r/566998 [13:12:24] (03PS1) 10Alexandros Kosiaris: admin: Use a template to drop values symlinks [deployment-charts] - 10https://gerrit.wikimedia.org/r/567035 [13:13:30] RECOVERY - Check systemd state on cp4032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:10] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Use a template to drop values symlinks [deployment-charts] - 10https://gerrit.wikimedia.org/r/567035 (owner: 10Alexandros Kosiaris) [13:14:23] (03CR) 10Alexandros Kosiaris: [C: 03+2] rbac: Move under common/ directory [deployment-charts] - 10https://gerrit.wikimedia.org/r/566998 (owner: 10Alexandros Kosiaris) [13:14:29] (03Merged) 10jenkins-bot: admin: 
Use a template to drop values symlinks [deployment-charts] - 10https://gerrit.wikimedia.org/r/567035 (owner: 10Alexandros Kosiaris) [13:14:39] (03Merged) 10jenkins-bot: rbac: Move under common/ directory [deployment-charts] - 10https://gerrit.wikimedia.org/r/566998 (owner: 10Alexandros Kosiaris) [13:14:40] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1014 - https://phabricator.wikimedia.org/T241494 (10aborrero) [13:15:22] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1014 - https://phabricator.wikimedia.org/T241494 (10aborrero) [13:16:01] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [13:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:48] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:19:28] (03CR) 10Ema: [C: 03+2] cache: remove/update cache_text backend VTC [puppet] - 10https://gerrit.wikimedia.org/r/566797 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [13:19:51] (03CR) 10Ema: [C: 03+2] cache: remove/update cache_text backend VTC [puppet] - 10https://gerrit.wikimedia.org/r/566797 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [13:22:50] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 3 probes of 595 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:34:08] (03PS1) 10Marostegui: db1125: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/567042 (https://phabricator.wikimedia.org/T232446) [13:35:15] (03CR) 
10Marostegui: [C: 03+2] db1125: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/567042 (https://phabricator.wikimedia.org/T232446) (owner: 10Marostegui) [13:36:44] (03PS2) 10Alexandros Kosiaris: admin: DRY podsecuritypolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/566999 [13:36:46] (03PS2) 10Alexandros Kosiaris: admin: Align staging symlink with production clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/567000 [13:36:48] (03PS2) 10Alexandros Kosiaris: admin: get rid of apply-calico-policy.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/567001 [13:36:50] (03PS2) 10Alexandros Kosiaris: admin: Realign calico policies between clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/567002 [13:36:52] (03PS1) 10Alexandros Kosiaris: admin: DRY cluster-helmfile.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/567043 [13:38:20] (03CR) 10Alexandros Kosiaris: "Already done in https://gerrit.wikimedia.org/r/559106" [deployment-charts] - 10https://gerrit.wikimedia.org/r/567043 (owner: 10Alexandros Kosiaris) [13:38:24] (03Abandoned) 10Alexandros Kosiaris: admin: DRY cluster-helmfile.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/567043 (owner: 10Alexandros Kosiaris) [13:38:40] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: DRY podsecuritypolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/566999 (owner: 10Alexandros Kosiaris) [13:38:46] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Align staging symlink with production clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/567000 (owner: 10Alexandros Kosiaris) [13:38:52] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: get rid of apply-calico-policy.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/567001 (owner: 10Alexandros Kosiaris) [13:39:02] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Realign calico policies between clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/567002 (owner: 10Alexandros 
Kosiaris) [13:40:14] (03Merged) 10jenkins-bot: admin: Align staging symlink with production clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/567000 (owner: 10Alexandros Kosiaris) [13:40:16] (03Merged) 10jenkins-bot: admin: get rid of apply-calico-policy.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/567001 (owner: 10Alexandros Kosiaris) [13:43:28] Anyone got a link for the server performance logs? [13:43:43] I'm seeing some "timeouts" on English Wikisource/Wikivoyage [13:43:53] And not even a WMF error [13:46:20] ShakespeareFan00: hi! Can you please follow https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue and report back? [13:48:14] (03PS3) 10Alexandros Kosiaris: admin: Realign calico policies between clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/567002 [13:49:22] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:51:46] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:52:08] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:52:14] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:53:36] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 4 probes of 595 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:54:04] ripe's API seems in trouble [13:54:18] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/565617 (owner: 10Muehlenhoff) [13:55:46] (03Abandoned) 10Gehel: wdqs: collect JMX metrics from ConcurrentHttpRequestsFilter [puppet] - 10https://gerrit.wikimedia.org/r/463511 (https://phabricator.wikimedia.org/T204364) (owner: 10Gehel) [13:56:06] I'm manually testing some of the https://atlas.ripe.net/api endpoints and they do occasionally timeout after 60s with 504 from nginx [13:56:23] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1014 - https://phabricator.wikimedia.org/T241494 (10Jclark-ctr) @aborrero It is out of warranty i do have a spare bbu and can replace it. @JHedden and i had spoken briefly regarding this one last night. I am on site n... [13:56:29] (03Abandoned) 10Gehel: wdqs: allow configuring wdqs-updater heap size [puppet] - 10https://gerrit.wikimedia.org/r/475717 (https://phabricator.wikimedia.org/T210290) (owner: 10Gehel) [13:56:30] ema: which HTTP method? GET? [13:56:33] (03Abandoned) 10Gehel: elasticsearch: tuning of zen discovery settings [puppet] - 10https://gerrit.wikimedia.org/r/316976 (https://phabricator.wikimedia.org/T154765) (owner: 10Gehel) [13:56:38] vgutierrez: GET [13:57:14] that sounds like Excimer GET timeout [13:57:34] from: https://wikitech.wikimedia.org/wiki/HTTP_timeouts [13:57:45] some beautiful guy wrote that [13:58:14] PROBLEM - Check systemd state on cp4032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:58:21] damn mtail [14:00:34] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:01:06] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:01:14] the 504s are on RIPE's side, to be clear [14:01:30] (03Abandoned) 10Gehel: Fixed typo (cron instead of from). [puppet] - 10https://gerrit.wikimedia.org/r/294679 (owner: 10Gehel) [14:01:48] (03Abandoned) 10Gehel: portals: cleanup Apache configuration template [puppet] - 10https://gerrit.wikimedia.org/r/340132 (owner: 10Gehel) [14:01:51] (03Abandoned) 10Gehel: [WIP] elasticsearch - move to ecdsa certificates and tlsproxy module [puppet] - 10https://gerrit.wikimedia.org/r/345130 (owner: 10Gehel) [14:01:55] (03Abandoned) 10Gehel: osm - trying to fix tests [puppet] - 10https://gerrit.wikimedia.org/r/345866 (owner: 10Gehel) [14:02:08] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 27 probes of 517 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:02:16] (03Abandoned) 10Gehel: logrotate - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342232 (owner: 10Gehel) [14:02:19] (03Abandoned) 10Gehel: confd - switch to logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373517 (owner: 10Gehel) [14:02:20] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 
3 probes of 595 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:02:22] (03Abandoned) 10Gehel: dumps - switch to logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373518 (owner: 10Gehel) [14:02:25] (03Abandoned) 10Gehel: service::node - add a defined() guard on git deployment [puppet] - 10https://gerrit.wikimedia.org/r/347855 (owner: 10Gehel) [14:02:28] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:04:37] ACKNOWLEDGEMENT - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out Ema RIPEs API is currently timing out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:04:37] ACKNOWLEDGEMENT - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out Ema RIPEs API is currently timing out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:04:37] ACKNOWLEDGEMENT - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out Ema RIPEs API is currently timing out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:04:37] ACKNOWLEDGEMENT - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - 
/usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out Ema RIPEs API is currently timing out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:04:37] ACKNOWLEDGEMENT - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out Ema RIPEs API is currently timing out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:04:53] (03PS3) 10Jbond: WIP: puppetvagrant [puppet] - 10https://gerrit.wikimedia.org/r/561858 (owner: 10Ema) [14:05:51] (03CR) 10Jbond: "This change is ready for review." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/561858 (owner: 10Ema) [14:08:33] (03PS3) 10Alexandros Kosiaris: Deduplicate cluster-helmfile.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/559106 [14:09:01] (03CR) 10Alexandros Kosiaris: [C: 03+2] Deduplicate cluster-helmfile.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/559106 (owner: 10Alexandros Kosiaris) [14:09:25] (03Merged) 10jenkins-bot: Deduplicate cluster-helmfile.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/559106 (owner: 10Alexandros Kosiaris) [14:11:16] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:11:22] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:11:28] RECOVERY - IPv6 ping 
to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 28 probes of 517 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:12:34] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 32 probes of 517 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:14:34] RECOVERY - Check systemd state on cp4032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:15:44] (03PS3) 10Alexandros Kosiaris: cluster-helmfile: Add a simple sleep 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/559107 [14:16:25] <_joe_> uhmmm [14:16:46] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:16:52] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 38985056 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:17:17] (03CR) 10Alexandros Kosiaris: [C: 03+2] calico: Remove all urldownloader IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/565016 (https://phabricator.wikimedia.org/T224551) (owner: 10Alexandros Kosiaris) [14:17:21] (03PS3) 10Alexandros Kosiaris: calico: Remove all urldownloader IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/565016 (https://phabricator.wikimedia.org/T224551) [14:18:26] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:19:22] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 3 probes of 595 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:19:25] (03PS4) 10Alexandros Kosiaris: cluster-helmfile: Add a simple sleep 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/559107 [14:20:28] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 14032 and 66 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:20:30] (03CR) 10Alexandros Kosiaris: [C: 03+2] cluster-helmfile: Add a simple sleep 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/559107 (owner: 10Alexandros Kosiaris) [14:21:11] (03Merged) 10jenkins-bot: cluster-helmfile: Add a 
simple sleep 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/559107 (owner: 10Alexandros Kosiaris) [14:22:30] (03CR) 10Jhedden: [C: 03+2] wmcs-cold-migrate: update glance for v2 client [puppet] - 10https://gerrit.wikimedia.org/r/566788 (owner: 10Jhedden) [14:23:41] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [14:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:13] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [14:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:18] (03CR) 10Alexandros Kosiaris: Add recommendation-api chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/565788 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [14:26:42] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[14:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:48] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 29 probes of 517 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:27:56] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 3 probes of 595 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:28:08] !log uploaded mtail 3.0.0~rc5-1~bpo9+1wmf2 to apt.wm.o (buster) - T243591 [14:28:10] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 28 probes of 517 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:11] T243591: varnishmtail panics on buster - https://phabricator.wikimedia.org/T243591 [14:28:51] 10Operations, 10Traffic: varnishmtail panics on buster - https://phabricator.wikimedia.org/T243591 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez Solved after rebuilding mtail 3.0.0~rc5 for buster [14:28:54] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [14:35:16] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 3 probes of 595 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:37:07] 10Operations, 10Citoid: Request took down both zotero and citoid (exceeding memory) - 
https://phabricator.wikimedia.org/T243444 (10Mvolz) >>! In T243444#5826816, @akosiaris wrote: >> Thanks. That's working now, but I've downloaded the log file and it's just what's already available on kibana, warn level or hig... [14:39:08] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:42:08] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (10JHedden) 05Resolved→03Open Drive 9 reported a lot of errors while rebuilding the RAID array, and now drives 2, 4, and 9 are missing from the RAI... [14:42:09] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10JHedden) [14:43:56] 10Operations, 10Traffic: varnishmtail panics on buster - https://phabricator.wikimedia.org/T243591 (10Vgutierrez) reported to upstream as https://github.com/google/mtail/issues/289 [14:45:26] (03CR) 10CDanis: "Are these configs even needed anymore? ats-tls is everywhere, right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/560514 (owner: 10Jcrespo) [14:49:33] (03CR) 10Vgutierrez: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/560514 (owner: 10Jcrespo) [14:49:56] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:50:08] cdanis: ^^ localssl is still being used, but not on the edge of the caching layer [14:50:09] (03CR) 10CDanis: [C: 03+1] Revert "Increase nginx limits on http resp hdr block size" [puppet] - 10https://gerrit.wikimedia.org/r/560514 (owner: 10Jcrespo) [14:50:26] vgutierrez: ah sure, seems fine to revert that then [14:55:22] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 2 probes of 591 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:02:36] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:02:55] heh [15:03:02] RIPE AtlasPI servers having some problems huh? 
[15:03:07] s/PI/ API/ [15:07:32] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 3 probes of 595 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:11:10] 10Operations, 10LDAP-Access-Requests: Allow LDAP access to superset dashboards for Moushira Elamrawy - https://phabricator.wikimedia.org/T242000 (10MoritzMuehlenhoff) 05Open→03Stalled [15:25:28] (03PS1) 10Ottomata: eventstreams - scrape prometheus port [deployment-charts] - 10https://gerrit.wikimedia.org/r/567079 [15:26:27] (03PS1) 10Jbond: beaker: application testsing [puppet] - 10https://gerrit.wikimedia.org/r/567080 [15:27:25] (03PS2) 10Ottomata: eventstreams - scrape prometheus port [deployment-charts] - 10https://gerrit.wikimedia.org/r/567079 [15:28:16] cdanis: yup, their endpoints have been timing out for a while now [15:28:17] (03CR) 10jerkins-bot: [V: 04-1] beaker: application testsing [puppet] - 10https://gerrit.wikimedia.org/r/567080 (owner: 10Jbond) [15:29:22] (03CR) 10Ottomata: [C: 03+2] eventstreams - scrape prometheus port [deployment-charts] - 10https://gerrit.wikimedia.org/r/567079 (owner: 10Ottomata) [15:29:25] (03PS3) 10Ottomata: eventstreams - scrape prometheus port [deployment-charts] - 10https://gerrit.wikimedia.org/r/567079 [15:29:28] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventstreams - scrape prometheus port [deployment-charts] - 10https://gerrit.wikimedia.org/r/567079 (owner: 10Ottomata) [15:31:10] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T243605 (10ops-monitoring-bot) [15:33:13] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T243605 (10JHedden) [15:33:16] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs1005.eqiad.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [15:33:20] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (10JHedden) [15:33:35] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [15:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:50] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1007.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1004.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1007.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:36:14] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1005.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [15:36:40] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:36:56] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:37:36] (03PS1) 10BryanDavis: rebuild_all: Add a --no-run-if-empty guard [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/567081 [15:41:45] (03CR) 10BryanDavis: [C: 03+2] rebuild_all: Add a --no-run-if-empty guard [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/567081 (owner: 10BryanDavis) [15:42:00] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:42:26] (03Merged) 10jenkins-bot: rebuild_all: Add a --no-run-if-empty guard [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/567081 (owner: 10BryanDavis) [15:43:37] !log restart blazegraph + updater on wdqs1007 (seems stuck, known 
issue) [15:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:55] (03PS1) 10MSantos: Disable replication cron in eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/567088 [15:55:16] gehel: ^ [15:55:56] 10Operations, 10Traffic: varnishmtail panics on buster - https://phabricator.wikimedia.org/T243591 (10colewhite) [15:56:34] (03CR) 10Gehel: [C: 03+2] Disable replication cron in eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/567088 (owner: 10MSantos) [16:09:11] (03PS2) 10Jbond: beaker: application testsing [puppet] - 10https://gerrit.wikimedia.org/r/567080 [16:10:27] (03PS3) 10Jbond: beaker: application testsing [puppet] - 10https://gerrit.wikimedia.org/r/567080 [16:11:05] (03PS4) 10Jbond: beaker: application testing [puppet] - 10https://gerrit.wikimedia.org/r/567080 [16:11:32] (03CR) 10Jbond: "I like this work and think it would be nice to be able to spin up machines with random bits of puppet code. I have also taken this code a" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/562527 (owner: 10Filippo Giunchedi) [16:13:12] (03CR) 10jerkins-bot: [V: 04-1] beaker: application testing [puppet] - 10https://gerrit.wikimedia.org/r/567080 (owner: 10Jbond) [16:13:30] (03PS5) 10Jbond: beaker: application testing [puppet] - 10https://gerrit.wikimedia.org/r/567080 [16:16:02] (03CR) 10jerkins-bot: [V: 04-1] beaker: application testing [puppet] - 10https://gerrit.wikimedia.org/r/567080 (owner: 10Jbond) [16:16:03] 10Operations, 10Citoid: Request took down both zotero and citoid (exceeding memory) - https://phabricator.wikimedia.org/T243444 (10akosiaris) >>! In T243444#5829606, @Mvolz wrote: >>>! In T243444#5826816, @akosiaris wrote: >>> Thanks. That's working now, but I've downloaded the log file and it's just what's al... 
[16:16:11] (03PS6) 10Jbond: beaker: application testing [puppet] - 10https://gerrit.wikimedia.org/r/567080 [16:16:17] (03PS1) 10Ayounsi: FNM bump ban_details_records_count to 1000 [puppet] - 10https://gerrit.wikimedia.org/r/567092 [16:19:58] (03CR) 10CDanis: [C: 03+1] FNM bump ban_details_records_count to 1000 [puppet] - 10https://gerrit.wikimedia.org/r/567092 (owner: 10Ayounsi) [16:20:12] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 24008024 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:20:54] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 22026288 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:22:00] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 42688 and 19 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:22:42] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 57504 and 60 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:24:12] (03PS2) 10Bstorm: wiki replicas: Remove outdated comment about spamblacklist [puppet] - 10https://gerrit.wikimedia.org/r/561352 (https://phabricator.wikimedia.org/T241668) (owner: 10BryanDavis) [16:26:48] (03CR) 10Bstorm: [C: 03+2] wiki replicas: Remove outdated comment about spamblacklist [puppet] - 10https://gerrit.wikimedia.org/r/561352 (https://phabricator.wikimedia.org/T241668) (owner: 10BryanDavis) [16:29:45] (03PS1) 10CDanis: fastnetmon: set a very short ban_time [puppet] - 10https://gerrit.wikimedia.org/r/567093 (https://phabricator.wikimedia.org/T237587) [16:36:03] (03CR) 10Ayounsi: [C: 03+2] FNM bump ban_details_records_count to 1000 [puppet] - 10https://gerrit.wikimedia.org/r/567092 (owner: 10Ayounsi) [16:38:51] (03PS1) 10Bstorm: nfs: 
fixup the test class to work on VMs a bit more [puppet] - 10https://gerrit.wikimedia.org/r/567095 [16:39:12] (03CR) 10Ayounsi: [C: 03+1] fastnetmon: set a very short ban_time [puppet] - 10https://gerrit.wikimedia.org/r/567093 (https://phabricator.wikimedia.org/T237587) (owner: 10CDanis) [16:40:42] PROBLEM - Host ms-be2030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:45:51] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (10JHedden) During the next rebuild the RAID array kicked out Drive 4. Either we have 3 bad drives 2, 4 and 9 or the RAID adapter is bad. I'll send the... [16:50:16] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (10JHedden) This host has been depooled from production and has no running workloads. [17:01:47] (03PS2) 10Bstorm: nfs: fixup the test class to work on VMs a bit more [puppet] - 10https://gerrit.wikimedia.org/r/567095 [17:04:32] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:04:32] (03CR) 10Bstorm: [C: 03+2] nfs: fixup the test class to work on VMs a bit more [puppet] - 10https://gerrit.wikimedia.org/r/567095 (owner: 10Bstorm) [17:12:55] (03PS2) 10CDanis: fastnetmon: set a very short ban_time [puppet] - 10https://gerrit.wikimedia.org/r/567093 (https://phabricator.wikimedia.org/T237587) [17:14:55] (03CR) 10CDanis: [C: 03+2] fastnetmon: set a very short ban_time [puppet] - 10https://gerrit.wikimedia.org/r/567093 (https://phabricator.wikimedia.org/T237587) (owner: 10CDanis) [17:19:49] (03PS1) 10Alexandros Kosiaris: wikifeeds: Add tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/567100 
[17:25:01] jouncebot: now [17:25:01] No deployments scheduled for the next 234 hour(s) and 4 minute(s) [17:33:13] (03PS1) 10Alexandros Kosiaris: WIP: wikifeeds: Move to the debug functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/567103 [17:33:32] (03CR) 10jerkins-bot: [V: 04-1] WIP: wikifeeds: Move to the debug functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/567103 (owner: 10Alexandros Kosiaris) [17:40:32] (03PS1) 10Tchanders: Disable Special:Investigate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567106 [17:41:43] (03PS2) 10Tchanders: Disable Special:Investigate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567106 [17:45:38] (03CR) 10Jforrester: [C: 03+2] Disable Special:Investigate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567106 (owner: 10Tchanders) [17:46:39] (03Merged) 10jenkins-bot: Disable Special:Investigate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567106 (owner: 10Tchanders) [17:50:10] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:50:30] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:51:06] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:51:08] PROBLEM - IPv4 ping to ulsfo 
on ripe-atlas-ulsfo is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:51:08] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:51:18] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:51:42] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:51:42] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:51:46] (03PS1) 10Jforrester: IS: Move all CheckUser config together [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567108 [17:52:05] (03CR) 10Jforrester: [C: 03+2] IS: Move all CheckUser config together [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567108 (owner: 10Jforrester) [17:53:14] (03Merged) 10jenkins-bot: IS: Move all CheckUser config together [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567108 (owner: 10Jforrester) [17:54:52] !log 
jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Clean up CheckUser config (duration: 01m 09s) [17:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:08] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 595 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:55:28] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 3 probes of 595 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:56:06] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 27 probes of 517 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:56:08] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 34 probes of 517 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:56:26] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 3 probes of 595 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:56:40] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 3 probes of 595 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:56:43] sigh, that failure should be an UNKNOWN. 
really the whole thing should be a check_prometheus at this point [17:56:44] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 28 probes of 517 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:56:48] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 28 probes of 517 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:58:15] (03PS1) 10CDanis: check_ripe_atlas: HTTP failures are UNKNOWN not CRITICAL [puppet] - 10https://gerrit.wikimedia.org/r/567109 [18:07:12] (03CR) 10RLazarus: "I've only looked at the Python here. Good structure!" (0313 comments) [puppet] - 10https://gerrit.wikimedia.org/r/566888 (https://phabricator.wikimedia.org/T224888) (owner: 10CDanis) [18:12:20] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:12:30] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:12:54] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:12:54] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is 
CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:13:52] 10Operations, 10DNS, 10Domains, 10Traffic: Donate wikiźródła.pl and wikisłownik.pl to the Foundation - https://phabricator.wikimedia.org/T240446 (10CRoslof) 05Stalled→03Resolved a:03CRoslof These domain names have now been transferred to the Foundation and I've updated them to use the Foundation's na... [18:14:31] (03CR) 10Volans: [C: 03+1] "I'm ok with this as a temporary change towards probably a prometheus approach, there are many other things to fix in this script and the c" [puppet] - 10https://gerrit.wikimedia.org/r/567109 (owner: 10CDanis) [18:15:35] volans: rlazarus: something that would be nice to have is a utility library for writing icinga check commands. there's a bunch of reusable stuff like exit code values and also setting the right User-Agent by policy [18:15:40] (related to these two CRs just now ;) [18:16:12] agree, there have been an experiment in the past to use a python module, but arguably makes thing easier [18:16:16] so it didn't get much traction [18:17:08] in general I think we could aim for a python wmflib module with a bunch of foundation stuff that is too generic even for spicerack, in the sense that are needed in multiple scripts around the fleet [18:17:24] and not only for orchestration [18:17:32] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 3 probes of 595 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:17:38] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 3 probes of 595 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:17:54] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 28 probes of 517 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:17:56] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 28 probes of 517 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:18:17] agree with both of you but I think it's worth dividing more finely [18:18:25] "from wmflib import icinga_check" or whatever [18:18:56] which should definitely include a more generic wmflib user-agent helper, and so on [18:19:19] (03Abandoned) 10Dzahn: switch discovery record for etherpad from 1001 to 1002 [dns] - 10https://gerrit.wikimedia.org/r/566906 (https://phabricator.wikimedia.org/T224580) (owner: 10Dzahn) [18:20:19] sure [18:20:50] when I wrote `class NagiosExitCode(IntEnum):` I was amused no one else had done so before [18:21:03] (03Abandoned) 10Dzahn: add etherpad.discovery.wmnet, point to etherpad1001 [dns] - 10https://gerrit.wikimedia.org/r/566905 (owner: 10Dzahn) [18:21:29] also a bunch of the structure of setting up logging and stuff I re-used from check_prometheus when writing check_librenms, there's maybe some common bits to be extracted there as well [18:21:40] nod [18:22:18] (03Abandoned) 10Dzahn: trafficserver/varnish: use discovery for etherpad, drop etherpad1001 director [puppet] - 10https://gerrit.wikimedia.org/r/566907 (owner: 10Dzahn) [18:23:41] (03CR) 10RLazarus: Icinga alert for LibreNMS critical alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/566888 (https://phabricator.wikimedia.org/T224888) (owner: 10CDanis) 
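The exchange above (a shared helper for Icinga check commands, `class NagiosExitCode(IntEnum):`, and reporting HTTP failures as UNKNOWN rather than CRITICAL) could be sketched roughly as follows. This is a minimal illustration only, not the code from the Gerrit changes mentioned in the log: `run_check`, `fetch_failed_probes`, and the "failed >= threshold" rule are hypothetical names and assumptions; only the four exit code values are the standard Nagios plugin convention.

```python
import urllib.error
from enum import IntEnum


class NagiosExitCode(IntEnum):
    """Standard Nagios/Icinga plugin exit codes."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def run_check(fetch_failed_probes, alert_threshold):
    """Return (exit code, status line) for a probe-count check.

    fetch_failed_probes is a hypothetical callable that returns the number
    of failed probes and may raise urllib.error.URLError (HTTPError is a
    subclass) when the upstream API is unavailable.
    """
    try:
        failed = fetch_failed_probes()
    except urllib.error.URLError as exc:
        # The API being down says nothing about our own network, so the
        # result is UNKNOWN rather than CRITICAL.
        return NagiosExitCode.UNKNOWN, "UNKNOWN - API request failed: %s" % exc
    # Assumed rule: alert once the failure count reaches the threshold.
    if failed >= alert_threshold:
        return NagiosExitCode.CRITICAL, "CRITICAL - failed %d probes" % failed
    return NagiosExitCode.OK, "OK - failed %d probes" % failed
```

A real plugin would print the status line and pass the code to `sys.exit()`; returning both values here just keeps the sketch easy to test.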
[18:24:12] cdanis: yes it has been done before [18:24:21] in similar ways [18:24:27] and surely all different from each other ;) [18:24:50] (03CR) 10Ayounsi: "Only did a superficial swoop on the python file. Checked the Puppet side." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/566888 (https://phabricator.wikimedia.org/T224888) (owner: 10CDanis) [18:26:07] (03PS1) 10Dzahn: install_server: remove etherpad1001 from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/567113 (https://phabricator.wikimedia.org/T224580) [18:26:59] (03CR) 10Dzahn: [C: 03+2] install_server: remove etherpad1001 from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/567113 (https://phabricator.wikimedia.org/T224580) (owner: 10Dzahn) [18:34:25] (03PS1) 10Dzahn: remove etherpad1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/567115 (https://phabricator.wikimedia.org/T224580) [18:38:39] (03PS2) 10Majavah: Add logos for ngwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565724 (https://phabricator.wikimedia.org/T242416) [18:39:54] (03CR) 10Majavah: "Hi Jforrester, thanks for letting me know. 
I've attempted to convert the files to correct sizes, however my limited image editing skills w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565724 (https://phabricator.wikimedia.org/T242416) (owner: 10Majavah) [18:44:37] (03PS1) 10Bstorm: labstore: remove profile from top-level module [puppet] - 10https://gerrit.wikimedia.org/r/567116 (https://phabricator.wikimedia.org/T224582) [18:48:59] (03PS3) 10Jforrester: Add logos for ngwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565724 (https://phabricator.wikimedia.org/T242416) (owner: 10Majavah) [18:49:37] (03CR) 10Jforrester: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565724 (https://phabricator.wikimedia.org/T242416) (owner: 10Majavah) [19:05:24] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:05:40] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:06:18] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:06:26] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:06:26] PROBLEM 
- IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:06:28] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:10:32] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 595 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:10:48] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 3 probes of 595 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:11:22] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 3 probes of 595 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:11:24] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 27 probes of 517 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:11:26] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 33 probes of 517 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:11:30] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 3 probes of 595 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:13:31] (03CR) 10Majavah: "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565724 (https://phabricator.wikimedia.org/T242416) (owner: 10Majavah) [19:19:13] (03CR) 10Jhedden: [C: 03+1] labstore: remove profile from top-level module [puppet] - 10https://gerrit.wikimedia.org/r/567116 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [19:34:52] (03CR) 10RLazarus: Icinga alert for LibreNMS critical alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/566888 (https://phabricator.wikimedia.org/T224888) (owner: 10CDanis) [19:37:29] (03CR) 10CDanis: [C: 03+2] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/567109 (owner: 10CDanis) [19:40:18] ok ripe atlas noise should be over [19:44:27] thx [19:49:38] (03CR) 10Bstorm: "Ran PCC and found that on the primary cluster it re-ordered some package installs, but it didn't change any of them" [puppet] - 10https://gerrit.wikimedia.org/r/567116 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [19:51:01] (03PS5) 10CDanis: Icinga alert for LibreNMS critical alerts [puppet] - 10https://gerrit.wikimedia.org/r/566888 (https://phabricator.wikimedia.org/T224888) [19:51:21] (03CR) 10CDanis: "Thanks for the reviews!" 
(0317 comments) [puppet] - 10https://gerrit.wikimedia.org/r/566888 (https://phabricator.wikimedia.org/T224888) (owner: 10CDanis) [19:52:48] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/567116 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [19:53:02] (03PS2) 10Bstorm: labstore: remove profile from top-level module [puppet] - 10https://gerrit.wikimedia.org/r/567116 (https://phabricator.wikimedia.org/T224582) [19:59:57] (03PS1) 10Jhedden: aptrepo: thirdparty/openstack-pike-stretch: add python-openstackclient [puppet] - 10https://gerrit.wikimedia.org/r/567127 (https://phabricator.wikimedia.org/T241347) [20:07:10] (03CR) 10Bstorm: [C: 03+2] labstore: remove profile from top-level module [puppet] - 10https://gerrit.wikimedia.org/r/567116 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [20:16:27] (03CR) 10Jhedden: [C: 03+2] aptrepo: thirdparty/openstack-pike-stretch: add python-openstackclient [puppet] - 10https://gerrit.wikimedia.org/r/567127 (https://phabricator.wikimedia.org/T241347) (owner: 10Jhedden) [20:58:48] (03CR) 10Volans: "Nicer, some comment inline, in addition 2 main comments:" (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/562408 (https://phabricator.wikimedia.org/T231512) (owner: 10CRusnov) [20:59:53] (03PS1) 10Bstorm: labstore: finish up making this class work on VMs [puppet] - 10https://gerrit.wikimedia.org/r/567142 (https://phabricator.wikimedia.org/T224582) [21:00:32] (03CR) 10RLazarus: [C: 03+1] Icinga alert for LibreNMS critical alerts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/566888 (https://phabricator.wikimedia.org/T224888) (owner: 10CDanis) [21:08:07] (03CR) 10Bstorm: "This actually works on a livehack into the puppetmaster on my Cloud project. 
Checking the compiler just to be sure it's a noop on the "re" [puppet] - 10https://gerrit.wikimedia.org/r/567142 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [21:09:50] I am seeing some flapping of page loads for wikitech. Not sure yet if one of the labweb hosts is busted or if the CDN servers are having issues or ??? Presenting as 503 responses to my browser [21:10:02] (03CR) 10CDanis: Icinga alert for LibreNMS critical alerts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/566888 (https://phabricator.wikimedia.org/T224888) (owner: 10CDanis) [21:10:09] (03CR) 10Bstorm: "https://puppet-compiler.wmflabs.org/compiler1003/20565/" [puppet] - 10https://gerrit.wikimedia.org/r/567142 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [21:10:13] (03PS6) 10CDanis: Icinga alert for LibreNMS critical alerts [puppet] - 10https://gerrit.wikimedia.org/r/566888 (https://phabricator.wikimedia.org/T224888) [21:11:37] and as soon as I say something on irc about it it stops ;) [21:13:22] report of a similar 503 from varnish for Phabricator. 
Making me more suspicious of a varnish server having connectivity issues [21:14:03] "Request from 2001:470:b:530:749d:6fe7:f206:c80d via cp4029 frontend, Varnish XID 660232439 Error: 503, Backend fetch failed at Fri, 24 Jan 2020 21:13:33 GMT" -- that was me trying https://phabricator.wikimedia.org/ [21:14:57] "Request from 2001:470:b:530:749d:6fe7:f206:c80d via cp4029 frontend, Varnish XID 668021949 Error: 503, Backend fetch failed at Fri, 24 Jan 2020 21:14:30 GMT" -- https://wikitech.wikimedia.org/wiki/MediaWiki#Infrastructure [21:17:45] https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X?_g=h@839389a&_a=h@09fb9cc [21:17:47] :) [21:18:04] !log ✔️ cdanis@cp4029.ulsfo.wmnet ~ 🕟🍵 sudo depool [21:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:13] not sure what happened [21:19:46] thanks cdanis [21:21:03] I was looking at grafana and it has some weird metrics [21:21:18] also icinga was having 3 checks in unknown for timeout regarding ats [21:21:53] were the 3 *_fifo_* checks for context [21:22:54] the lsofs in those fifo checks were using 100% cpu [21:22:59] and taking multiple seconds to execute [21:23:00] very strange [21:23:46] yeah, still happening [21:24:24] on another host that takes 1.6 seconds; on this one takes 11 seconds [21:24:59] wow [21:25:04] clearly not healthy [21:26:21] it seems it started around 20:39 [21:27:17] ehm [21:27:24] the machine has many many many file descriptors open [21:30:12] there is a varnishd, pid 42249, with 499813 fds [21:31:27] smells like a 500k limit [21:31:40] so so so sooooooo many procs in TIME_WAIT [21:32:37] collected some netstat output in my homedir [21:33:59] (03CR) 10Bstorm: "Since it's noops, actually, I'm merging this."
[puppet] - 10https://gerrit.wikimedia.org/r/567142 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [21:34:02] (03CR) 10Bstorm: [C: 03+2] labstore: finish up making this class work on VMs [puppet] - 10https://gerrit.wikimedia.org/r/567142 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [21:34:22] (03CR) 10Bstorm: [C: 03+2] labstore: finish up making this class work on VMs [puppet] - 10https://gerrit.wikimedia.org/r/567142 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [21:42:27] (03PS1) 10RLazarus: Refactor for better multi-host execution: [software/httpbb] - 10https://gerrit.wikimedia.org/r/567147 [21:44:49] (03CR) 10RLazarus: "Happy All Hands, enjoy this large code review! 🙃" [software/httpbb] - 10https://gerrit.wikimedia.org/r/567147 (owner: 10RLazarus) [22:19:31] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [22:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:51] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [22:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:51] !log shutting down etherpad1001 - service fully migrated to etherpad1002 - running decom cookbook on ganeti VM (T224580) [22:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:54] T224580: Migrate etherpad1001 to Buster - https://phabricator.wikimedia.org/T224580 [22:26:39] volans: ran the decom script on ganeti VM. it tells me i should manually check if shutdown has worked. result is i can't ssh to it anymore but in gnt-instance list it is still shown as running. 
so for now i use gnt-instance remove [22:27:46] mutante: yes, I've added the capability to spicerack to do gnt-* commands yesterday, will be able to handle that part too soon [22:27:47] and then i checked netbox because it tells me to set the status and wanted to set it to decommissioned but i noticed netbox only offers me "Active", "Offline" and "Staged". so "offline" for "removed" ? [22:27:50] but for now yes [22:28:00] mutante: no need to do anything on netbox [22:28:12] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% [22:28:13] there is an automatic sync of vms from ganeti [22:28:32] volans: oh, ok! cool, i said that because of "Set Netbox status on VM not yet supported: **manual intervention required** [22:28:46] i saw you were working on gnt-instance support i think. ACK, thank you, cool! [22:29:03] yeah could be clearer [22:29:30] RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms [22:30:07] and this time i ran it before removing the IP from DNS, heh :p [22:31:44] !log ganeti1003 - sudo gnt-instance remove etherpad1001.eqiad.wmnet (T224580) [22:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:47] T224580: Migrate etherpad1001 to Buster - https://phabricator.wikimedia.org/T224580 [22:37:56] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 36 probes of 517 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:38:35] (03PS2) 10Dzahn: remove etherpad1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/567115 (https://phabricator.wikimedia.org/T224580) [22:41:16] ^ re: icinga.
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts says to first check real traffic usage and i did and that looks normal to go down like that in the 24 hour view [22:41:23] this is clearly the option "If flapping with a failing number of probe close to the threshold, its possibly a false positive, monitor/downtime and open a high priority Netops task" [22:41:31] as it's 36 vs 35 [22:41:51] (03PS1) 10C. Scott Ananian: The preprocessorClass property in $wgParserConf doesn't do anything any more [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567155 (https://phabricator.wikimedia.org/T204945) [22:43:11] (03CR) 10Dzahn: [C: 03+2] "VM has been decom'ed and removed in ganeti" [dns] - 10https://gerrit.wikimedia.org/r/567115 (https://phabricator.wikimedia.org/T224580) (owner: 10Dzahn) [22:43:42] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 29 probes of 517 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:44:30] there we go. 29 probes failing is still ok. < 35 [22:48:17] (03PS1) 10C. Scott Ananian: The $wgMaxGeneratedPPNodeCount configuration variable no longer has any effect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567157 (https://phabricator.wikimedia.org/T204945) [22:50:21] (03CR) 10jerkins-bot: [V: 04-1] The $wgMaxGeneratedPPNodeCount configuration variable no longer has any effect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567157 (https://phabricator.wikimedia.org/T204945) (owner: 10C. Scott Ananian) [22:52:33] mutante: as of a few weeks ago there's also some historical data on grafana (also linked in the alert). although the data is a bit spotty today as RIPE Atlas's API servers were having issues [22:58:07] cdanis: nice! 
thanks for pointing that out [23:02:18] (03PS1) 10Bstorm: cloudstore test: add the last couple ferm rules to let drbd work [puppet] - 10https://gerrit.wikimedia.org/r/567160 (https://phabricator.wikimedia.org/T224582) [23:04:36] (03CR) 10CDanis: [C: 03+2] "PCC looks good as far as I can tell. Thanks for the reviews" [puppet] - 10https://gerrit.wikimedia.org/r/566888 (https://phabricator.wikimedia.org/T224888) (owner: 10CDanis) [23:11:41] (03CR) 10Bstorm: "This is only applied on the cloudstore VPS project where it is needed, so merging." [puppet] - 10https://gerrit.wikimedia.org/r/567160 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [23:11:50] (03PS2) 10Bstorm: cloudstore test: add the last couple ferm rules to let drbd work [puppet] - 10https://gerrit.wikimedia.org/r/567160 (https://phabricator.wikimedia.org/T224582) [23:14:21] (03CR) 10Bstorm: [C: 03+2] cloudstore test: add the last couple ferm rules to let drbd work [puppet] - 10https://gerrit.wikimedia.org/r/567160 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [23:24:22] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 39.55 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [23:29:00] esams traffic drop but already coming back up [23:30:50] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 72.79 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [23:35:55] ACKNOWLEDGEMENT - Host ms-be2030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T243646 [23:36:21] (03PS1) 10Aaron Schulz: Remove old APCBagOStuff reference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567162 [23:42:27] (03PS1) 10Dzahn: add 
wikiźródła.pl and wikisłownik.pl, link to parking [dns] - 10https://gerrit.wikimedia.org/r/567163 (https://phabricator.wikimedia.org/T240446) [23:43:57] (03CR) 10Dzahn: "Brandon, just cause i know they just switched over to our NS.. so why not properly add them. Or no?" [dns] - 10https://gerrit.wikimedia.org/r/567163 (https://phabricator.wikimedia.org/T240446) (owner: 10Dzahn) [23:50:36] (03PS1) 10Volans: spicerack: add getter for the Netbox master host [software/spicerack] - 10https://gerrit.wikimedia.org/r/567164 [23:58:59] (03PS5) 10Dzahn: contint: use package_from_component, stop using docker class [puppet] - 10https://gerrit.wikimedia.org/r/566383 (https://phabricator.wikimedia.org/T224591) [23:59:22] (03CR) 10Dzahn: contint: use package_from_component, stop using docker class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/566383 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn)