[00:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Evening SWAT (Max 6 patches) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190213T0000).
[00:00:05] kart_ and Jdlrobson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:45] PROBLEM - Maps - OSM synchronization lag - codfw on icinga2001 is CRITICAL: 2.592e+05 ge 2.592e+05 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1
[00:00:49] oh, I guess I should have checked if jdlrobson was around pre +2: jdlrobson around?
[00:02:24] * kart_ is here with coffee
[00:03:28] * thcipriani waits on jenkins
[00:05:00] testing
[00:05:10] i'm lurking :)
[00:05:45] jdlrobson: cool, since I'm adding a submodule and you have l10nupdates, I may just sync that out all together if that's fine with you
[00:08:50] thcipriani: lemme know when it's on debug1002
[00:08:55] jdlrobson: will do
[00:09:40] James_F: https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Add_new_extension_to_extension-list_and_release_tools makes it sound like the code should be on all servers prior to adding to extension-list (asking since I've seen you do this more recently than I :))
[00:10:57] thcipriani: so, it should be on wmf.16 and wmf.17. Hope that is sufficient (when adding via submodule)
[00:11:48] kart_: yeah, that's my thinking/current plan: add as submodule to wmf.17, merge https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/489627, run scap sync, sound right?
[00:12:06] oh thcipriani gerrit's load shot up
[00:12:08] :(
[00:12:14] (PS3) Paladox: Update bazlets to latest 2.15 commit [software/gerrit/plugins/wikimedia] - https://gerrit.wikimedia.org/r/489422
[00:12:17] paladox: oh good.
[00:12:17] (PS6) Paladox: Add support for "recheck" as button in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - https://gerrit.wikimedia.org/r/489482 (https://phabricator.wikimedia.org/T214631)
[00:12:22] (PS8) Paladox: Introduce gr-wikimedia-prettify-ci-comments [software/gerrit/plugins/wikimedia] - https://gerrit.wikimedia.org/r/489483 (https://phabricator.wikimedia.org/T215658)
[00:12:23] thcipriani https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=cobalt&var-datasource=eqiad%20prometheus%2Fops&var-cluster=misc
[00:12:24] (PS1) Paladox: Add support for cherry picking with merge conflicts from the UI (PolyGerrit) [software/gerrit/plugins/wikimedia] - https://gerrit.wikimedia.org/r/490225
[00:14:17] thcipriani: sounds good.
[00:14:34] Who broke gerrit?
[00:14:37] HTTP 500-ing
[00:14:40] :(
[00:14:44] Reedy no one, high load again
[00:14:49] (PS1) Cwhite: hiera: upgrade prometheus-node-exporter to 0.17 in esams [puppet] - https://gerrit.wikimedia.org/r/490229 (https://phabricator.wikimedia.org/T213708)
[00:14:57] paladox: Well, surely it's the gerrit developers then
[00:15:00] For writing bad code? ;)
[00:15:03] lol
[00:15:18] Reedy it could be related to what someone else experienced a year ago
[00:16:06] load seems to be decreasing, seems like it may have been processing a huge request, this looks nothing like yesterday afaict.
[00:16:25] hmm
[00:16:30] I seem to have a patch in limbo
[00:16:30] (PS9) Paladox: Introduce gr-wikimedia-prettify-ci-comments [software/gerrit/plugins/wikimedia] - https://gerrit.wikimedia.org/r/489483 (https://phabricator.wikimedia.org/T215658)
[00:16:40] thcipriani: extension-list is the first place touched when deploying, and last when undeploying, because it's only used at full scap time.
[00:16:44] ah great thcipriani!
[00:16:44] Gerrit doesn't seem to know it exists, but git won't review
[00:17:12] (CR) Ayounsi: [C: +2] Monitoring: add VRRP check [puppet] - https://gerrit.wikimedia.org/r/481154 (https://phabricator.wikimedia.org/T150264) (owner: Faidon Liambotis)
[00:17:17] thcipriani: And yes, your impression is correct, or everything blows up when you full scap.
[00:17:19] Reedy i guess you pushed it during the high load?
[00:17:43] I attempted to
[00:17:47] Just change the change-id and push again
[00:17:49] $profit
[00:17:57] (PS5) Ayounsi: Monitoring: add VRRP check [puppet] - https://gerrit.wikimedia.org/r/481154 (https://phabricator.wikimedia.org/T150264) (owner: Faidon Liambotis)
[00:18:00] I think RoanKattouw did that too, and his patch broke too
[00:18:04] (PS3) Bstorm: toolforge-k8s: set up an haproxy load balancer for HA api servers [puppet] - https://gerrit.wikimedia.org/r/490201 (https://phabricator.wikimedia.org/T215530)
[00:18:38] (PS2) Paladox: Add support for cherry picking with merge conflicts from the UI (PolyGerrit) [software/gerrit/plugins/wikimedia] - https://gerrit.wikimedia.org/r/490225
[00:18:39] James_F: so, wait, this new extension is a backport to wmf.16, does that mean I need to actually sync-dir the extension code first? then modify extension-list, then scap sync?
[00:18:56] Yup I also repushed mine with a different Change-Id
[00:19:15] * Reedy high fives RoanKattouw
[00:19:18] James_F: or can I fetch changes to php-1.33.0-wmf.16, modify extension-list, and then run scap sync as one step?
[00:19:26] I wonder if the numerical id is lost forever
[00:19:37] thcipriani: Yes. It needs to be in php-….16 and ….17 before the full scap sync.
[00:20:02] thcipriani: i18n gets built first from the files on the deployment host, right?
[00:20:10] yep
[00:20:18] So you don't need to sync the unused code out to servers before building i18n.
[00:20:33] It won't hurt, but it's not necessary to scap sync-dir extensions/Foo.
[00:20:37] ah, ok, that was my assumption, but the documentation gave me pause :)
[00:20:59] (PS4) Bstorm: toolforge-k8s: set up an haproxy load balancer for HA api servers [puppet] - https://gerrit.wikimedia.org/r/490201 (https://phabricator.wikimedia.org/T215530)
[00:21:04] made me think my mental model was broken in this case in some fundamental way
[00:21:20] (as is so often the case)
[00:22:06] James_F: thanks for clarification
[00:22:27] thcipriani: maybe we can update 'documentation' once we finish deploy?
[00:23:10] (CR) Bstorm: [C: +2] toolforge-k8s: set up an haproxy load balancer for HA api servers [puppet] - https://gerrit.wikimedia.org/r/490201 (https://phabricator.wikimedia.org/T215530) (owner: Bstorm)
[00:23:30] kart_: that's a good thought, I'll do that.
[00:24:17] kart_: You must be new around here
[00:24:27] Any time.
[00:24:37] * James_F grins, tsk Reedy.
[00:24:39] Reedy: kind of.
[00:25:02] (CR) Ayounsi: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/14635/" [puppet] - https://gerrit.wikimedia.org/r/481154 (https://phabricator.wikimedia.org/T150264) (owner: Faidon Liambotis)
[00:25:39] (CR) Paladox: [V: +2 C: +2] "Updates bazlets commit to 2b1d68959119920e5fa9bdfb9f0cf926bfef4929 which updates its api to use 2.15.10." [software/gerrit/plugins/wikimedia] - https://gerrit.wikimedia.org/r/489422 (owner: Paladox)
[00:26:38] (PS6) Ayounsi: Monitoring: add VRRP check [puppet] - https://gerrit.wikimedia.org/r/481154 (https://phabricator.wikimedia.org/T150264) (owner: Faidon Liambotis)
[00:26:40] (PS3) Paladox: Add support for cherry picking with merge conflicts from the UI (PolyGerrit) [software/gerrit/plugins/wikimedia] - https://gerrit.wikimedia.org/r/490225
[00:27:49] !log merge VRRP Icinga Check
[00:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:56] come on jenkins
[00:29:01] thcipriani: seriously 28 minutes+ :/
[00:29:35] this is...a surprising amount of time.
[00:30:40] usually this test takes 10 minutes https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php70-docker/buildTimeTrend
[00:32:13] Operations, Analytics, Discovery, Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (Nuria) @fgiunchedi thoughts on this? looks like we are talking about 10-100 G files, not quite Terabytes
[00:34:37] (PS4) Paladox: Add support for cherry picking with merge conflicts from the UI (PolyGerrit) [software/gerrit/plugins/wikimedia] - https://gerrit.wikimedia.org/r/490225
[00:34:43] Both of the patches I +2'd scheduled the same resource intensive job on the same node at the same time.
[00:36:31] (PS5) Paladox: Add support for cherry picking with merge conflicts from the UI (PolyGerrit) [software/gerrit/plugins/wikimedia] - https://gerrit.wikimedia.org/r/490225
[00:37:16] jdlrobson: your thanks patch is staged on mwdebug1002, check please
[00:37:25] kart_: I'll stage your extension-list change now
[00:37:28] ok testing..
[00:37:29] (PS6) Paladox: Add support for cherry picking with merge conflicts from the UI (PolyGerrit) [software/gerrit/plugins/wikimedia] - https://gerrit.wikimedia.org/r/490225
[00:37:45] (CR) Thcipriani: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/489627 (https://phabricator.wikimedia.org/T213076) (owner: KartikMistry)
[00:37:53] thcipriani: cool
[00:38:23] thcipriani: you can sync! thanks!
[00:41:44] !log thcipriani@deploy1001 Synchronized php-1.33.0-wmf.17/extensions/Thanks/modules/ext.thanks.mobilediff.css: SWAT: [[gerrit:490199|Follow ups to I807f729c1b1a9e9b5952685bb18f540f81d70f47]] (duration: 00m 55s)
[00:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:42:03] ^ jdlrobson css done, will get the l10n done with the big scap sync forthcoming
[00:42:42] (PS4) Thcipriani: Add ExternalGuidance extension [mediawiki-config] - https://gerrit.wikimedia.org/r/489627 (https://phabricator.wikimedia.org/T213076) (owner: KartikMistry)
[00:42:49] (CR) Thcipriani: Add ExternalGuidance extension [mediawiki-config] - https://gerrit.wikimedia.org/r/489627 (https://phabricator.wikimedia.org/T213076) (owner: KartikMistry)
[00:43:01] (CR) Thcipriani: [C: +2] "SWAT (again)" [mediawiki-config] - https://gerrit.wikimedia.org/r/489627 (https://phabricator.wikimedia.org/T213076) (owner: KartikMistry)
[00:43:06] Operations, monitoring, netops, Patch-For-Review: Icinga check for VRRP - https://phabricator.wikimedia.org/T150264 (ayounsi) Open→Resolved a:faidon→ayounsi Confirmed working: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=vrrp
[00:44:04] (Merged) jenkins-bot: Add ExternalGuidance extension [mediawiki-config] - https://gerrit.wikimedia.org/r/489627 (https://phabricator.wikimedia.org/T213076) (owner: KartikMistry)
[00:47:36] !log thcipriani@deploy1001 Started scap: SWAT: [[gerrit:489627|Add ExternalGuidance extension]] T213076 (part I: build l10n and sync code)
[00:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:39] T213076: ExternalGuidance: Initial limited deployment - https://phabricator.wikimedia.org/T213076
[00:48:32] ^ kart_ started sync, config won't go live just yet, I just want to get the l10n rebuilt and the code on all the appservers. I'll sync-file the IS.php and CS.php files in next step (should be a fast step I think)
[00:48:43] (CR) jenkins-bot: Add ExternalGuidance extension [mediawiki-config] - https://gerrit.wikimedia.org/r/489627 (https://phabricator.wikimedia.org/T213076) (owner: KartikMistry)
[00:49:43] thcipriani: OK!
[00:49:55] thcipriani: ping me when they are up on debug.
[00:50:01] kart_: will do!
[00:51:17] Operations, serviceops, Core Platform Team (Session Management Service (CDP2)), Core Platform Team Kanban (Doing), and 4 others: Create puppet role for session storage service - https://phabricator.wikimedia.org/T215883 (mobrovac)
[00:57:43] Operations, MediaWiki-Cache, serviceops, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 5 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (mobrovac)
[01:01:21] syncing proxies currently, FYI
[01:04:57] cool
[01:09:33] cdb rebuild...almost the last step of scap sync
[01:11:19] ugh, 60 second timeouts
[01:11:26] eh!
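The rollout order worked out in the exchange above (extension code present in every live branch, then the extension-list change, then a full scap to build l10n and sync code, then a check on mwdebug, then the two config files) can be written down as a small ordered checklist. This is only a sketch: the step labels and the `check_order` helper are hypothetical names invented for illustration, and none of this is part of scap or any Wikimedia tooling.

```python
# Hypothetical checklist of the rollout order discussed in the log.
# The strings are descriptive labels for a human, not real scap commands.
ROLLOUT_STEPS = [
    "extension code present in php-1.33.0-wmf.16 and php-1.33.0-wmf.17",
    "extension added to extension-list",
    "full scap (build l10n, sync code to all appservers)",
    "config verified on mwdebug",
    "sync wmf-config/InitialiseSettings.php",
    "sync wmf-config/CommonSettings.php",
]


def check_order(performed):
    """Return the first step that was done out of order, or None if the order is fine.

    `performed` is a list of labels taken from ROLLOUT_STEPS, in the order
    they were actually carried out. Steps may be skipped, but not reordered.
    """
    positions = [ROLLOUT_STEPS.index(step) for step in performed]
    for earlier, later in zip(positions, positions[1:]):
        if later < earlier:
            return ROLLOUT_STEPS[later]
    return None


if __name__ == "__main__":
    # Running the full scap before touching extension-list is flagged.
    wrong = [ROLLOUT_STEPS[0], ROLLOUT_STEPS[2], ROLLOUT_STEPS[1]]
    print(check_order(wrong))
```

The checklist only encodes ordering; it deliberately says nothing about how each step is executed, since the log itself leaves those details to the deployer.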
[01:12:44] kart_: ongoing issue, just makes logs harder to read, not a code issue
[01:12:59] (since no code has gone out just yet :))
[01:13:16] PROBLEM - High CPU load on API appserver on mw1347 is CRITICAL: CRITICAL - load average: 73.61, 32.18, 20.32
[01:14:30] RECOVERY - High CPU load on API appserver on mw1347 is OK: OK - load average: 30.98, 28.55, 20.00
[01:14:52] thcipriani: :)
[01:15:27] !log thcipriani@deploy1001 Finished scap: SWAT: [[gerrit:489627|Add ExternalGuidance extension]] T213076 (part I: build l10n and sync code) (duration: 27m 51s)
[01:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:15:32] T213076: ExternalGuidance: Initial limited deployment - https://phabricator.wikimedia.org/T213076
[01:15:34] I am getting load issues on user creation logs.
[01:15:51] (loading info from 2017/2018)
[01:16:05] Could it be related?
[01:16:18] AFAIK those don't do API requests
[01:16:38] kart_: well I see l10n https://en.wikipedia.org/wiki/MediaWiki:Externalguidance-desc
[01:16:44] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[01:16:49] like https://en.wikipedia.org/wiki/Special:Log?type=newusers&user=&page=&wpdate=2017-06-11&tagfilter=
[01:17:09] does not load for me
[01:17:58] thcipriani: are we good?
[01:18:08] kart_: your config is staged on mwdebug1002
[01:18:11] check please
[01:18:18] Sure
[01:19:34] That spike seems significant
[01:19:50] Krinkle: indeed it does
[01:19:57] Bsadowski1: Loads for me, just slow
[01:19:58] Krinkle: just deployed l10n, however
[01:20:07] that's only taken us down a few times
[01:20:09] :)
[01:20:12] :)
[01:20:21] is that the hhvm cache situation?
[01:20:32] brennen: that is my suspicion
[01:20:55] Reedy: I got "PHP fatal error: entire web request took longer than 60 seconds and timed out"
[01:20:58] hhvm load high on machines causing problems
[01:21:13] mainly the longer than 60 second error
[01:22:24] thcipriani: related to our deployment by chance? or general issue?
[01:22:58] thcipriani: extension looks good. further testing needed throughout day but we can only confirm later. nothing breaks so far.
[01:24:03] kart_: I believe this is a general issue. It may have been happening previously, but the 60 second timeout being turned on (rightfully so) has surfaced it in a big way has been my recent experience.
[01:24:42] logs are seemingly returning to normal now
[01:24:49] nice
[01:25:00] long live php 7
[01:25:00] thcipriani: in that case, go ahead and deploy.
[01:25:14] Reedy: I hope so!
[01:26:30] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[01:26:54] kart_: ok, going
[01:29:02] (PS7) Paladox: Add support for cherry picking with merge conflicts from the UI (PolyGerrit) [software/gerrit/plugins/wikimedia] - https://gerrit.wikimedia.org/r/490225
[01:30:02] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:489627|Add ExternalGuidance extension]] T213076 (part 2) (duration: 00m 53s)
[01:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:30:09] T213076: ExternalGuidance: Initial limited deployment - https://phabricator.wikimedia.org/T213076
[01:31:40] This is the first time I've witnessed the errors in real-time. That is a lot of errors, though. 2000 failed backend requests in 12 minutes or so.
[01:31:52] !log thcipriani@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:489627|Add ExternalGuidance extension]] T213076 (part 3) (duration: 00m 53s)
[01:31:58] ^ kart_ you're all live
[01:32:07] (PS8) Paladox: Add support for cherry picking with merge conflicts from the UI (PolyGerrit) [software/gerrit/plugins/wikimedia] - https://gerrit.wikimedia.org/r/490225
[01:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:32:15] thcipriani: Thanks a lot! Will get custom Garam Masala for you (ie https://en.wikipedia.org/wiki/Garam_masala)
[01:32:35] Krinkle: it has made deployment more difficult recently
[01:33:16] kart_: awesome! I look forward to it!
[01:33:19] kart_: it's also on beta cluster now btw, https://simple.m.wikipedia.beta.wmflabs.org/wiki/Main_Page just FYI :) (master branch)
[01:34:40] Krinkle: never expected any contents on Beta!
[01:37:13] we normally enable new extensions there first. At least it still happens automatically at the same time, when the config change is merged in Gerrit.
[01:37:42] And any new merges in master for the extension will also automatically get deployed there every few minutes.
[01:40:46] (CR) Ppchelko: [C: +1] "LGTM. I think we can merge it before the MW train has finished to allow for more testing." [puppet] - https://gerrit.wikimedia.org/r/489211 (https://phabricator.wikimedia.org/T214706) (owner: Bmansurov)
[01:48:27] Krinkle: noted, but in this case, testing was difficult or rather pointless. Just like CX.
[01:51:05] kart_: I think there's a large amount of potential errors prevented by running the code without impacting real users before enabling it. That's basically what you did with mwdebug, so that was good. Although most of it could've also been done on beta. The other aspect (which we skipped for this extension) was how e.g. Varnish reacts to it, or in general what happens if it's online for more than a few minutes and some edits happen in the
[01:51:05] background. that could only be done on beta cluster, or on test wiki in prod.
[01:53:40] PROBLEM - MegaRAID on cloudvirt1018 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[01:53:52] ACKNOWLEDGEMENT - MegaRAID on cloudvirt1018 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T215989
[01:53:56] Operations, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T215989 (ops-monitoring-bot)
[02:30:57] Operations, ops-esams: Setup new access switches - https://phabricator.wikimedia.org/T184065 (ayounsi)
[02:34:55] Operations, ops-esams: Repurpose csw2-oe14/15 and lab-ex4200 as msw - https://phabricator.wikimedia.org/T215991 (ayounsi) p:Triage→Normal
[02:35:20] Operations, ops-esams: Setup new access switches - https://phabricator.wikimedia.org/T184065 (ayounsi)
[02:35:22] Operations, ops-esams: Repurpose csw2-oe14/15 and lab-ex4200 as msw - https://phabricator.wikimedia.org/T215991 (ayounsi)
[02:38:41] Operations, Icinga, monitoring: icinga really needs to check puppet run success of passive icinga hosts - https://phabricator.wikimedia.org/T215848 (colewhite) p:Triage→Normal
[02:44:40] Operations, ops-esams, netops: set up cr3-esams - https://phabricator.wikimedia.org/T174616 (ayounsi)
[03:04:52] Operations, monitoring, Goal, Patch-For-Review: Upgrade production prometheus-node-exporter to >= 0.16 - https://phabricator.wikimedia.org/T213708 (colewhite)
[03:35:21] Operations, MediaWiki-Database, Performance-Team, Wikimedia-Logstash, and 5 others: MediaWiki errors overloading logstash - https://phabricator.wikimedia.org/T215611 (bd808) >>! In T215611#4938229, @CDanis wrote: > I don't feel nearly well-versed in PHP/PSR-3/Monolog nor the MW codebase to sugges...
[03:54:19] (PS1) Andrew Bogott: Revert "Cloud vms: enable a default tty" [puppet] - https://gerrit.wikimedia.org/r/490272
[03:54:48] (PS2) Andrew Bogott: Revert "Cloud vms: enable a default tty" [puppet] - https://gerrit.wikimedia.org/r/490272
[04:01:57] (CR) Andrew Bogott: [C: +2] Revert "Cloud vms: enable a default tty" [puppet] - https://gerrit.wikimedia.org/r/490272 (owner: Andrew Bogott)
[06:02:03] (PS2) Marostegui: dbstore1005: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/490068 (https://phabricator.wikimedia.org/T210478)
[06:04:13] (PS1) Marostegui: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/490275 (https://phabricator.wikimedia.org/T210713)
[06:04:33] (CR) Marostegui: [C: +2] dbstore1005: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/490068 (https://phabricator.wikimedia.org/T210478) (owner: Marostegui)
[06:11:08] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/490275 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[06:12:04] !log Stop MySQL on db2085 to keep debugging kernel issues - T214840
[06:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:07] T214840: db2085/db1106 don't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840
[06:15:41] (Merged) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/490275 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[06:17:12] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1104 (duration: 01m 07s)
[06:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:21] !log Deploy schema change on db1104 - T210713
[06:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:24] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[06:19:21] ACKNOWLEDGEMENT - MegaRAID on cloudvirt1018 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T215994
[06:19:25] Operations, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T215994 (ops-monitoring-bot)
[06:20:23] (PS1) Vgutierrez: Release 0.10 [software/certcentral] - https://gerrit.wikimedia.org/r/490277 (https://phabricator.wikimedia.org/T215925)
[06:21:14] (CR) Vgutierrez: [C: +2] Release 0.10 [software/certcentral] - https://gerrit.wikimedia.org/r/490277 (https://phabricator.wikimedia.org/T215925) (owner: Vgutierrez)
[06:22:25] Operations, Cloud-VPS, Toolforge, Traffic, Patch-For-Review: Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response) - https://phabricator.wikimedia.org/T213475 (Kelson) @akosiaris Problem seems to be fixed from our end. Thank you very much.
[06:23:01] (Merged) jenkins-bot: Release 0.10 [software/certcentral] - https://gerrit.wikimedia.org/r/490277 (https://phabricator.wikimedia.org/T215925) (owner: Vgutierrez)
[06:25:06] (CR) jenkins-bot: Release 0.10 [software/certcentral] - https://gerrit.wikimedia.org/r/490277 (https://phabricator.wikimedia.org/T215925) (owner: Vgutierrez)
[06:25:11] (PS1) Vgutierrez: acme-chief: Bump to buster [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/490278 (https://phabricator.wikimedia.org/T215925)
[06:25:13] (PS1) Vgutierrez: Release 0.10 [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/490279 (https://phabricator.wikimedia.org/T215925)
[06:26:04] (CR) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/490275 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[06:26:16] (CR) Vgutierrez: [C: +2] acme-chief: Bump to buster [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/490278 (https://phabricator.wikimedia.org/T215925) (owner: Vgutierrez)
[06:26:22] (CR) Vgutierrez: [C: +2] Release 0.10 [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/490279 (https://phabricator.wikimedia.org/T215925) (owner: Vgutierrez)
[06:28:00] Operations, Analytics, Research-management, User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (elukey) Good news: it seems that Python 3.7 support should be available with 1.13, the next release - https://github.com/tensorflow/tensorflow/issues/20517
[06:28:25] (PS1) Vgutierrez: debian: Add release 0.10 to changelog [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/490280 (https://phabricator.wikimedia.org/T215925)
[06:29:17] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-intel-microcode]
[06:30:50] (CR) Vgutierrez: [C: +2] debian: Add release 0.10 to changelog [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/490280 (https://phabricator.wikimedia.org/T215925) (owner: Vgutierrez)
[06:34:55] (Merged) jenkins-bot: acme-chief: Bump to buster [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/490278 (https://phabricator.wikimedia.org/T215925) (owner: Vgutierrez)
[06:35:06] (Merged) jenkins-bot: Release 0.10 [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/490279 (https://phabricator.wikimedia.org/T215925) (owner: Vgutierrez)
[06:35:28] (Merged) jenkins-bot: debian: Add release 0.10 to changelog [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/490280 (https://phabricator.wikimedia.org/T215925) (owner: Vgutierrez)
[06:36:53] (CR) jenkins-bot: acme-chief: Bump to buster [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/490278 (https://phabricator.wikimedia.org/T215925) (owner: Vgutierrez)
[06:37:04] Operations, DBA, Packaging, Patch-For-Review: db2085/db1106 don't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (Marostegui) db2085: So I can confirm that the BIOS setting for Serial Communication is being sent to COM2 (which is ttyS1). Which is the same as: ` linux...
[06:37:07] (CR) jenkins-bot: Release 0.10 [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/490279 (https://phabricator.wikimedia.org/T215925) (owner: Vgutierrez)
[06:37:09] (CR) jenkins-bot: debian: Add release 0.10 to changelog [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/490280 (https://phabricator.wikimedia.org/T215925) (owner: Vgutierrez)
[06:46:40] !log uploaded acme-chief 0.10 to apt.wikimedia.org (buster) - T215925
[06:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:44] T215925: Upgrade acme-chief to run in debian buster - https://phabricator.wikimedia.org/T215925
[06:49:54] ACKNOWLEDGEMENT - MegaRAID on cloudvirt1018 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T215996
[06:49:59] Operations, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T215996 (ops-monitoring-bot)
[06:56:36] Operations, DBA, Packaging, Patch-For-Review: db2085/db1106 don't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (Marostegui) db2085 with kernel 4.9.0-7-amd64 reboots, another FAIL at the 6th and 7th reboot (similar pattern as with kernel -9 at T214840#4948016): 1st reboot:...
[07:00:18] RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:01:52] Operations, DBA, Packaging, Patch-For-Review: db2085/db1106 don't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (Marostegui) Current BIOS setting: {F28207120}
[07:27:14] Operations, DBA, Packaging, Patch-For-Review: db2085/db1106 don't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (Marostegui) db2085: `debug` added to the kernel boot, to see if we catch something `` ` linux /boot/vmlinuz-4.9.0-7-amd64 root=UUID=63e5ddbd-3c18-4bf5-ad22-88458...
[07:30:30] ACKNOWLEDGEMENT - MegaRAID on cloudvirt1018 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T215997
[07:30:33] Operations, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T215997 (ops-monitoring-bot)
[07:30:38] (CR) Giuseppe Lavagetto: "The debian-glue failure is due to the use of "stretch-wikimedia" as a distribution in the changelog." [debs/python-etcd] - https://gerrit.wikimedia.org/r/490109 (owner: Giuseppe Lavagetto)
[07:50:28] Operations, Analytics, Research-management, User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (elukey) So I had to install all the packages listed above to make TensorFlow run, each time it was failing for a different missing lib. Now I am getting this: ` (tes...
[07:54:46] Operations, Analytics, Research-management, User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (elukey) ` root@stat1005:/home/elukey/test# /opt/rocm/bin/rocm-smi ======================== ROCm System Management Interface ========================...
[07:55:47] Operations, Analytics, Research-management, User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (MoritzMuehlenhoff) Tensorflow is also finding its way into Debian, BTW (currently only in experimental): https://packages.qa.debian.org/t/tensorflow.html
[07:59:07] (CR) WMDE-leszek: [C: -1] DNM Define Wikibase "entity sources" on beta commons (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/490108 (https://phabricator.wikimedia.org/T214557) (owner: WMDE-leszek)
[07:59:35] (PS1) Vgutierrez: authdns: Allow acme-chief servers to fulfill ACME DNS challenges [puppet] - https://gerrit.wikimedia.org/r/490286 (https://phabricator.wikimedia.org/T207389)
[08:01:04] Operations, ops-eqiad: ms-be1033 down and not powering up - https://phabricator.wikimedia.org/T215998 (fgiunchedi)
[08:01:13] Operations, ops-eqiad: ms-be1033 down and not powering up - https://phabricator.wikimedia.org/T215998 (fgiunchedi) p:Triage→High
[08:01:49] ACKNOWLEDGEMENT - Host ms-be1033 is DOWN: PING CRITICAL - Packet loss = 100% Filippo Giunchedi T215998
[08:04:20] (CR) Vgutierrez: [C: +1] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/14637/" [puppet] - https://gerrit.wikimedia.org/r/490286 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez)
[08:04:32] Operations, DBA, Packaging, Patch-For-Review: db2085/db1106 don't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (Marostegui) db2085 reboots with 4.9.0-7 with debug enabled - all fine: 1st reboot: OK 2nd reboot: OK 3rd reboot: OK 4th reboot: OK 5th reboot: OK 6th reboot: OK...
[08:04:38] Operations, ops-eqiad, monitoring, Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (Volans) @RobH thanks a lot for the tests and follow up! Just to clarify a detail regarding the "normal" reboot messages, at least in a couple of occasions (for the others I'm not f...
[08:06:28] PROBLEM - MariaDB Slave Lag: s8 on db1104 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6474.39 seconds
[08:06:46] (PS1) Filippo Giunchedi: prometheus: use eqiad for mirrormaker alerts [puppet] - https://gerrit.wikimedia.org/r/490287
[08:06:50] came out from downtime
[08:06:52] (PS4) Muehlenhoff: service::node: Stop supporting trusty/upstart [puppet] - https://gerrit.wikimedia.org/r/490090
[08:06:53] mmmh expected?
[08:06:54] going to downtime again
[08:06:55] ah ok
[08:06:57] ah
[08:07:16] ack
[08:07:22] sorry for the noise
[08:07:22] (CR) jerkins-bot: [V: -1] prometheus: use eqiad for mirrormaker alerts [puppet] - https://gerrit.wikimedia.org/r/490287 (owner: Filippo Giunchedi)
[08:08:12] (PS2) Filippo Giunchedi: prometheus: use eqiad for mirrormaker alerts [puppet] - https://gerrit.wikimedia.org/r/490287
[08:09:27] (PS12) Vgutierrez: acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389)
[08:09:36] elukey ottomata https://gerrit.wikimedia.org/r/c/operations/puppet/+/490287
[08:11:38] (CR) Elukey: [C: +1] prometheus: use eqiad for mirrormaker alerts [puppet] - https://gerrit.wikimedia.org/r/490287 (owner: Filippo Giunchedi)
[08:11:43] thanks!
[08:12:50] (CR) Filippo Giunchedi: [C: +2] prometheus: use eqiad for mirrormaker alerts [puppet] - https://gerrit.wikimedia.org/r/490287 (owner: Filippo Giunchedi)
[08:12:58] (PS3) Filippo Giunchedi: prometheus: use eqiad for mirrormaker alerts [puppet] - https://gerrit.wikimedia.org/r/490287
[08:12:59] np, that should do it
[08:13:43] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/490288
[08:13:55] Operations, LDAP-Access-Requests, SRE-Access-Requests: Access request: Ladsgroup to analytics-wmde-users - https://phabricator.wikimedia.org/T215938 (Addshore) I believe this has to be in #sre-access-requests
[08:14:30] marostegui: just to be clear and avoid any doubt... was the downtime expiration "correct"?
[08:14:41] asking because of recent icinga failover ;)
[08:15:38] volans: yeah, it was
[08:16:33] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/490288 (owner: Marostegui)
[08:16:33] RECOVERY - MariaDB Slave Lag: s8 on db1104 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[08:17:45] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/490288 (owner: Marostegui)
[08:18:50] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1104 (duration: 00m 53s)
[08:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:01] ack thx
[08:22:02] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/490288 (owner: Marostegui)
[08:24:27] Operations, DBA, Packaging, Patch-For-Review: db2085/db1106 don't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (Marostegui) db2085 reboots with 4.9.0-8 with debug enabled: 1st reboot: OK 2nd reboot: FAIL (unfortunately no kernel error trace just an automatic reboot)
How...
[08:40:36] Operations, DBA, Packaging, Patch-For-Review: db2085/db1106 don't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (Marostegui) db2085 got stuck when booting up on: ` [ 0.560579] x86: Booting SMP configuration: [ 0.565246] .... node #1, CPUs: #1 [ 0.674090] .....
[08:56:58] PROBLEM - Host stat1005 is DOWN: PING CRITICAL - Packet loss = 100%
[08:58:01] RECOVERY - Host stat1005 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms
[08:58:07] elukey: was that you? ^^
[08:59:19] yessss
[08:59:21] sorry :)
[08:59:29] I am fighting with the GPU
[08:59:34] it is winning of course
[09:00:00] Operations, ops-codfw, DBA, Patch-For-Review: db2085/db1106 don't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (Marostegui) a:Papaul After power cycling db2085, this is what happened: reboot: OK reboot: OK reboot: FAIL reboot: FAIL reboot: FAIL Error on post: ` Enumer...
[09:00:45] elukey: expected :D
[09:02:06] (PS5) Muehlenhoff: service::node: Stop supporting trusty/upstart [puppet] - https://gerrit.wikimedia.org/r/490090
[09:02:25] ACKNOWLEDGEMENT - MegaRAID on cloudvirt1018 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T216004
[09:02:29] Operations, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T216004 (ops-monitoring-bot)
[09:05:20] arturo: ^^
[09:10:59] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 324.58 seconds
[09:11:58] Operations, Analytics, Research-management, User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (elukey) I switched to the `rocm-dkml` kernel drivers and followed instructions for https://wiki.archlinux.org/index.php/AMDGPU#Set_required_module_parameters (Sea Is...
[09:13:07] (PS1) Marostegui: db-eqiad.php: Depool db1106 [mediawiki-config] - https://gerrit.wikimedia.org/r/490291 (https://phabricator.wikimedia.org/T214840)
[09:15:08] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1106 [mediawiki-config] - https://gerrit.wikimedia.org/r/490291 (https://phabricator.wikimedia.org/T214840) (owner: Marostegui)
[09:16:06] Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (ayounsi) I started to format the spreadsheat in a way that can be imported in Netbox: https://docs.google.com/spreadsheets/d/1FKYVQJePjTQ7nVwYv4oDC6Gszk7RLrkq5ySN0fjvSoY/edit#gid=1665726692 The plan is to then delete the e...
[09:20:41] (Merged) jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - https://gerrit.wikimedia.org/r/490291 (https://phabricator.wikimedia.org/T214840) (owner: Marostegui)
[09:21:05] PROBLEM - Host stat1005 is DOWN: PING CRITICAL - Packet loss = 100%
[09:21:47] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:21:50] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1106 T214840 (duration: 00m 53s)
[09:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:53] T214840: db2085/db1106 don't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840
[09:22:10] !log Stop MySQL on db1106 - T214840
[09:22:11] stat1005 is me :)
[09:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:13] RECOVERY - Host stat1005 is UP: PING OK - Packet loss = 0%, RTA = 36.20 ms
[09:23:36] Operations, monitoring, Patch-For-Review: EDAC events not being reported by node-exporter? - https://phabricator.wikimedia.org/T214529 (fgiunchedi) >>! In T214529#4947253, @CDanis wrote: > Thanks @fgiunchedi, that's a good thought!
However I couldn't find anything in the SEL for a selection of serve...
[09:26:41] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:26:55] !log labsdb1005 stopped mysql
[09:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:28] Operations, Analytics, Discovery, Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (fgiunchedi) >>! In T213976#4949742, @Nuria wrote: > @fgiunchedi thoughts on this? looks like we are talking about 1...
[09:28:36] (PS3) Gilles: Set expiry headers on thumbnails [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661)
[09:28:40] (CR) jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - https://gerrit.wikimedia.org/r/490291 (https://phabricator.wikimedia.org/T214840) (owner: Marostegui)
[09:28:42] (CR) Gilles: Set expiry headers on thumbnails (1 comment) [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[09:33:50] !log labsdb1005 rebooted server
[09:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:27] ACKNOWLEDGEMENT - Host labsdb1005 is DOWN: PING CRITICAL - Packet loss = 100% GTirloni Restarted
[09:35:32] (CR) Filippo Giunchedi: [C: +1] Set expiry headers on thumbnails [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[09:37:13] (CR) Muehlenhoff: [C: +1] "One comment inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/490069 (owner: Alexandros Kosiaris)
[09:42:26] (CR) Alexandros Kosiaris: [C: +1] "Looks fine to me, but one question I got is why switch from the old 240 secs to 200?"
[puppet] - https://gerrit.wikimedia.org/r/490077 (owner: Giuseppe Lavagetto)
[09:43:03] Operations, Analytics, Discovery, Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (Miriam) > In the hundreds of megabytes I believe. @Halfak, @EBernhardson, @Miriam, @bmansurov, is this right? Will...
[09:44:33] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga2001 is CRITICAL: 57.7 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:46:47] !log installing golang security updates
[09:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:09] PROBLEM - Host stat1005 is DOWN: PING CRITICAL - Packet loss = 100%
[09:54:59] RECOVERY - Host stat1005 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms
[09:55:25] RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 41.95 seconds
[09:59:25] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga2001 is OK: (C)60 le (W)70 le 79.65 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[10:03:26] (PS1) Volans: netbox: remove admins email [puppet] - https://gerrit.wikimedia.org/r/490292
[10:05:21] Operations, ops-codfw, DBA, Patch-For-Review: db2085/db1106 don't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (Marostegui) db1106 with 4.9.0-8 with debug enabled on the kernel, reboots sequence: 1st reboot: FAIL 2nd reboot: OK 3rd reboot: OK 4th reboot: FAIL 5th reboot: O...
[10:05:31] PROBLEM - Host stat1005 is DOWN: PING CRITICAL - Packet loss = 100%
[10:05:48] elukey: can I suggest a few-days long downtime for it?
;)
[10:05:59] RECOVERY - Host stat1005 is UP: PING OK - Packet loss = 0%, RTA = 36.26 ms
[10:06:36] sure
[10:06:59] it didn't seem to me a big problem :)
[10:08:03] not a big problem, just to avoid people wondering what it is and pinging you to check if is expected or not ;)
[10:08:14] or we simply disable notifications in general, as it's currently just a test host
[10:11:38] (PS1) Muehlenhoff: Enable hwraid repos for wikimedia-buster [puppet] - https://gerrit.wikimedia.org/r/490294
[10:13:14] (CR) Filippo Giunchedi: [C: +1] "+Moritz too, LGTM as it is only temporary" [puppet] - https://gerrit.wikimedia.org/r/490203 (https://phabricator.wikimedia.org/T213708) (owner: Cwhite)
[10:16:00] (PS2) Volans: netbox: set ADMINS to default empty list [puppet] - https://gerrit.wikimedia.org/r/490292
[10:18:56] (PS1) Arturo Borrero Gonzalez: hiera: role: wmcs: monitoring: introduce password placeholders [puppet] - https://gerrit.wikimedia.org/r/490295 (https://phabricator.wikimedia.org/T215968)
[10:21:34] (CR) Arturo Borrero Gonzalez: [C: +2] "Compiler is now happy https://puppet-compiler.wmflabs.org/compiler1001/14640/" [puppet] - https://gerrit.wikimedia.org/r/490295 (https://phabricator.wikimedia.org/T215968) (owner: Arturo Borrero Gonzalez)
[10:22:37] (CR) Muehlenhoff: "Looks fine, but I wonder if it's even necessary?
We won't install additional trusty systems at this point, so with the status quo, the exi" [puppet] - https://gerrit.wikimedia.org/r/490203 (https://phabricator.wikimedia.org/T213708) (owner: Cwhite)
[10:23:51] (PS1) DCausse: Plugins for elastic 5.6.14 [software/elasticsearch/plugins] (5.x) - https://gerrit.wikimedia.org/r/490296 (https://phabricator.wikimedia.org/T215932)
[10:24:29] (PS1) Arturo Borrero Gonzalez: Revert "hiera: role: wmcs: monitoring: introduce password placeholders" [puppet] - https://gerrit.wikimedia.org/r/490297
[10:26:42] (PS1) Arturo Borrero Gonzalez: hiera: role: wmcs: monitoring: introduce password placeholders [labs/private] - https://gerrit.wikimedia.org/r/490298 (https://phabricator.wikimedia.org/T215968)
[10:26:45] (CR) Arturo Borrero Gonzalez: [C: +2] Revert "hiera: role: wmcs: monitoring: introduce password placeholders" [puppet] - https://gerrit.wikimedia.org/r/490297 (owner: Arturo Borrero Gonzalez)
[10:28:14] (PS1) ArielGlenn: showcrcs: util to write out crc information from a bzip2 file [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/490299 (https://phabricator.wikimedia.org/T216009)
[10:28:33] (CR) Arturo Borrero Gonzalez: [V: +2 C: +2] hiera: role: wmcs: monitoring: introduce password placeholders [labs/private] - https://gerrit.wikimedia.org/r/490298 (https://phabricator.wikimedia.org/T215968) (owner: Arturo Borrero Gonzalez)
[10:28:59] (CR) GTirloni: "Just for learning purposes, why are these variables defined here instead of in labs/private?
I expected that repo to contain the bogus sec" [puppet] - https://gerrit.wikimedia.org/r/490295 (https://phabricator.wikimedia.org/T215968) (owner: Arturo Borrero Gonzalez)
[10:29:10] Operations, monitoring, Goal, Patch-For-Review: Upgrade production prometheus-node-exporter to >= 0.16 - https://phabricator.wikimedia.org/T213708 (Joe) I've just noticed, based on a diffscan email, that the new version of prometheus-node-exporter ALSO binds to `:::9100` on ipv6 and listens to al...
[10:30:01] (CR) Volans: "compiler output: https://puppet-compiler.wmflabs.org/compiler1002/14639/" [puppet] - https://gerrit.wikimedia.org/r/490292 (owner: Volans)
[10:30:55] (PS3) Alexandros Kosiaris: default egress policy: Allow kafka/gerrit [puppet] - https://gerrit.wikimedia.org/r/490083
[10:32:48] Operations, Puppet, Discovery-Search, Maps: Fix maps puppet to make sure apt-get update runs after configuration change - https://phabricator.wikimedia.org/T214073 (Joe) @gehel I would say what's missing is a clear dependency between the installation of the cassandra package and the apt-get updat...
[10:33:50] (CR) Alexandros Kosiaris: [C: +2] "Just added gerrit and tested the rules in staging. Seems to work fine. Note that our code will default to allow all if it fails to load th" [puppet] - https://gerrit.wikimedia.org/r/490083 (owner: Alexandros Kosiaris)
[10:37:50] (PS2) Muehlenhoff: Enable hwraid repos for wikimedia-buster [puppet] - https://gerrit.wikimedia.org/r/490294
[10:39:00] (PS1) Volans: CHANGELOG: add changelogs for release v0.0.14 [software/spicerack] - https://gerrit.wikimedia.org/r/490301
[10:39:46] Operations, Analytics, Research-management, User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (elukey) I have booted again without `amdgpu.dc=0` to reduce the number of variables (since basically it should be related to how to handle an external screen, harmle...
[10:41:22] Operations, Analytics, Research-management, User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (MoritzMuehlenhoff) Maybe try 4.20-1 from experimental to narrow the kernel oops down?
[10:42:29] (PS1) Gehel: cassandra: package should be installed after apt-get update [puppet] - https://gerrit.wikimedia.org/r/490303 (https://phabricator.wikimedia.org/T214073)
[10:43:23] (PS2) Gehel: cassandra: package should be installed after apt-get update [puppet] - https://gerrit.wikimedia.org/r/490303 (https://phabricator.wikimedia.org/T214073)
[10:49:09] (CR) jerkins-bot: [V: -1] CHANGELOG: add changelogs for release v0.0.14 [software/spicerack] - https://gerrit.wikimedia.org/r/490301 (owner: Volans)
[10:50:16] (PS1) Filippo Giunchedi: prometheus: use web_listen_address with node-exporter 0.17 [puppet] - https://gerrit.wikimedia.org/r/490304
[10:53:21] (PS2) Filippo Giunchedi: prometheus: use web_listen_address with node-exporter 0.17 [puppet] - https://gerrit.wikimedia.org/r/490304
[10:54:19] (PS2) Alexandros Kosiaris: Add a simple README.md [deployment-charts] - https://gerrit.wikimedia.org/r/490027
[10:54:35] (CR) Alexandros Kosiaris: "@Ariel good point, added a line about it" (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/490027 (owner: Alexandros Kosiaris)
[10:55:44] (PS2) Volans: CHANGELOG: add changelogs for release v0.0.14 [software/spicerack] - https://gerrit.wikimedia.org/r/490301
[10:57:29] (CR) Filippo Giunchedi: [C: +1] "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/490203 (https://phabricator.wikimedia.org/T213708) (owner: Cwhite)
[10:58:39] (CR) Filippo Giunchedi: [C: +2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/14644/lvs5002.eqsin.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/490304 (owner: Filippo Giunchedi)
[10:59:08] (PS3) Filippo Giunchedi: prometheus: use web_listen_address
with node-exporter 0.17 [puppet] - https://gerrit.wikimedia.org/r/490304 (https://phabricator.wikimedia.org/T213708)
[11:03:16] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v0.0.14 [software/spicerack] - https://gerrit.wikimedia.org/r/490301 (owner: Volans)
[11:03:42] (CR) Filippo Giunchedi: [C: +1] cassandra: package should be installed after apt-get update [puppet] - https://gerrit.wikimedia.org/r/490303 (https://phabricator.wikimedia.org/T214073) (owner: Gehel)
[11:04:26] Operations, monitoring, Goal, Patch-For-Review: Upgrade production prometheus-node-exporter to >= 0.16 - https://phabricator.wikimedia.org/T213708 (fgiunchedi) >>! In T213708#4950462, @Joe wrote: > I've just noticed, based on a diffscan email, that the new version of prometheus-node-exporter ALSO...
[11:05:02] (PS3) Muehlenhoff: Enable hwraid repos for wikimedia-buster [puppet] - https://gerrit.wikimedia.org/r/490294
[11:06:14] (PS4) Muehlenhoff: Enable hwraid repos for wikimedia-buster [puppet] - https://gerrit.wikimedia.org/r/490294
[11:06:53] (PS5) Muehlenhoff: Enable hwraid repos for wikimedia-buster [puppet] - https://gerrit.wikimedia.org/r/490294
[11:07:20] (CR) Alexandros Kosiaris: "I guess it depends highly on the licenses and the interactions between the charts." [deployment-charts] - https://gerrit.wikimedia.org/r/490028 (owner: Alexandros Kosiaris)
[11:07:55] (CR) Muehlenhoff: [C: +2] Enable hwraid repos for wikimedia-buster [puppet] - https://gerrit.wikimedia.org/r/490294 (owner: Muehlenhoff)
[11:08:55] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v0.0.14 [software/spicerack] - https://gerrit.wikimedia.org/r/490301 (owner: Volans)
[11:09:40] (CR) Volans: "It's a lot of code and it's hard to check if it was "copied" over correctly.
Globally it looks sane to me, I've left some nitpick/minor co" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez)
[11:11:34] !log installing postgis security updates
[11:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:47] (PS1) Volans: Upstream release v0.0.14 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/490306
[11:12:23] (PS2) Jcrespo: mariadb: Depool db1120 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/490020
[11:12:56] (CR) Elukey: [C: +1] cassandra: package should be installed after apt-get update [puppet] - https://gerrit.wikimedia.org/r/490303 (https://phabricator.wikimedia.org/T214073) (owner: Gehel)
[11:13:08] (CR) jenkins-bot: CHANGELOG: add changelogs for release v0.0.14 [software/spicerack] - https://gerrit.wikimedia.org/r/490301 (owner: Volans)
[11:14:21] Operations, MediaWiki-Cache, serviceops, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 5 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (Joe) @jijiki what is the total number of items stored...
[11:14:26] Operations, ops-eqiad: kafka1012 power supply alerts - https://phabricator.wikimedia.org/T216011 (fgiunchedi)
[11:14:55] ACKNOWLEDGEMENT - IPMI Sensor Status on kafka1012 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] Filippo Giunchedi T216011
[11:15:22] (CR) Jcrespo: [C: +2] mariadb: Depool db1120 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/490020 (owner: Jcrespo)
[11:16:27] (CR) Filippo Giunchedi: [C: +1] "LGTM, for the general failure of the previous approach we should check apt-get update status code with puppet agent --debug next time" [puppet] - https://gerrit.wikimedia.org/r/490204 (https://phabricator.wikimedia.org/T213708) (owner: Cwhite)
[11:16:30] (Merged) jenkins-bot: mariadb: Depool db1120 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/490020 (owner: Jcrespo)
[11:16:53] (CR) Filippo Giunchedi: [C: +1] hiera: upgrade prometheus-node-exporter to 0.17 in esams [puppet] - https://gerrit.wikimedia.org/r/490229 (https://phabricator.wikimedia.org/T213708) (owner: Cwhite)
[11:19:35] (CR) ArielGlenn: [C: +1] "Looks good now, thumbs up from me" [deployment-charts] - https://gerrit.wikimedia.org/r/490027 (owner: Alexandros Kosiaris)
[11:20:00] (CR) jenkins-bot: mariadb: Depool db1120 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/490020 (owner: Jcrespo)
[11:22:21] Operations, ExternalGuidance, Traffic, MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (dr0ptp4kt) Thanks, @santhosh !
[11:23:26] (CR) Volans: [C: +2] Upstream release v0.0.14 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/490306 (owner: Volans)
[11:23:50] Operations, Traffic, VisualEditor, Wikimedia-Apache-configuration: Visual Editor gets stuck opening article (net::ERR_SPDY_PROTOCOL_ERROR 200) - https://phabricator.wikimedia.org/T213214 (matmarex)
[11:25:21] Operations, Traffic, VisualEditor, Wikimedia-Apache-configuration: Visual Editor gets stuck opening article (net::ERR_SPDY_PROTOCOL_ERROR 200) - https://phabricator.wikimedia.org/T213214 (matmarex) >>! In T214534#4950440, @Schnark wrote: > https://stackoverflow.com/questions/33867014/what-does-er...
[11:29:38] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/490203 (https://phabricator.wikimedia.org/T213708) (owner: Cwhite)
[11:29:43] (Merged) jenkins-bot: Upstream release v0.0.14 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/490306 (owner: Volans)
[11:33:44] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1120 (duration: 00m 53s)
[11:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:27] (CR) Mathew.onipe: Add wdqs data transfer cookbook (6 comments) [cookbooks] - https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: Mathew.onipe)
[11:38:15] !log uploaded spicerack_0.0.14-1_amd64.deb to apt.wikimedia.org stretch-wikimedia
[11:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:18] Operations, Analytics, Research-management, User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (elukey) Installed 4.20 from experimental but it seems that the kfd driver is not shipped: ` elukey@stat1005:~$ find /lib/modules/ -type f -name '*.ko' | grep kfd /l...
[11:40:51] Operations, Analytics, Research-management, User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (MoritzMuehlenhoff) >>! In T148843#4950613, @elukey wrote: > Installed 4.20 from experimental but it seems that the kfd driver is not shipped: > > ` > elukey@stat100...
[11:41:53] !log upgraded spicerack on cumin[12]001 to v0.0.14
[11:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:58] onimisionipe, gehel ^^^
[11:42:19] Operations, ops-eqiad, Patch-For-Review: rack/setup/install backup1001 - https://phabricator.wikimedia.org/T196478 (akosiaris) >>! In T196478#4919507, @Cmjohnson wrote: > @akosiaris Sorry for the really late response to this....the task got buried. No, I don't know why mgmt would not be working now u...
[11:43:24] (CR) Alexandros Kosiaris: [C: +1] Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) (owner: Jbond)
[11:43:49] !log installing golang updates on jessie
[11:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:20] (PS2) Elukey: role::analytics_test_cluster::coord: add kafkatee instance [puppet] - https://gerrit.wikimedia.org/r/490067 (https://phabricator.wikimedia.org/T212259)
[11:48:12] (PS2) Mathew.onipe: Add wdqs data transfer cookbook [cookbooks] - https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401)
[11:48:30] volans: Thanks!
[11:49:37] !log stop and upgrade db1120
[11:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:15] (CR) jerkins-bot: [V: -1] Add wdqs data transfer cookbook [cookbooks] - https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: Mathew.onipe)
[11:55:35] !log installing avahi security updates
[11:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:59:18] (CR) Arturo Borrero Gonzalez: [C: +2] "> Just for learning purposes, why are these variables defined here" [puppet] - https://gerrit.wikimedia.org/r/490295 (https://phabricator.wikimedia.org/T215968) (owner: Arturo Borrero Gonzalez)
[12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190213T1200).
[12:00:04] No GERRIT patches in the queue for this window AFAICS.
[12:00:24] volans: thanks !
[12:00:28] Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (Volans) >>! In T205897#4950310, @ayounsi wrote: > The plan is to then delete the existing cables and import them again with all their attributes. Sounds good to me this one-time import. > Some notes/questions: > * How sh...
[12:00:56] gehel, onimisionipe: you welcome, I did not perform extensive tests yet, I'll probably do it later on today
[12:01:37] I'll do some testing as well
[12:08:05] (PS3) Mathew.onipe: Add wdqs data transfer cookbook [cookbooks] - https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401)
[12:10:13] PROBLEM - Disk space on cloudvirt1018 is CRITICAL: DISK CRITICAL - /var/lib/nova/instances is not accessible: Input/output error
[12:10:41] womp womp
[12:10:43] <_joe_> ok, why is this paging me?
[12:10:58] <_joe_> also, clearly an issue, but why is it paging everyone?
[12:11:13] <_joe_> didn't we remove the critical-for-all for cloudvirt?
[12:11:22] (PS4) Mathew.onipe: Add wdqs data transfer cookbook [cookbooks] - https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401)
[12:11:46] <_joe_> the disk is gone on cloudvirt1018
[12:11:48] degraded raid already autofiled for that host
[12:11:49] not sure, looks legit yeah
[12:12:04] https://phabricator.wikimedia.org/T216004
[12:12:07] ACKNOWLEDGEMENT - Disk space on cloudvirt1018 is CRITICAL: DISK CRITICAL - /var/lib/nova/instances is not accessible: Input/output error Arturo Borrero Gonzalez T216004
[12:12:13] <_joe_> well now it's worse than degraded
[12:12:14] <_joe_> :P
[12:12:36] <_joe_> arturo: do you know how you set that alert to paging?
[12:13:02] no idea, didn't check
[12:22:53] PROBLEM - Check systemd state on cloudvirt1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:23:35] (PS5) Mathew.onipe: Add wdqs data transfer cookbook [cookbooks] - https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401)
[12:25:52] Operations, Analytics, Research-management, User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (elukey) so the amdkfd module is built by the `rock-dkml` package (when installing) grabbing the current kernel headers. Since I didn't install them (for both kernels...
[12:29:31] (Abandoned) Muehlenhoff: Also limit file size in mediawiki profile [puppet] - https://gerrit.wikimedia.org/r/481140 (owner: Muehlenhoff)
[12:29:47] (PS3) Muehlenhoff: Enable base::service_auto_restart for SSH [puppet] - https://gerrit.wikimedia.org/r/444230 (https://phabricator.wikimedia.org/T135991)
[12:32:17] !log T216030 T216004 rebooting cloudvirt1018
[12:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:21] T216030: cloudvps: evaluate draining cloudvirt1018 - https://phabricator.wikimedia.org/T216030
[12:32:21] T216004: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T216004
[12:33:05] PROBLEM - Host cloudvirt1018 is DOWN: PING CRITICAL - Packet loss = 100%
[12:33:24] could you please set downtime before you reboot hosts that page? :)
[12:33:39] * apergos peeks in... ah
[12:33:49] ACKNOWLEDGEMENT - Host cloudvirt1018 is DOWN: PING CRITICAL - Packet loss = 100% Arturo Borrero Gonzalez T216030
[12:34:42] !log T216030 icinga downtime cloudvirt1018 for 2 weeks
[12:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:17] (PS2) Jcrespo: mariadb: Depool es1014 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/490021
[12:35:28] mark: done, sorry for the noise
[12:36:30] thanks :)
[12:46:00] Operations, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T216004 (aborrero) After a reboot, I got this in the console: ` 13:43 The following VDs have missing disks: 000 13:43 If you proceed (or load the configuration utility), these VDs 13:43 T...
[12:47:37] Operations, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T216004 (aborrero) {F28209182} There are apparently 2 failed disks?
[12:51:03] (PS3) Jcrespo: mariadb: Depool es1014 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/490021
[12:52:21] (CR) Giuseppe Lavagetto: "I think this was long overdue, so thanks for embarking in the process. I have some comments - some general, some nitpicks, and a couple th" (6 comments) [puppet] - https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) (owner: Jbond)
[12:53:06] Operations, ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T215892 (Cmjohnson) A ticket with Dell has been created You have successfully submitted request SR986375888.
[12:56:37] jenkins is not building my patch https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/488256
[12:57:54] (CR) Muehlenhoff: [C: +1] toolforge: Use a really old version of kubectl for the current k8s [puppet] - https://gerrit.wikimedia.org/r/489291 (https://phabricator.wikimedia.org/T215586) (owner: Bstorm)
[13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190213T1300)
[13:00:53] (CR) Jcrespo: [C: +2] mariadb: Depool es1014 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/490021 (owner: Jcrespo)
[13:02:26] Operations, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T216004 (Cmjohnson) a ticket with Dell has been submitted to replace both SSDs You have successfully submitted request SR986376069.
[13:03:40] Operations, ExternalGuidance, Traffic, MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (dr0ptp4kt) For those following along, I ran a query to get a sense of global usage of Google...
[13:06:19] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T216004 (10aborrero) [13:07:35] (03CR) 10DCausse: [C: 04-1] "hebrew seems to be broken" [software/elasticsearch/plugins] (5.x) - 10https://gerrit.wikimedia.org/r/490296 (https://phabricator.wikimedia.org/T215932) (owner: 10DCausse) [13:20:28] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] mariadb: Depool es1014 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490021 (owner: 10Jcrespo) [13:20:51] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T215989 (10Volans) [13:20:55] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T216004 (10Volans) [13:21:05] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T215994 (10Volans) [13:21:07] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T216004 (10Volans) [13:21:20] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T215996 (10Volans) [13:21:22] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T216004 (10Volans) [13:21:32] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T215997 (10Volans) [13:21:34] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T216004 (10Volans) [13:22:56] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T216004 (10Volans) I've merged the other auto-generated tasks into this one. FYI, as you can see on some of the others (where NRPE didn't timeout) the RAID was rebuilding. 
[13:25:35] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T215892 (10Volans) [13:32:53] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1120 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490321 [13:33:00] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T215718 (10Volans) [13:33:02] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T215892 (10Volans) [13:33:23] (03PS21) 10Jbond: Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) [13:35:35] (03CR) 10jenkins-bot: mariadb: Depool es1014 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490021 (owner: 10Jcrespo) [13:37:29] RECOVERY - Host cloudvirt1018 is UP: PING OK - Packet loss = 0%, RTA = 36.23 ms [13:37:51] RECOVERY - Check systemd state on cloudvirt1018 is OK: OK - running: The system is fully operational [13:38:05] RECOVERY - Disk space on cloudvirt1018 is OK: DISK OK [13:38:19] (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for SSH [puppet] - 10https://gerrit.wikimedia.org/r/444230 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:38:25] RECOVERY - MegaRAID on cloudvirt1018 is OK: OK: optimal, 1 logical, 10 physical [13:38:38] pages for the recovery :-D ah well [13:40:45] (03CR) 10Jbond: "Thanks for the review" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) (owner: 10Jbond) [13:41:28] (03PS22) 10Jbond: Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) [13:41:50] (03PS1) 10Filippo Giunchedi: install_server: use stretch for 
prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/490325 (https://phabricator.wikimedia.org/T187987) [13:42:31] (03PS2) 10Jcrespo: Revert "mariadb: Depool db1120 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490321 [13:43:07] PROBLEM - puppet last run on labtestnet2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:43:21] PROBLEM - puppet last run on labvirt1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:43:41] (03PS1) 10Zppix: Remove mediawikiwiki from wgCentralAuthAutoCreateWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490326 [13:43:55] PROBLEM - puppet last run on labnet1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:44:53] PROBLEM - puppet last run on labtestnet2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:45:09] meh, fixing [13:45:30] (03PS2) 10Zppix: Remove mediawikiwiki from wgCentralAuthAutoCreateWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490326 [13:46:36] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10BBlack) >>! In T205897#4950741, @Volans wrote: >>>! In T205897#4950310, @ayounsi wrote: >> * How should we name server interfaces? The physical Port 1, Port 2, etc. or the Linux naming (enp5s0f0, enp5s0f1, etc) >> My vote... [13:47:08] (03PS1) 10Muehlenhoff: Make base::service_auto_restart for SSH conditional on jessie as it depends on systemd [puppet] - 10https://gerrit.wikimedia.org/r/490327 [13:47:39] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:48:21] PROBLEM - puppet last run on labvirt1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:48:25] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:48:37] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:48:48] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T215892 (10Volans) It seems that PD3 is totally gone, from `sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli -a`: ` PD: 3 Information PD: 4 Information Enclosure Device ID... [13:49:30] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T216004 (10aborrero) The server is back online: ` 14:43 slot 2: failed / slot 3: offline. I forced slot 2 offline and then online. I then forced slot 3 online. Reboot serve... [13:50:14] argh, test-prio in zuul is backing up, waiting on https://integration.wikimedia.org/ci/job/operations-mw-config-composer-test-docker/12236/ [13:50:15] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:50:27] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:50:30] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/490327 (owner: 10Muehlenhoff) [13:51:15] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:52:54] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Make base::service_auto_restart for SSH conditional on jessie as it depends on systemd [puppet] - 10https://gerrit.wikimedia.org/r/490327 (owner: 10Muehlenhoff) [13:54:19] PROBLEM - puppet last run on labvirt1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:54:27] PROBLEM - puppet last run on dbstore1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:54:28] (03PS2) 10DCausse: Plugins for elastic 5.6.14 [software/elasticsearch/plugins] (5.x) - 10https://gerrit.wikimedia.org/r/490296 (https://phabricator.wikimedia.org/T215932) [13:55:16] (03PS2) 10Filippo Giunchedi: install_server: use stretch for prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/490325 (https://phabricator.wikimedia.org/T187987) [13:55:53] PROBLEM - puppet last run on labvirt1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:55:58] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] "test-prio in CI is backed up at the moment" [puppet] - 10https://gerrit.wikimedia.org/r/490325 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi) [13:56:09] PROBLEM - puppet last run on labvirt1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:58:29] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:59:35] RECOVERY - puppet last run on labnet1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190213T1400) [14:03:06] (03PS3) 10Gehel: cassandra: package should be installed after apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/490303 (https://phabricator.wikimedia.org/T214073) [14:03:31] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Revert "mariadb: Depool db1120 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490321 (owner: 10Jcrespo) [14:03:41] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:05:45] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1120, depool es1014 (duration: 00m 52s) [14:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:12] !log cancel https://integration.wikimedia.org/ci/job/operations-mw-config-composer-test-docker/12236 to unblock test-prio zuul queue [14:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:28] (03CR) 10jerkins-bot: [V: 04-1] Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) (owner: 10Jbond) [14:08:39] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1120 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490321 (owner: 10Jcrespo) [14:10:52] (03PS23) 10Jbond: Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) [14:13:38] (03PS1) 10Elukey: Set Debian installer to Stretch for stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/490329 
(https://phabricator.wikimedia.org/T148843) [14:13:47] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:14:31] RECOVERY - puppet last run on labtestnet2003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:14:33] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [14:14:45] RECOVERY - puppet last run on labvirt1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:14:47] RECOVERY - puppet last run on labvirt1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:15] RECOVERY - puppet last run on labtestnet2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:16:23] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:17:25] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:34] (03PS13) 10Vgutierrez: acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) [14:18:38] (03CR) 10Elukey: [V: 03+2 C: 03+2] Set Debian installer to Stretch for stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/490329 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [14:18:59] (03CR) 10Vgutierrez: "Thanks for the review <3" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [14:19:13] is there any issue with jenkins? 
[14:19:25] there is, I'm writing a task atm [14:19:32] thanks a lot :) [14:19:43] RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [14:20:44] 10Operations, 10Continuous-Integration-Infrastructure: jenkins / zuul backing up due to jenkins slaves down - https://phabricator.wikimedia.org/T216039 (10fgiunchedi) [14:20:49] elukey: ^ [14:20:53] <3 [14:21:16] ahhh due to the cloudvirt instance down [14:21:18] yes yes [14:21:19] makes sense [14:21:46] (03CR) 10Vgutierrez: [C: 03+1] sre.hosts: add varnish upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/480103 (https://phabricator.wikimedia.org/T205886) (owner: 10Ema) [14:21:51] RECOVERY - puppet last run on labvirt1009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:25:06] !log reimage stat1005 back to stretch to test GPU drivers [14:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:43] RECOVERY - puppet last run on labvirt1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:25:53] RECOVERY - puppet last run on dbstore1002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [14:27:19] RECOVERY - puppet last run on labvirt1007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:27:30] o/ if anyone can help me figure out how to regain my 'ottomata' nick, i'd be much obliged [14:27:35] RECOVERY - puppet last run on labvirt1008 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:28:38] (03PS11) 10Vgutierrez: site: Switch acmechief[12]001 to acme_chief role [puppet] - 10https://gerrit.wikimedia.org/r/489720 (https://phabricator.wikimedia.org/T207389) [14:29:46] 10Operations, 10Continuous-Integration-Infrastructure: jenkins / zuul backing up due to jenkins slaves down - https://phabricator.wikimedia.org/T216039 (10fgiunchedi) p:05Triage→03High
[14:31:07] (03CR) 10Vgutierrez: [C: 03+1] "> The secrets have already been duplicated in the private repo?" [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [14:33:11] so yeah don't wait on jenkins for puppet ATM [14:33:45] going to take a break to cook, back in 45 minutes or so [14:33:52] ottomata_: o/ - I usually (from irssi) re-authenticate with nick server and set nick with /nick [14:33:59] (should provide food for several days) [14:34:26] (03CR) 10jerkins-bot: [V: 04-1] cassandra: package should be installed after apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/490303 (https://phabricator.wikimedia.org/T214073) (owner: 10Gehel) [14:35:23] (03CR) 10Ottomata: "Great thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/490287 (owner: 10Filippo Giunchedi) [14:36:31] elukey: problem is that freenode thinks ottomata is in use [14:36:37] but nickserv doesn't have ottomata registered [14:37:07] because I DROPed it from nickserv when I was trying to associate it to a different email address (aotto@ instead of otto@) [14:37:16] (that was a mistake I know now) [14:37:30] I need to become ottomata in order to register with nickserv [14:37:37] ah yes so there is a way to force nick serv to log you off, if you authenticate [14:37:54] nickserv isn't relevant here. 
nickserv says 'ottomata is not a registered nick' [14:37:57] but freenode says [14:37:59] ottomata is in use [14:38:19] (03PS1) 10Gehel: elasticsearch: check mean shard size intead of largest [puppet] - 10https://gerrit.wikimedia.org/r/490333 [14:38:29] so i need to force freenode to log ottomata off [14:38:45] i can't do anything with ottomata on nickserv, because it isn't registered [14:39:25] I have no idea then [14:39:29] (03PS24) 10Jbond: Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) [14:40:03] (03PS4) 10Gehel: cassandra: package should be installed after apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/490303 (https://phabricator.wikimedia.org/T214073) [14:41:18] 10Operations, 10Continuous-Integration-Infrastructure: jenkins / zuul backing up due to jenkins slaves down - https://phabricator.wikimedia.org/T216039 (10thcipriani) Ugh. For the moment (while integration-castor03 is down) I've modified castor-save-workspace-cache to be a no-op (exit 0) and run on nodes label... [14:41:20] 10Operations: Sort out which RAID packages are still needed - https://phabricator.wikimedia.org/T216043 (10MoritzMuehlenhoff) [14:42:42] 10Operations: Sort out which RAID packages are still needed - https://phabricator.wikimedia.org/T216043 (10CDanis) I'm not sure if it's helpful here, or if you know this already, but there is a `raid` fact in facter that is an array of the types of RAID on a given machine. [14:42:44] (03CR) 10Alexandros Kosiaris: "Sure, but I am not sure I understand the intent behind the last comment." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/488800 (owner: 10Alexandros Kosiaris) [14:45:05] 10Operations: Sort out which RAID packages are still needed - https://phabricator.wikimedia.org/T216043 (10MoritzMuehlenhoff) Yeah, I know, that one correctly pulls in a number of debs which are actually in puppet, but there's a number of additional ones which need a closer look (e.g. hpacucli or hpssa which are... [14:45:26] (03CR) 10Gehel: [C: 03+2] "Checksums verified, all looks good." [software/elasticsearch/plugins] (5.x) - 10https://gerrit.wikimedia.org/r/490296 (https://phabricator.wikimedia.org/T215932) (owner: 10DCausse) [14:45:36] (03Abandoned) 10Fsero: (WIP) local puppet compiler docker-compose [puppet] - 10https://gerrit.wikimedia.org/r/477583 (owner: 10Fsero) [14:46:21] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [14:46:34] (03PS14) 10Vgutierrez: acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) [14:46:50] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T215892 (10Volans) Sorry I have to amend what I said above, both PD0 and PD3 are missing. I'm sending a patch to improve the get-raid-status-megacli script. With it the new output would have b... 
[14:47:36] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [14:48:51] (03CR) 10Vgutierrez: [C: 03+2] site: Switch acmechief[12]001 to acme_chief role [puppet] - 10https://gerrit.wikimedia.org/r/489720 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [14:48:59] (03PS12) 10Vgutierrez: site: Switch acmechief[12]001 to acme_chief role [puppet] - 10https://gerrit.wikimedia.org/r/489720 (https://phabricator.wikimedia.org/T207389) [14:49:51] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] site: Switch acmechief[12]001 to acme_chief role [puppet] - 10https://gerrit.wikimedia.org/r/489720 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [14:53:50] !log otto@deploy1001 scap-helm eventgate-analytics install -n staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [14:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:52] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [14:53:52] !log otto@deploy1001 scap-helm eventgate-analytics finished [14:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:37] (03PS25) 10Jbond: Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) [15:00:23] (03PS1) 10Volans: raid_handler: fix reported executed script [puppet] - 10https://gerrit.wikimedia.org/r/490337 [15:00:25] (03PS1) 10Volans: raid: improve megacli get raid script [puppet] - 10https://gerrit.wikimedia.org/r/490338 [15:02:10] (03CR) 10Volans: "example output available here: 
https://phabricator.wikimedia.org/T215892#4951158" [puppet] - 10https://gerrit.wikimedia.org/r/490338 (owner: 10Volans) [15:04:10] (03CR) 10Volans: "An example of the wrong reported executable is available here:" [puppet] - 10https://gerrit.wikimedia.org/r/490337 (owner: 10Volans) [15:05:11] !log akosiaris@deploy1001 scap-helm eventgate-analytics install -n staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics --dry-run --debug [namespace: eventgate-analytics, clusters: staging] [15:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:26] !log akosiaris@deploy1001 scap-helm eventgate-analytics install --dry-run --debug -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [15:05:26] !log akosiaris@deploy1001 scap-helm eventgate-analytics cluster staging completed [15:05:26] !log akosiaris@deploy1001 scap-helm eventgate-analytics finished [15:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:16] (03PS1) 10Vgutierrez: uwsgi: Take buster into account [puppet] - 10https://gerrit.wikimedia.org/r/490339 (https://phabricator.wikimedia.org/T215925) [15:08:26] !log akosiaris@deploy1001 scap-helm eventgate-analytics install --dry-run --debug -f /srv/scap-helm/eventgate/eventgate-analytics-staging-values.yaml ../ [namespace: eventgate-analytics, clusters: staging] [15:08:26] !log akosiaris@deploy1001 scap-helm eventgate-analytics cluster staging completed [15:08:27] !log akosiaris@deploy1001 scap-helm eventgate-analytics finished [15:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:48] (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: check mean shard size intead of largest [puppet] - 10https://gerrit.wikimedia.org/r/490333 (owner: 10Gehel) [15:09:11] (03PS9) 10Volans: sre.hosts: add varnish upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/480103 (https://phabricator.wikimedia.org/T205886) (owner: 10Ema) [15:09:58] !log akosiaris@deploy1001 scap-helm eventgate-analytics install -f /srv/scap-helm/eventgate/eventgate-analytics-staging-values.yaml ../ [namespace: eventgate-analytics, clusters: staging] [15:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:27] !log akosiaris@deploy1001 scap-helm eventgate-analytics install -f /srv/scap-helm/eventgate/eventgate-analytics-staging-values.yaml --set service.port=31193 ../ [namespace: eventgate-analytics, clusters: staging] [15:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:28] !log akosiaris@deploy1001 scap-helm eventgate-analytics cluster staging completed [15:10:28] !log akosiaris@deploy1001 scap-helm eventgate-analytics finished [15:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:59] (03CR) 10Filippo Giunchedi: "Thanks for the POC!" 
[puppet] - 10https://gerrit.wikimedia.org/r/490193 (owner: 10Herron) [15:11:09] (03CR) 10Volans: [C: 03+2] sre.hosts: add varnish upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/480103 (https://phabricator.wikimedia.org/T205886) (owner: 10Ema) [15:12:16] (03CR) 10Muehlenhoff: [C: 04-1] uwsgi: Take buster into account (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/490339 (https://phabricator.wikimedia.org/T215925) (owner: 10Vgutierrez) [15:13:25] (03Merged) 10jenkins-bot: sre.hosts: add varnish upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/480103 (https://phabricator.wikimedia.org/T205886) (owner: 10Ema) [15:13:34] (03CR) 10Muehlenhoff: [C: 04-1] uwsgi: Take buster into account (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/490339 (https://phabricator.wikimedia.org/T215925) (owner: 10Vgutierrez) [15:15:35] (03CR) 10CRusnov: [C: 03+1] "Looks good. Really straight forward change." [puppet] - 10https://gerrit.wikimedia.org/r/490292 (owner: 10Volans) [15:16:29] (03PS2) 10Vgutierrez: uwsgi: Take buster into account [puppet] - 10https://gerrit.wikimedia.org/r/490339 (https://phabricator.wikimedia.org/T215925) [15:16:39] (03PS3) 10Volans: netbox: set ADMINS to default empty list [puppet] - 10https://gerrit.wikimedia.org/r/490292 [15:16:59] !log updated thirdparty/php72 component to PHP 7.2.15 [15:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:07] !log akosiaris@deploy1001 scap-helm eventgate-analytics install -f /srv/scap-helm/eventgate/eventgate-analytics-staging-values.yaml --set service.port=31193 ../ [namespace: eventgate-analytics, clusters: staging] [15:17:08] !log akosiaris@deploy1001 scap-helm eventgate-analytics cluster staging completed [15:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:08] !log akosiaris@deploy1001 scap-helm eventgate-analytics finished [15:17:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:35] (03CR) 10Volans: [C: 03+2] netbox: set ADMINS to default empty list [puppet] - 10https://gerrit.wikimedia.org/r/490292 (owner: 10Volans) [15:17:37] (03CR) 10Muehlenhoff: [C: 03+1] uwsgi: Take buster into account [puppet] - 10https://gerrit.wikimedia.org/r/490339 (https://phabricator.wikimedia.org/T215925) (owner: 10Vgutierrez) [15:17:49] (03PS3) 10Vgutierrez: uwsgi: Take buster into account [puppet] - 10https://gerrit.wikimedia.org/r/490339 (https://phabricator.wikimedia.org/T215925) [15:19:20] PROBLEM - Check correctness of the icinga configuration on icinga2001 is CRITICAL: Icinga configuration contains errors [15:20:04] (03CR) 10Giuseppe Lavagetto: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/490077 (owner: 10Giuseppe Lavagetto) [15:20:41] vgutierrez: Error: Could not find any hostgroup matching 'acmechief_codfw' (config file '/etc/icinga/objects/puppet_hosts.cfg', starting on line 469) [15:20:44] ^^^ [15:20:48] 10Operations, 10Discovery-Search (Current work): Create new elastic56 component in reprepro and upload elasticsearch and plugins - https://phabricator.wikimedia.org/T216047 (10Gehel) p:05Triage→03High [15:21:20] volans: on it, thx [15:21:33] (03PS1) 10Gehel: aptrepo: create new elastic56 component [puppet] - 10https://gerrit.wikimedia.org/r/490341 (https://phabricator.wikimedia.org/T216047) [15:22:08] (03PS1) 10Alexandros Kosiaris: eventgate: Add single quotes and don't chomp whitespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/490342 [15:22:10] (03PS1) 10Alexandros Kosiaris: Bump eventgate-analytics version number [deployment-charts] - 10https://gerrit.wikimedia.org/r/490343 [15:22:35] (03CR) 10Filippo Giunchedi: "This approach to me seems easier to maintain and understand, we'll likely have to make some tweaks e.g. 
to topic retention for debug level" [puppet] - 10https://gerrit.wikimedia.org/r/490198 (owner: 10Herron) [15:23:01] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] eventgate: Add single quotes and don't chomp whitespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/490342 (owner: 10Alexandros Kosiaris) [15:23:17] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Bump eventgate-analytics version number [deployment-charts] - 10https://gerrit.wikimedia.org/r/490343 (owner: 10Alexandros Kosiaris) [15:23:55] vgutierrez: shoot if you need help, is the usual group missing ;) [15:24:13] volans: I guess hieradata/common/monitoring.yaml will do the trick [15:24:29] yes indeed [15:25:06] (03PS1) 10Vgutierrez: monitoring: Add missing monitoring group for acmechief [puppet] - 10https://gerrit.wikimedia.org/r/490344 (https://phabricator.wikimedia.org/T207389) [15:25:10] there you go volans ^^ [15:25:57] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/490344 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [15:26:05] (03CR) 10Vgutierrez: [C: 03+2] monitoring: Add missing monitoring group for acmechief [puppet] - 10https://gerrit.wikimedia.org/r/490344 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [15:26:22] 10Operations, 10Continuous-Integration-Infrastructure: jenkins / zuul backing up due to jenkins slaves down - https://phabricator.wikimedia.org/T216039 (10jcrespo) This is T216030 https://lists.wikimedia.org/pipermail/cloud/2019-February/000538.html [15:27:14] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging stable/eventgate-analytics [namespace: eventgate-analytics, clusters: eqiad,codfw] [15:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:26] (03PS1) 10Gehel: elasticsearch: add parameter to set elasticsearch version [puppet] - 10https://gerrit.wikimedia.org/r/490345 (https://phabricator.wikimedia.org/T216047) [15:27:54] !log otto@deploy1001 
scap-helm eventgate-analytics upgrade staging stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [15:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:55] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [15:27:56] !log otto@deploy1001 scap-helm eventgate-analytics finished [15:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:13] (03PS2) 10Herron: lists:drop if unknown host issues mail from cmd containing our domain [puppet] - 10https://gerrit.wikimedia.org/r/490200 (https://phabricator.wikimedia.org/T215251) [15:28:50] !log stop and upgrade es1014 [15:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:38] RECOVERY - Check correctness of the icinga configuration on icinga2001 is OK: Icinga configuration is correct [15:30:18] vgutierrez: thanks! [15:30:29] np, sorry about that [15:30:44] I got that done for me for certcentral so I didn't realize that part was missing [15:31:02] (03CR) 10Filippo Giunchedi: [C: 04-1] "see inline" (031 comment) [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/490109 (owner: 10Giuseppe Lavagetto) [15:31:38] (03CR) 10Vgutierrez: [C: 03+2] uwsgi: Take buster into account (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/490339 (https://phabricator.wikimedia.org/T215925) (owner: 10Vgutierrez) [15:31:46] PROBLEM - puppet last run on phab1002 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 6 minutes ago with 8 failures. 
Failed resources (up to 3 shown): File[/etc/php/7.2/cli/conf.d/10-mysqlnd.ini],File[/etc/php/7.2/fpm/conf.d/10-mysqlnd.ini],File[/etc/php/7.2/cli/conf.d/15-xml.ini],File[/etc/php/7.2/fpm/conf.d/15-xml.ini] [15:31:47] (PS4) Vgutierrez: uwsgi: Take buster into account [puppet] - https://gerrit.wikimedia.org/r/490339 (https://phabricator.wikimedia.org/T215925) [15:32:26] someone maybe testing 7.2 on phab?^ [15:35:36] Operations, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T216004 (GTirloni) /var/lib/nova/instances took some damage today: ` root@cloudvirt1018:~# xfs_repair /dev/mapper/tank-data Phase 1 - find and verify superblock... Phase 2 - using... [15:36:43] (PS1) Volans: graphite: use unique descriptions for icinga check [puppet] - https://gerrit.wikimedia.org/r/490346 (https://phabricator.wikimedia.org/T211692) [15:37:17] (PS1) Jbond: Add jbond to sms alerts [puppet] - https://gerrit.wikimedia.org/r/490347 [15:37:31] (CR) jerkins-bot: [V: -1] graphite: use unique descriptions for icinga check [puppet] - https://gerrit.wikimedia.org/r/490346 (https://phabricator.wikimedia.org/T211692) (owner: Volans) [15:38:17] (PS2) Jbond: Add jbond to sms alerts [puppet] - https://gerrit.wikimedia.org/r/490347 [15:38:25] (PS2) Volans: graphite: use unique descriptions for icinga check [puppet] - https://gerrit.wikimedia.org/r/490346 (https://phabricator.wikimedia.org/T211692) [15:38:34] (CR) Vgutierrez: [C: +2] authdns: Allow acme-chief servers to fulfill ACME DNS challenges [puppet] - https://gerrit.wikimedia.org/r/490286 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez) [15:38:42] (PS2) Vgutierrez: authdns: Allow acme-chief servers to fulfill ACME DNS challenges [puppet] - https://gerrit.wikimedia.org/r/490286 (https://phabricator.wikimedia.org/T207389) [15:39:42] (CR) Mathew.onipe: [C: +1] elasticsearch: add parameter to set 
elasticsearch version [puppet] - https://gerrit.wikimedia.org/r/490345 (https://phabricator.wikimedia.org/T216047) (owner: Gehel) [15:40:16] (CR) DCausse: [C: +1] aptrepo: create new elastic56 component [puppet] - https://gerrit.wikimedia.org/r/490341 (https://phabricator.wikimedia.org/T216047) (owner: Gehel) [15:40:27] (CR) Jbond: [C: +2] Add jbond to sms alerts [puppet] - https://gerrit.wikimedia.org/r/490347 (owner: Jbond) [15:41:04] (PS3) Vgutierrez: authdns: Allow acme-chief servers to fulfill ACME DNS challenges [puppet] - https://gerrit.wikimedia.org/r/490286 (https://phabricator.wikimedia.org/T207389) [15:41:14] (CR) DCausse: [C: +1] "could we set 5.6 for deployment-prep?" [puppet] - https://gerrit.wikimedia.org/r/490345 (https://phabricator.wikimedia.org/T216047) (owner: Gehel) [15:41:21] * vgutierrez fighting to merge a CR [15:42:06] (PS2) Giuseppe Lavagetto: profile::mediawiki::php: allow to fine-tune request timeouts [puppet] - https://gerrit.wikimedia.org/r/490077 [15:42:23] (CR) Filippo Giunchedi: [C: +1] "Thanks for adding tests too!" [puppet] - https://gerrit.wikimedia.org/r/490303 (https://phabricator.wikimedia.org/T214073) (owner: Gehel) [15:43:38] (CR) Mathew.onipe: [C: +1] cassandra: package should be installed after apt-get update [puppet] - https://gerrit.wikimedia.org/r/490303 (https://phabricator.wikimedia.org/T214073) (owner: Gehel) [15:44:52] (CR) Gehel: "> could we set 5.6 for deployment-prep?" 
[puppet] - https://gerrit.wikimedia.org/r/490345 (https://phabricator.wikimedia.org/T216047) (owner: Gehel) [15:46:02] !log Stop MySQL on db1106 for onsite maintenance - this will generate lag on s1 labs - T214840 [15:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:06] T214840: db2085/db1106 don't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 [15:46:15] Operations, serviceops, Kubernetes, Patch-For-Review, User-fsero: Upgrade calico in production to version 2.4+ - https://phabricator.wikimedia.org/T207804 (fsero) [15:46:22] (CR) Giuseppe Lavagetto: [C: +2] profile::mediawiki::php: allow to fine-tune request timeouts [puppet] - https://gerrit.wikimedia.org/r/490077 (owner: Giuseppe Lavagetto) [15:46:32] (PS3) Giuseppe Lavagetto: profile::mediawiki::php: allow to fine-tune request timeouts [puppet] - https://gerrit.wikimedia.org/r/490077 [15:47:24] (PS1) Jcrespo: Revert "mariadb: Depool es1014 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/490350 [15:51:29] (PS1) Ottomata: Always use stdout logger so we can see logs in pod [deployment-charts] - https://gerrit.wikimedia.org/r/490351 (https://phabricator.wikimedia.org/T211247) [15:52:02] (CR) Filippo Giunchedi: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/490346 (https://phabricator.wikimedia.org/T211692) (owner: Volans) [15:53:30] (PS2) Ottomata: Always use stdout logger so we can see logs in pod [deployment-charts] - https://gerrit.wikimedia.org/r/490351 (https://phabricator.wikimedia.org/T211247) [15:55:57] (CR) Ottomata: [V: +2 C: +2] Always use stdout logger so we can see logs in pod [deployment-charts] - https://gerrit.wikimedia.org/r/490351 (https://phabricator.wikimedia.org/T211247) (owner: Ottomata) [15:56:48] https://commons.wikimedia.org/wiki/File:Edittools_JS_errors.png [15:57:02] should I open a report? 
^ [15:59:09] (PS1) Ottomata: eventgate-analytics: Use logstash.svc.eqiad.wmnet always [deployment-charts] - https://gerrit.wikimedia.org/r/490352 (https://phabricator.wikimedia.org/T211247) [15:59:19] (PS1) Muehlenhoff: Update timedatectl Icinga check for buster [puppet] - https://gerrit.wikimedia.org/r/490353 (https://phabricator.wikimedia.org/T213527) [15:59:37] (CR) Ottomata: [V: +2 C: +2] eventgate-analytics: Use logstash.svc.eqiad.wmnet always [deployment-charts] - https://gerrit.wikimedia.org/r/490352 (https://phabricator.wikimedia.org/T211247) (owner: Ottomata) [16:02:41] (PS3) Herron: lists:drop if unknown host issues mail from cmd containing our domain [puppet] - https://gerrit.wikimedia.org/r/490200 (https://phabricator.wikimedia.org/T215251) [16:03:28] (PS1) Alexandros Kosiaris: Bump eventgate-analytics version [deployment-charts] - https://gerrit.wikimedia.org/r/490354 [16:05:14] (PS2) Gehel: aptrepo: create new elastic56 component [puppet] - https://gerrit.wikimedia.org/r/490341 (https://phabricator.wikimedia.org/T216047) [16:05:59] (CR) Gehel: [C: +2] aptrepo: create new elastic56 component [puppet] - https://gerrit.wikimedia.org/r/490341 (https://phabricator.wikimedia.org/T216047) (owner: Gehel) [16:08:28] (CR) Ottomata: [V: +2 C: +2] Bump eventgate-analytics version [deployment-charts] - https://gerrit.wikimedia.org/r/490354 (owner: Alexandros Kosiaris) [16:11:57] (CR) Eevans: WIP: initial (strawman) configuration for session storage (1 comment) [puppet] - https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883) (owner: Eevans) [16:13:00] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [16:13:01] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [16:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:13:02] !log otto@deploy1001 scap-helm eventgate-analytics finished [16:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:07] (CR) Filippo Giunchedi: "LGTM" [puppet] - https://gerrit.wikimedia.org/r/490353 (https://phabricator.wikimedia.org/T213527) (owner: Muehlenhoff) [16:14:52] (PS4) Herron: lists:drop if unknown host issues mail from cmd containing our domain [puppet] - https://gerrit.wikimedia.org/r/490200 (https://phabricator.wikimedia.org/T215251) [16:15:46] (CR) Herron: [C: +2] lists:drop if unknown host issues mail from cmd containing our domain [puppet] - https://gerrit.wikimedia.org/r/490200 (https://phabricator.wikimedia.org/T215251) (owner: Herron) [16:19:05] (PS2) Gehel: elasticsearch: add parameter to set elasticsearch version [puppet] - https://gerrit.wikimedia.org/r/490345 (https://phabricator.wikimedia.org/T216047) [16:20:20] (CR) Gehel: [C: +2] elasticsearch: add parameter to set elasticsearch version [puppet] - https://gerrit.wikimedia.org/r/490345 (https://phabricator.wikimedia.org/T216047) (owner: Gehel) [16:21:29] Operations, Analytics, Research-management, Patch-For-Review, User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (elukey) After trying to configure the rock-dkml package on Stretch with 4.9 and 4.14, I found this: https://github.com/RadeonOpenCompute/ROCm/... 
[16:22:09] !log otto@deploy1001 scap-helm list [namespace: list, clusters: staging] [16:22:09] !log otto@deploy1001 scap-helm list cluster staging completed [16:22:09] !log otto@deploy1001 scap-helm list finished [16:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:35] (PS2) Muehlenhoff: Update timedatectl Icinga check for buster [puppet] - https://gerrit.wikimedia.org/r/490353 (https://phabricator.wikimedia.org/T213527) [16:23:07] (PS1) Elukey: Revert "Reimage stat1005 with buster" [puppet] - https://gerrit.wikimedia.org/r/490355 [16:23:20] (Abandoned) Elukey: Revert "Reimage stat1005 with buster" [puppet] - https://gerrit.wikimedia.org/r/490355 (owner: Elukey) [16:23:39] (PS1) Elukey: Revert "Set Debian installer to Stretch for stat1005" [puppet] - https://gerrit.wikimedia.org/r/490356 [16:23:46] (PS2) Elukey: Revert "Set Debian installer to Stretch for stat1005" [puppet] - https://gerrit.wikimedia.org/r/490356 [16:24:37] (CR) Elukey: [C: +2] Revert "Set Debian installer to Stretch for stat1005" [puppet] - https://gerrit.wikimedia.org/r/490356 (owner: Elukey) [16:24:56] (CR) Cwhite: [C: +1] "Looks good!" 
[puppet] - https://gerrit.wikimedia.org/r/490346 (https://phabricator.wikimedia.org/T211692) (owner: Volans) [16:25:22] (PS3) Muehlenhoff: Update timedatectl Icinga check for buster [puppet] - https://gerrit.wikimedia.org/r/490353 (https://phabricator.wikimedia.org/T213527) [16:25:33] (PS1) Ottomata: eventgate - Add stream-config to checksumed files [deployment-charts] - https://gerrit.wikimedia.org/r/490357 (https://phabricator.wikimedia.org/T211247) [16:26:07] Operations, MediaWiki-Cache, serviceops, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 5 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (jijiki) @Joe I updated the table above [16:27:01] (CR) Alexandros Kosiaris: [C: +1] eventgate - Add stream-config to checksumed files [deployment-charts] - https://gerrit.wikimedia.org/r/490357 (https://phabricator.wikimedia.org/T211247) (owner: Ottomata) [16:27:59] (CR) Muehlenhoff: [C: +2] Update timedatectl Icinga check for buster [puppet] - https://gerrit.wikimedia.org/r/490353 (https://phabricator.wikimedia.org/T213527) (owner: Muehlenhoff) [16:28:55] (CR) Ottomata: [V: +2 C: +2] eventgate - Add stream-config to checksumed files [deployment-charts] - https://gerrit.wikimedia.org/r/490357 (https://phabricator.wikimedia.org/T211247) (owner: Ottomata) [16:29:05] (PS12) Filippo Giunchedi: hieradata: use Prometheus 2 on prometheus2003 [puppet] - https://gerrit.wikimedia.org/r/486059 (https://phabricator.wikimedia.org/T187987) [16:30:25] !log reimage stat1005 to Debian Buster (again) [16:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:27] (PS3) Elukey: role::analytics_test_cluster::coord: add kafkatee instance [puppet] - https://gerrit.wikimedia.org/r/490067 (https://phabricator.wikimedia.org/T212259) [16:41:59] (CR) Filippo Giunchedi: [C: +2] hieradata: 
use Prometheus 2 on prometheus2003 [puppet] - https://gerrit.wikimedia.org/r/486059 (https://phabricator.wikimedia.org/T187987) (owner: Filippo Giunchedi) [16:42:07] (PS13) Filippo Giunchedi: hieradata: use Prometheus 2 on prometheus2003 [puppet] - https://gerrit.wikimedia.org/r/486059 (https://phabricator.wikimedia.org/T187987) [16:42:56] (CR) Elukey: [C: +2] role::analytics_test_cluster::coord: add kafkatee instance [puppet] - https://gerrit.wikimedia.org/r/490067 (https://phabricator.wikimedia.org/T212259) (owner: Elukey) [16:43:04] (PS4) Elukey: role::analytics_test_cluster::coord: add kafkatee instance [puppet] - https://gerrit.wikimedia.org/r/490067 (https://phabricator.wikimedia.org/T212259) [16:43:07] (CR) Elukey: [V: +2 C: +2] role::analytics_test_cluster::coord: add kafkatee instance [puppet] - https://gerrit.wikimedia.org/r/490067 (https://phabricator.wikimedia.org/T212259) (owner: Elukey) [16:51:04] (PS1) Jbond: Update the location of the contacts file [puppet] - https://gerrit.wikimedia.org/r/490362 (https://phabricator.wikimedia.org/T82937) [16:57:58] PROBLEM - Host mw1299.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:58:42] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [16:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:43] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [16:58:43] !log otto@deploy1001 scap-helm eventgate-analytics finished [16:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. 
Rise for Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190213T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:03:12] RECOVERY - Host mw1299.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.32 ms [17:04:08] RECOVERY - Host mw1299 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms [17:04:21] Operations, ops-eqiad: mw1299 is down (jobrunner-canary, now up but depooled) - https://phabricator.wikimedia.org/T215569 (Cmjohnson) a:Cmjohnson→RobH I replaced CPU1 with new. Powered the server on. Assigning to @robh to coordinate re-pooling and resolving Return shipping USPS 9202 3946 5301... [17:06:36] !log db1106, troubleshooting idrac issue and updating f/w [17:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:19] thanks cmjohnson1! [17:10:40] PROBLEM - mediawiki-installation DSH group on mw1299 is CRITICAL: Host mw1299 is not in mediawiki-installation dsh group [17:11:05] (Abandoned) Catrope: Enable ORES on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/489984 (https://phabricator.wikimedia.org/T215354) (owner: Catrope) [17:11:24] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:12:38] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:13:25] (CR) Eevans: WIP: initial (strawman) configuration for session storage (4 comments) [puppet] - https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883) (owner: Eevans) [17:13:42] (PS5) Eevans: WIP: initial (strawman) configuration for session storage [puppet] - https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883) [17:13:53] Operations, Analytics, Research-management, Patch-For-Review, User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (elukey) https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/issues - very interesting to see that other people opened bugs for NULL pointer... [17:14:19] (CR) jerkins-bot: [V: -1] WIP: initial (strawman) configuration for session storage [puppet] - https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883) (owner: Eevans) [17:14:40] urandom: jenkins is against you --^ [17:18:04] elukey: yeah, but I don't really understand it [17:18:57] urandom: ah yes you are adding a class to a role [17:20:26] Operations, ops-eqiad: mw1299 is down (jobrunner-canary, now up but depooled) - https://phabricator.wikimedia.org/T215569 (RobH) a:RobH→jijiki I've synced with @jijiki who is returning this to service and will comment on here. [17:20:52] urandom: do you need it? afaics it is contained in profile::cassandra no? 
[17:21:08] PROBLEM - HHVM rendering on mw1255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:21:33] (I mean ::password::cassandra) [17:21:41] (in the role) [17:22:07] elukey: so...TBH, I cargo-culted modules/role/manifests/restbase/base.pp here, but maybe it is wrong as well [17:22:12] RECOVERY - HHVM rendering on mw1255 is OK: HTTP OK: HTTP/1.1 200 OK - 74496 bytes in 0.263 second response time [17:22:55] urandom: so in theory profile::cassandra already includes ::password::cassandra [17:23:06] and you include ::profile::cassandra in the role [17:23:23] so if you remove ::password::cassandra from the role, then jenkins will be happy [17:24:35] (PS6) Eevans: WIP: initial (strawman) configuration for session storage [puppet] - https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883) [17:25:25] yep :) [17:25:30] !log Pooling mw1299 back - T215569 [17:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:33] T215569: mw1299 is down (jobrunner-canary, now up but depooled) - https://phabricator.wikimedia.org/T215569 [17:25:39] elukey: thanks :) [17:25:41] Operations, ops-eqiad: kafka1012 power supply alerts - https://phabricator.wikimedia.org/T216011 (Cmjohnson) Open→Declined This server came from a batch of r720's in 2011 and have faulty power modules on the main board. There are several like this. This server and all R720's from that batch... [17:25:52] elukey: I guess the restbase role is wrong too then [17:27:58] urandom: could be yes! [17:28:44] Operations, Discovery-Search, Wikimedia-Logstash: upgrade logstash and the logstash elasticsearch cluster to 5.6.14 - https://phabricator.wikimedia.org/T216052 (Gehel) p:Triage→High [17:31:06] (PS6) Jbond: Add rasdaemon service to systems which support it. 
[puppet] - https://gerrit.wikimedia.org/r/490042 (https://phabricator.wikimedia.org/T205396) [17:33:12] PROBLEM - Host ms-be1033.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:33:42] Operations, ops-eqiad: mw1299 is down (jobrunner-canary, now up but depooled) - https://phabricator.wikimedia.org/T215569 (jijiki) Open→Resolved Server is pooled. [17:33:44] If nobody is swat-ting, I'm self-servicing two patches [17:35:59] * Krinkle takes mwdebug1002 [17:39:41] (PS1) Vgutierrez: installserver: Deploy acme_chief TLS certificates [puppet] - https://gerrit.wikimedia.org/r/490371 (https://phabricator.wikimedia.org/T207389) [17:40:24] RECOVERY - Host ms-be1033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.64 ms [17:40:49] (PS1) Kosta Harlan: GrowthExperiments: Enable help panel search on kowiki and cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/490372 (https://phabricator.wikimedia.org/T209301) [17:43:47] (CR) Vgutierrez: [C: +1] "pcc looking sane: https://puppet-compiler.wmflabs.org/compiler1001/14651/" [puppet] - https://gerrit.wikimedia.org/r/490371 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez) [17:44:13] Operations, ops-eqiad: ms-be1033 down and not powering up - https://phabricator.wikimedia.org/T215998 (Cmjohnson) I physically cannot turn the server on either, I tried pulling the power and waiting 10 minutes but I just get a flashing green indicator at the power button. I am able to access the ilo's... 
[17:45:49] (PS7) Eevans: WIP: initial (strawman) configuration for session storage [puppet] - https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883) [17:45:53] (PS1) Vgutierrez: archiva: Deploy acme_chief TLS certificates [puppet] - https://gerrit.wikimedia.org/r/490374 (https://phabricator.wikimedia.org/T207389) [17:48:09] Operations, SRE-Access-Requests: Requesting access to Production Shell for julia.glen - https://phabricator.wikimedia.org/T215966 (Julia.glen) I created this ticket as instructed in https://wikitech.wikimedia.org/wiki/Production_shell_access to gain access to the sandbox to deploy the code. I am good t... [17:48:36] (PS2) Vgutierrez: archiva: Deploy acme_chief TLS certificates [puppet] - https://gerrit.wikimedia.org/r/490374 (https://phabricator.wikimedia.org/T207389) [17:49:47] !log Stop MYSQL on db1114 for onsite maintenance - T214720 [17:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:49] T214720: db1114 crashed - https://phabricator.wikimedia.org/T214720 [17:50:16] Operations, MediaWiki-Database, Performance-Team, Wikimedia-Logstash, and 5 others: MediaWiki errors overloading logstash - https://phabricator.wikimedia.org/T215611 (jcrespo) ^I don't have enough context for this patch, is configuration for regular servers setup to not produce DEBUG lines outsid... [17:51:00] (CR) Vgutierrez: [C: +1] "pcc looking good: https://puppet-compiler.wmflabs.org/compiler1001/14653/" [puppet] - https://gerrit.wikimedia.org/r/490374 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez) [17:51:37] (PS2) Jcrespo: Revert "mariadb: Depool es1014 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/490350 [17:52:05] * elukey off! 
[17:53:13] (PS6) D3r1ck01: Stop NavPopups gadget conflict with PagePreviews on Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/487007 (https://phabricator.wikimedia.org/T214878) [17:53:29] (CR) Jcrespo: [C: +2] Revert "mariadb: Depool es1014 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/490350 (owner: Jcrespo) [17:53:39] Operations, ops-codfw, DBA, Patch-For-Review: db2085/db1106 don't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (Marostegui) Chris has upgraded FW/BIOS on db1106 (thanks!) - so tomorrow I will do a few more reboots to keep debugging this. [17:53:49] Operations, SRE-Access-Requests: Requesting access to Production Shell for julia.glen - https://phabricator.wikimedia.org/T215966 (Gehel) @Julia.glen your access should already be working. I checked the logs on stat1007 and I don't see any login attempt from either `juliaglen` or `julia.glen`. Ping me on... [17:54:00] Operations, ops-eqiad, monitoring, Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (Cmjohnson) I requested a new CPU but w/out Dell's idrac log stating it's a CPU there is a good chance they will kick it back. You have successfully submitted request SR986384843. 
[17:54:49] (Merged) jenkins-bot: Revert "mariadb: Depool es1014 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/490350 (owner: Jcrespo) [17:55:53] (PS1) Filippo Giunchedi: prometheus: use rules_ops.yml for prometheus 2 [puppet] - https://gerrit.wikimedia.org/r/490375 (https://phabricator.wikimedia.org/T187987) [17:55:57] (PS1) Vgutierrez: dumps: Deploy acme_chief TLS certificates [puppet] - https://gerrit.wikimedia.org/r/490376 (https://phabricator.wikimedia.org/T207389) [17:57:02] Operations, Cloud-VPS, Toolforge, Traffic, Patch-For-Review: Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response) - https://phabricator.wikimedia.org/T213475 (bd808) Open→Resolved a:akosiaris [17:57:44] (PS36) Jbond: Create module for managing ulogd [puppet] - https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [17:57:47] (CR) jenkins-bot: Revert "mariadb: Depool es1014 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/490350 (owner: Jcrespo) [17:58:04] (CR) Vgutierrez: [C: +1] "pcc looking good: https://puppet-compiler.wmflabs.org/compiler1001/14654/" [puppet] - https://gerrit.wikimedia.org/r/490376 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez) [17:59:46] (PS1) Vgutierrez: gerrit: Deploy acme_chief TLS certificates [puppet] - https://gerrit.wikimedia.org/r/490379 (https://phabricator.wikimedia.org/T207389) [18:00:58] !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.17/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Id70fdfa62ef / T215611 (duration: 00m 55s) [18:01:03] AaronSchulz: ^ [18:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:05] T215611: MediaWiki errors overloading logstash - https://phabricator.wikimedia.org/T215611 [18:01:13] (PS3) Volans: graphite: use unique descriptions for icinga check [puppet] - 
https://gerrit.wikimedia.org/r/490346 (https://phabricator.wikimedia.org/T211692) [18:02:14] (CR) Volans: [C: +2] graphite: use unique descriptions for icinga check [puppet] - https://gerrit.wikimedia.org/r/490346 (https://phabricator.wikimedia.org/T211692) (owner: Volans) [18:03:32] (CR) Vgutierrez: [C: +1] "pcc looking good: https://puppet-compiler.wmflabs.org/compiler1001/14657/" [puppet] - https://gerrit.wikimedia.org/r/490379 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez) [18:03:51] Operations, ops-eqiad: ms-be1033 down and not powering up - https://phabricator.wikimedia.org/T215998 (Cmjohnson) A ticket has been opened with HPE Case ID: 5336351338 Case title: Failed Mother Board Severity 3-Normal Product serial number: MXQ70601RN Product number: 719061-B21 Submitted: 2/13/2019 1:0... [18:03:58] (PS2) Jbond: Update the location of the contacts file [puppet] - https://gerrit.wikimedia.org/r/490362 (https://phabricator.wikimedia.org/T82937) [18:05:18] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (Cmjohnson) @elukey is this a 1G or 10G rack? 
[18:05:27] (PS1) Vgutierrez: icinga: Deploy acme_chief TLS certificates [puppet] - https://gerrit.wikimedia.org/r/490380 (https://phabricator.wikimedia.org/T207389) [18:06:45] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (Cmjohnson) [18:06:58] !log reimage prometheus2003 - T187987 [18:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:01] T187987: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 [18:07:14] godog: \o/ [18:07:32] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/490362 (https://phabricator.wikimedia.org/T82937) (owner: Jbond) [18:07:45] Operations, ops-eqiad, monitoring, Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (RobH) >>! In T214760#4951769, @Cmjohnson wrote: > I requested a new CPU but w/out Dell's idrac log stating it's a CPU there is a good chance they will kick it back. > > You have... [18:08:39] (CR) Vgutierrez: [C: +1] "pcc looking as expected: https://puppet-compiler.wmflabs.org/compiler1001/14658/" [puppet] - https://gerrit.wikimedia.org/r/490380 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez) [18:08:57] (PS2) Filippo Giunchedi: prometheus: use rules_ops.yml for prometheus 2 [puppet] - https://gerrit.wikimedia.org/r/490375 (https://phabricator.wikimedia.org/T187987) [18:09:15] (PS7) Jbond: Add rasdaemon service to systems which support it. 
[puppet] - https://gerrit.wikimedia.org/r/490042 (https://phabricator.wikimedia.org/T205396) [18:09:45] (PS1) Vgutierrez: librenms: Deploy acme_chief TLS certificates [puppet] - https://gerrit.wikimedia.org/r/490381 (https://phabricator.wikimedia.org/T207389) [18:10:16] vgutierrez: can you give me the nutshell of why we need the acme chief cert (what is it, why do we need it also)? [18:10:21] thanks in advance [18:10:38] oh nm I am reading the bug report now [18:10:45] apergos: naming conflict ;) [18:10:50] RECOVERY - mediawiki-installation DSH group on mw1299 is OK: OK [18:10:51] this is what happens when my brain has already checked out for the day, d'oh [18:11:00] apergos: we are renaming certcentral to acme-chief, to do it in a safe way we will deploy first the acme-chief ones and then switch from the cc one to the acme-chief one [18:11:06] (CR) Jbond: [C: +2] Add rasdaemon service to systems which support it. [puppet] - https://gerrit.wikimedia.org/r/490042 (https://phabricator.wikimedia.org/T205396) (owner: Jbond) [18:11:06] it's actually the same certificate [18:11:08] right [18:11:13] just a new name [18:11:17] grarg [18:11:20] just different naming... sorry about the noise :( [18:12:31] yeah no worries [18:12:40] this looks ok to me, giving the thumbs up on the cr now [18:13:05] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool es1014 (duration: 00m 52s) [18:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:43] (CR) Vgutierrez: [C: +1] "pcc looking good: https://puppet-compiler.wmflabs.org/compiler1001/14660/" [puppet] - https://gerrit.wikimedia.org/r/490381 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez) [18:13:53] (CR) ArielGlenn: [C: +1] "Having read the ticket and checked the conf file, looks ok to me. 
Adding bstorm because the public-facing dumps content servers are the cl" [puppet] - https://gerrit.wikimedia.org/r/490376 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez) [18:16:25] (PS3) Filippo Giunchedi: prometheus: use yaml rules files for prometheus v2 [puppet] - https://gerrit.wikimedia.org/r/490375 (https://phabricator.wikimedia.org/T187987) [18:19:42] (PS1) Vgutierrez: lists: Deploy acme_chief TLS certificates [puppet] - https://gerrit.wikimedia.org/r/490382 (https://phabricator.wikimedia.org/T207389) [18:20:16] elukey around? [18:20:53] (PS4) Filippo Giunchedi: prometheus: use yaml rules files for prometheus v2 [puppet] - https://gerrit.wikimedia.org/r/490375 (https://phabricator.wikimedia.org/T187987) [18:22:45] (CR) Filippo Giunchedi: [C: +2] prometheus: use yaml rules files for prometheus v2 [puppet] - https://gerrit.wikimedia.org/r/490375 (https://phabricator.wikimedia.org/T187987) (owner: Filippo Giunchedi) [18:22:52] (PS5) Filippo Giunchedi: prometheus: use yaml rules files for prometheus v2 [puppet] - https://gerrit.wikimedia.org/r/490375 (https://phabricator.wikimedia.org/T187987) [18:23:50] Operations, ops-eqiad, DBA, Patch-For-Review: db1114 crashed - https://phabricator.wikimedia.org/T214720 (Cmjohnson) I updated the bios to the latest version as of February 11, 2019 v2.9.1 updated idrac to latest version 2.61.60.60 [18:24:02] (CR) Vgutierrez: [C: +1] "pcc looking as expected: https://puppet-compiler.wmflabs.org/compiler1001/14663/" [puppet] - https://gerrit.wikimedia.org/r/490382 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez) [18:26:44] Operations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 (CDanis) [18:28:04] Operations, Analytics, Discovery, Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (Nuria) Ok, I 
leave up to @fgiunchedi and @Ottomata to think about to how to productionize the "deployment" of model... [18:28:12] xSavitar: is there a plan for deploying https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/487007/ ? [18:29:03] (03CR) 10Framawiki: [C: 03+1] Remove mediawikiwiki from wgCentralAuthAutoCreateWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490326 (owner: 10Zppix) [18:30:10] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1114 crashed - https://phabricator.wikimedia.org/T214720 (10jcrespo) Thanks, Chris! [18:30:57] (03PS1) 10Mathew.onipe: elasticsearch: add type to delete query [software/spicerack] - 10https://gerrit.wikimedia.org/r/490383 (https://phabricator.wikimedia.org/T207920) [18:32:26] Beta-only puppet patch, looking for a reviewer, maybe _joe_or godog- https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/488524/ [18:36:27] (03PS9) 10Dzahn: testreduce: pin npm to backports, use install_options [puppet] - 10https://gerrit.wikimedia.org/r/486185 (https://phabricator.wikimedia.org/T201366) [18:40:43] Krinkle: not familiar with but I can merge if you like [18:40:54] ottomata: sure :) [18:41:02] (03PS3) 10Ottomata: mediawiki: Remove beta-cluster specific auto_prepend_file override [puppet] - 10https://gerrit.wikimedia.org/r/488524 (https://phabricator.wikimedia.org/T176370) (owner: 10Krinkle) [18:41:08] (03CR) 10Ottomata: [V: 03+2 C: 03+2] mediawiki: Remove beta-cluster specific auto_prepend_file override [puppet] - 10https://gerrit.wikimedia.org/r/488524 (https://phabricator.wikimedia.org/T176370) (owner: 10Krinkle) [18:41:43] (03PS10) 10Dzahn: testreduce: pin npm to backports, use install_options [puppet] - 10https://gerrit.wikimedia.org/r/486185 (https://phabricator.wikimedia.org/T201366) [18:41:49] (03PS4) 10Krinkle: PhpAutoPrepend: Remove PhpAutoPrepend-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486177 [18:42:46] (03CR) 10jerkins-bot: [V: 04-1] testreduce: pin npm to backports, use 
install_options [puppet] - 10https://gerrit.wikimedia.org/r/486185 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [18:43:02] (03CR) 10Krinkle: [C: 03+1] Remove mediawikiwiki from wgCentralAuthAutoCreateWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490326 (owner: 10Zppix) [18:43:42] (03CR) 10jerkins-bot: [V: 04-1] PhpAutoPrepend: Remove PhpAutoPrepend-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486177 (owner: 10Krinkle) [18:45:43] (03CR) 10Krinkle: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486177 (owner: 10Krinkle) [18:47:26] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (10EBernhardson) >>! In T148843#4947755, @elukey wrote: > Let's see if we can narrow down the packages needed: > > > - hsa-rocr-dev - AMD Hete... [18:49:00] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.16; 2019-02-05), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10aaron) >>! In T203786#4938798, @elukey wrote: > @... [18:54:18] 10Operations, 10monitoring, 10Graphite, 10Patch-For-Review: Duplicate definitions found in Icinga configuration - https://phabricator.wikimedia.org/T211692 (10Volans) 05Open→03Resolved a:03Volans [18:55:03] Can i have something deployed for the swat window quickly? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/490326 [18:55:55] 10Operations, 10Continuous-Integration-Infrastructure: jenkins / zuul backing up due to jenkins slaves down - https://phabricator.wikimedia.org/T216039 (10thcipriani) p:05High→03Normal Built a new integration-castor and undid my dirty hacks to the `castor-save-*` jobs. Lowering priority but leaving open: w... 
[18:59:34] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190213T1900) [19:04:12] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:05:01] (03PS2) 10Mathew.onipe: elasticsearch: add doc type to delete query [software/spicerack] - 10https://gerrit.wikimedia.org/r/490383 (https://phabricator.wikimedia.org/T207920) [19:08:56] (03PS1) 10Mathew.onipe: profile::maps: remove replication factor [puppet] - 10https://gerrit.wikimedia.org/r/490389 (https://phabricator.wikimedia.org/T215521) [19:11:44] (03PS11) 10Dzahn: testreduce: pin npm to backports, use install_options [puppet] - 10https://gerrit.wikimedia.org/r/486185 (https://phabricator.wikimedia.org/T201366) [19:11:56] (03CR) 10Jforrester: DNM Define Wikibase "entity sources" on beta commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490108 (https://phabricator.wikimedia.org/T214557) (owner: 10WMDE-leszek) [19:12:36] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:16:03] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T216004 (10GTirloni) @RobH @Cmjohnson Slots 2 & 3 were part of the outage today. Even though they show as online, could we replace them? They are likely in a pair so we'll need to... [19:17:32] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:23:46] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:24:44] 10Operations, 10Parsoid: decom ruthenium - https://phabricator.wikimedia.org/T216062 (10Dzahn) p:05Triage→03Normal [19:25:15] 10Operations, 10Parsoid: decom ruthenium - https://phabricator.wikimedia.org/T216062 (10Dzahn) [19:25:43] (03PS1) 10Dzahn: start to decom ruthenium, turn into spare [puppet] - 10https://gerrit.wikimedia.org/r/490391 (https://phabricator.wikimedia.org/T216062) [19:28:43] (03PS2) 10Ottomata: prometheus: make rules and alerts configuration backwards compatible in beta [puppet] - 10https://gerrit.wikimedia.org/r/488530 (owner: 10Cwhite) [19:30:10] PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:31:36] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:32:48] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:32:55] (03PS12) 10Dzahn: testreduce: pin npm to backports, use install_options [puppet] - 10https://gerrit.wikimedia.org/r/486185 (https://phabricator.wikimedia.org/T201366) [19:33:12] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/14664/scandium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/486185 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [19:33:46] RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:34:40] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:35:08] (03CR) 10jerkins-bot: [V: 04-1] testreduce: pin npm to backports, use install_options [puppet] - 10https://gerrit.wikimedia.org/r/486185 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [19:38:14] PROBLEM - Disk space on cloudvirt1024 is CRITICAL: DISK CRITICAL - /var/lib/nova/instances is not accessible: Input/output error [19:38:29] uh [19:38:42] arturo? or someone? [19:38:51] yeah [19:39:06] that's another degraded raid host right? [19:39:16] (just paged) [19:39:52] yes, we are on top of that, thanks apergos [19:40:13] ok, as long as we know the page can be ignored [19:40:30] um can someone ack it or something? [19:44:07] (03PS1) 10Ammarpad: Increase default thumb size to 260px on Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490395 (https://phabricator.wikimedia.org/T215106) [19:47:40] ACKNOWLEDGEMENT - Disk space on cloudvirt1024 is CRITICAL: DISK CRITICAL - /var/lib/nova/instances is not accessible: Input/output error andrew bogott This is terrible and we are looking into it [19:47:41] ACKNOWLEDGEMENT - MegaRAID on cloudvirt1024 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) andrew bogott This is terrible and we are looking into it [19:48:01] thanks ottomata [19:48:03] (03CR) 10Sjoerddebruin: [C: 03+1] "Can confirm consensus for this change, configuration seems correct." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/490395 (https://phabricator.wikimedia.org/T215106) (owner: 10Ammarpad) [19:48:10] !log mforns@deploy1001 Started deploy [analytics/refinery@5f1461e]: Deploying analytics refinery with refinery-source v0.0.85 jars [19:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:28] (03CR) 10jerkins-bot: [V: 04-1] Increase default thumb size to 260px on Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490395 (https://phabricator.wikimedia.org/T215106) (owner: 10Ammarpad) [19:48:48] "docker: Error response from daemon: failed to create endpoint mystifying_volhard on network bridge: failed to save bridge " [19:49:08] (in a jenkins-bot downvote) [19:52:41] jdlrobson: Just connecting to bouncer now, yes we can deploy it. [19:52:52] When do you think I should schedule it? [19:53:07] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/486185 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [19:53:43] note: CI is currently being worked on [19:53:47] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={GET,LIST} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:54:21] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:55:05] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={LIST,PATCH} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:55:16] roger [19:55:25] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:55:46] !log mforns@deploy1001 Finished deploy [analytics/refinery@5f1461e]: Deploying analytics refinery with refinery-source v0.0.85 jars (duration: 07m 36s) [19:55:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:44] jdlrobson: Morning SWAT tomorrow? [19:58:07] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10RobH) > 10:12 < cmjohnson1> : robh Dell approved everything....the disks for cloudvirts and the cpu for icinga1001 So we are on track. Chris can update this task with case i... [19:58:27] xSavitar: sounds like a plan! [19:58:41] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:58:48] Putting it on schedule then jdlrobson [19:59:17] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:59:59] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:00:04] thcipriani: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - Americas version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190213T2000). [20:00:21] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:00:50] (03CR) 10Krinkle: [C: 03+1] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490395 (https://phabricator.wikimedia.org/T215106) (owner: 10Ammarpad) [20:01:32] (03PS1) 10CRusnov: Add ganeti read-only user deployment [puppet] - 10https://gerrit.wikimedia.org/r/490397 [20:03:07] (03CR) 10jerkins-bot: [V: 04-1] Add ganeti read-only user deployment [puppet] - 10https://gerrit.wikimedia.org/r/490397 (owner: 10CRusnov) [20:03:10] * thcipriani train [20:03:54] 10Operations, 10Parsoid, 10Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10Dzahn) [20:04:15] 10Operations, 10Parsoid, 10Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10Dzahn) 05Open→03Resolved [20:04:52] (03PS1) 10Thcipriani: group1 wikis to 1.33.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490398 [20:04:57] (03CR) 10Thcipriani: [C: 03+2] group1 wikis to 1.33.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490398 (owner: 10Thcipriani) [20:05:45] RECOVERY - Disk space on cloudvirt1024 is OK: DISK OK [20:05:58] * apergos raises an eyebrow [20:06:11] (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490398 (owner: 10Thcipriani) [20:06:12] it's good that it's back [20:06:16] jdlrobson: On schedule, https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190214T1900 [20:08:52] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.17 [20:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:46] !log thcipriani@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.17 (duration: 00m 53s) [20:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:37] PROBLEM - HHVM jobrunner on 
mw1294 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.073 second response time [20:11:13] PROBLEM - Nginx local proxy to videoscaler on mw1294 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.151 second response time [20:11:51] RECOVERY - HHVM jobrunner on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 271 bytes in 0.093 second response time [20:12:27] RECOVERY - Nginx local proxy to videoscaler on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 288 bytes in 0.153 second response time [20:15:18] chuckles at "< wikibugs> Fundraising Sprint Casino Royale With Cheese," [20:15:30] what kind of cheese tho? [20:15:52] cashew ?:) [20:16:02] that surprised me how good that was [20:16:11] what's in that? [20:17:33] it's so called "vegan cheese", but it's a nut product [20:17:56] https://en.wikipedia.org/wiki/File:Vegan_Cheese_Happy_Cheese_Cashew_2.jpg [20:18:23] (03PS2) 10Ottomata: Add page-links-change event to EventStreams [puppet] - 10https://gerrit.wikimedia.org/r/489211 (https://phabricator.wikimedia.org/T214706) (owner: 10Bmansurov) [20:18:27] see ingredient list in image description. f.e. [20:18:27] cooking with mutante ? [20:18:32] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add page-links-change event to EventStreams [puppet] - 10https://gerrit.wikimedia.org/r/489211 (https://phabricator.wikimedia.org/T214706) (owner: 10Bmansurov) [20:19:13] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:20:29] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:20:31] PROBLEM - HHVM jobrunner on mw1305 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.073 second response time [20:20:57] PROBLEM - HHVM jobrunner on mw1335 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.077 second response time [20:21:01] PROBLEM - HHVM jobrunner on mw1295 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.073 second response time [20:21:15] PROBLEM - HHVM jobrunner on mw1296 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.073 second response time [20:21:17] PROBLEM - Nginx local proxy to jobrunner on mw1295 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.154 second response time [20:21:39] PROBLEM - Nginx local proxy to videoscaler on mw1335 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.151 second response time [20:21:45] RECOVERY - HHVM jobrunner on mw1305 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.083 second response time [20:22:05] PROBLEM - Nginx local proxy to videoscaler on mw1296 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.152 second response time [20:22:10] thcipriani, ? [20:22:11] PROBLEM - Nginx local proxy to jobrunner on mw1296 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.156 second response time [20:22:13] RECOVERY - HHVM jobrunner on mw1335 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.081 second response time [20:22:15] RECOVERY - HHVM jobrunner on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.079 second response time [20:22:33] RECOVERY - Nginx local proxy to jobrunner on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 288 bytes in 0.156 second response time [20:22:50] what's up? 
[20:22:55] RECOVERY - Nginx local proxy to videoscaler on mw1335 is OK: HTTP OK: HTTP/1.1 200 OK - 288 bytes in 0.156 second response time [20:22:59] Krenair: I'm not seeing errors to account for this. I think this is more of the hhvm flailing. [20:23:03] k [20:23:21] RECOVERY - Nginx local proxy to videoscaler on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 288 bytes in 0.157 second response time [20:23:27] RECOVERY - Nginx local proxy to jobrunner on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 288 bytes in 0.152 second response time [20:23:45] RECOVERY - HHVM jobrunner on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.088 second response time [20:29:20] (03PS1) 10Ottomata: service::docker s/directory_ensure/ensure_directory/ [puppet] - 10https://gerrit.wikimedia.org/r/490400 [20:29:37] (03PS12) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) [20:29:39] PROBLEM - HHVM jobrunner on mw1318 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.073 second response time [20:29:47] 10Operations, 10Release-Engineering-Team, 10Wikimedia-production-error: HHVM CPU usage when deploying MediaWiki - https://phabricator.wikimedia.org/T208549 (10thcipriani) [20:30:07] PROBLEM - Nginx local proxy to jobrunner on mw1318 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.153 second response time [20:30:53] RECOVERY - HHVM jobrunner on mw1318 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.079 second response time [20:31:06] 10Operations, 10Release-Engineering-Team, 10Wikimedia-production-error: HHVM CPU usage when deploying MediaWiki - https://phabricator.wikimedia.org/T208549 (10thcipriani) 05Resolved→03Open p:05Unbreak!→03High I see this still happening. Last night it happened when deploying l10n for an extension. Ha... 
[20:31:09] (03CR) 10Ottomata: [C: 03+2] service::docker s/directory_ensure/ensure_directory/ [puppet] - 10https://gerrit.wikimedia.org/r/490400 (owner: 10Ottomata) [20:31:21] RECOVERY - Nginx local proxy to jobrunner on mw1318 is OK: HTTP OK: HTTP/1.1 200 OK - 288 bytes in 0.159 second response time [20:31:47] (03CR) 10Dzahn: [V: 03+1 C: 03+1] admin: add Petar Petkovic to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/489775 (https://phabricator.wikimedia.org/T215575) (owner: 10Cwhite) [20:31:52] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T216004 (10colewhite) p:05Triage→03Normal [20:32:30] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T215892 (10colewhite) p:05Triage→03Normal [20:32:48] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "email address matching now" [puppet] - 10https://gerrit.wikimedia.org/r/489775 (https://phabricator.wikimedia.org/T215575) (owner: 10Cwhite) [20:33:09] (03CR) 10Cwhite: [C: 03+2] admin: add Petar Petkovic to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/489775 (https://phabricator.wikimedia.org/T215575) (owner: 10Cwhite) [20:33:16] (03PS2) 10Cwhite: admin: add Petar Petkovic to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/489775 (https://phabricator.wikimedia.org/T215575) [20:34:01] 10Operations, 10Parsoid, 10decommission, 10Patch-For-Review: decom ruthenium - https://phabricator.wikimedia.org/T216062 (10Peachey88) [20:35:15] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T215892 (10GTirloni) Yep, slot 0 and 3 are gone and need replacement. ` cloudvirt1024 / SAS Addr 0x500056b37c0f19c0 / Slot 0 / Model BTYS810309EP1P9DGNSSDSC2KB019T7R / Serial SCV1DL5... 
[20:35:53] ACKNOWLEDGEMENT - MegaRAID on cloudvirt1024 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T216068 [20:35:53] (03PS1) 10Herron: logstash: move role::ls::eventlogging to profile::logstash::collector [puppet] - 10https://gerrit.wikimedia.org/r/490401 (https://phabricator.wikimedia.org/T213898) [20:35:59] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T216068 (10ops-monitoring-bot) [20:36:45] (03CR) 10jerkins-bot: [V: 04-1] logstash: move role::ls::eventlogging to profile::logstash::collector [puppet] - 10https://gerrit.wikimedia.org/r/490401 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron) [20:37:18] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T216004 (10GTirloni) Looking in the RAID controller firmware logs, it seems we have consistent issues with all disks (which could point to a faulty controller, cable or enclosure). Wh... [20:37:43] (03PS2) 10Herron: logstash: move role::ls::eventlogging to profile::logstash::collector [puppet] - 10https://gerrit.wikimedia.org/r/490401 (https://phabricator.wikimedia.org/T213898) [20:39:59] (03PS2) 10Dzahn: start to decom ruthenium, turn into spare [puppet] - 10https://gerrit.wikimedia.org/r/490391 (https://phabricator.wikimedia.org/T216062) [20:40:23] PROBLEM - Host ms-be2021 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:25] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T215892 (10colewhite) [20:42:27] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T216068 (10colewhite) [20:42:46] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T216068 (10colewhite) Resolving as duplicate of parent. 
[20:43:02] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T216068 (10colewhite) 05Open→03Resolved p:05Triage→03Normal [20:43:03] !log ms-be2021 - powercycling [20:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:05] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T215892 (10colewhite) [20:44:21] !log otto@deploy1001 Started restart [eventstreams/deploy@07033d4]: bouncing eventstreams to apply page-links-change stream config [20:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:39] 10Operations, 10Traffic: cp3032 ethernet link down (bnx2x dump in the dmesg) - https://phabricator.wikimedia.org/T166758 (10Dzahn) This issue happened today on ms-be2021 {P8082}. 15:40 <+icinga-wm> PROBLEM - Host ms-be2021 is DOWN: PING CRITICAL - Packet loss = 100% 15:43 < mutante> !log ms-be2021 - powercy... [20:45:55] RECOVERY - Host ms-be2021 is UP: PING WARNING - Packet loss = 73%, RTA = 0.54 ms [20:48:02] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Access request: Ladsgroup to analytics-wmde-users - https://phabricator.wikimedia.org/T215938 (10colewhite) p:05Triage→03Normal a:03colewhite [20:50:00] 10Operations, 10SRE-Access-Requests: Access request: Ladsgroup to analytics-wmde-users - https://phabricator.wikimedia.org/T215938 (10colewhite) [20:50:35] 10Operations, 10Parsoid, 10decommission, 10Patch-For-Review: decom ruthenium - https://phabricator.wikimedia.org/T216062 (10Dzahn) Doing 'Active -> Staged' transition https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Active_-%3E_Staged Not sure if i should still copy/paste the check boxes from the temp... 
[20:51:25] (03CR) 10Dzahn: [C: 03+2] start to decom ruthenium, turn into spare [puppet] - 10https://gerrit.wikimedia.org/r/490391 (https://phabricator.wikimedia.org/T216062) (owner: 10Dzahn) [20:51:35] (03PS1) 10Cwhite: admin: add ladsgroup to analytics-wmde-users [puppet] - 10https://gerrit.wikimedia.org/r/490403 (https://phabricator.wikimedia.org/T215938) [20:53:32] (03PS3) 10Herron: logstash: move role::ls::eventlogging to profile::logstash::collector [puppet] - 10https://gerrit.wikimedia.org/r/490401 (https://phabricator.wikimedia.org/T213898) [20:54:11] !log ruthenium - shell access for parsoid-testers revoked by puppet, please use scandium.eqiad.wmnet (T201366) [20:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:14] T201366: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 [20:55:28] (03PS1) 10CDanis: partman: grub-install on all RAID{1,10} drives [puppet] - 10https://gerrit.wikimedia.org/r/490404 (https://phabricator.wikimedia.org/T215183) [20:56:07] (03PS1) 10Effie Mouzeli: thumbor: add support for debian stretch [puppet] - 10https://gerrit.wikimedia.org/r/490405 (https://phabricator.wikimedia.org/T214597) [20:56:11] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T216004 (10RobH) update from irc chat: Dell support site for this systems firmware updates: https://www.dell.com/support/home/us/en/04/product-support/servicetag/31s8kh2/drivers I'... [20:56:39] (03CR) 10jerkins-bot: [V: 04-1] thumbor: add support for debian stretch [puppet] - 10https://gerrit.wikimedia.org/r/490405 (https://phabricator.wikimedia.org/T214597) (owner: 10Effie Mouzeli) [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / …. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190213T2100). [21:00:58] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1001/14666/" [puppet] - 10https://gerrit.wikimedia.org/r/490401 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron) [21:01:31] (03PS1) 10Dzahn: admins/enforce-users-groups: remove exception for parsoid-rt user [puppet] - 10https://gerrit.wikimedia.org/r/490407 (https://phabricator.wikimedia.org/T216062) [21:02:18] (03PS2) 10Effie Mouzeli: thumbor: add support for debian stretch [puppet] - 10https://gerrit.wikimedia.org/r/490405 (https://phabricator.wikimedia.org/T214597) [21:02:32] (03PS1) 10Ottomata: Use $version when getting docker image in docker-service-shim.erb [puppet] - 10https://gerrit.wikimedia.org/r/490409 [21:02:47] (03CR) 10Dzahn: "[scandium:~] $ id parsoid-rt" [puppet] - 10https://gerrit.wikimedia.org/r/490407 (https://phabricator.wikimedia.org/T216062) (owner: 10Dzahn) [21:03:23] (03CR) 10jerkins-bot: [V: 04-1] thumbor: add support for debian stretch [puppet] - 10https://gerrit.wikimedia.org/r/490405 (https://phabricator.wikimedia.org/T214597) (owner: 10Effie Mouzeli) [21:04:08] (03CR) 10Effie Mouzeli: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/14669/thumbor1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/490405 (https://phabricator.wikimedia.org/T214597) (owner: 10Effie Mouzeli) [21:04:51] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:04:54] (03PS1) 10Dzahn: DHCP: turn ruthenium into stretch, rm jessie installer part [puppet] - 10https://gerrit.wikimedia.org/r/490411 (https://phabricator.wikimedia.org/T216062) [21:05:13] (03CR) 10Ottomata: [C: 03+2] Use $version when getting docker image in docker-service-shim.erb [puppet] - 10https://gerrit.wikimedia.org/r/490409 (owner: 10Ottomata) [21:05:46] (03PS3) 10Effie Mouzeli: thumbor: add support for debian stretch 
[puppet] - 10https://gerrit.wikimedia.org/r/490405 (https://phabricator.wikimedia.org/T214597) [21:06:03] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:07:52] (03PS1) 10Gehel: admin: reset Julia SSH key [puppet] - 10https://gerrit.wikimedia.org/r/490412 (https://phabricator.wikimedia.org/T215966) [21:09:06] (03CR) 10Dzahn: [C: 03+2] DHCP: turn ruthenium into stretch, rm jessie installer part [puppet] - 10https://gerrit.wikimedia.org/r/490411 (https://phabricator.wikimedia.org/T216062) (owner: 10Dzahn) [21:09:18] (03PS2) 10Dzahn: DHCP: turn ruthenium into stretch, rm jessie installer part [puppet] - 10https://gerrit.wikimedia.org/r/490411 (https://phabricator.wikimedia.org/T216062) [21:10:09] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Production Shell for julia.glen - https://phabricator.wikimedia.org/T215966 (10TJones) Thanks, @Gehel! >>! In T215966#4952468, @gerritbot wrote: > Change 490412 had a related patch set uploaded (by Gehel; owner: Gehel): > [operati... 
[21:10:17] (03CR) 10Tjones: [C: 03+1] admin: reset Julia SSH key [puppet] - 10https://gerrit.wikimedia.org/r/490412 (https://phabricator.wikimedia.org/T215966) (owner: 10Gehel) [21:10:35] (03CR) 10Gehel: [C: 03+2] admin: reset Julia SSH key [puppet] - 10https://gerrit.wikimedia.org/r/490412 (https://phabricator.wikimedia.org/T215966) (owner: 10Gehel) [21:11:38] (03CR) 10Effie Mouzeli: "https://puppet-compiler.wmflabs.org/compiler1001/14671/thumbor1001.eqiad.wmnet/ Expected changes in catalogue, I don't expect any actual " [puppet] - 10https://gerrit.wikimedia.org/r/490405 (https://phabricator.wikimedia.org/T214597) (owner: 10Effie Mouzeli) [21:11:52] (03PS3) 10Dzahn: DHCP: turn ruthenium into stretch, rm jessie installer part [puppet] - 10https://gerrit.wikimedia.org/r/490411 (https://phabricator.wikimedia.org/T216062) [21:16:05] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Production Shell for julia.glen - https://phabricator.wikimedia.org/T215966 (10Julia.glen) Connection established. Thank you. [21:19:00] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Production Shell for julia.glen - https://phabricator.wikimedia.org/T215966 (10TJones) Woo hoo! 
[21:23:19] (03CR) 10Effie Mouzeli: [C: 03+2] Revert "role::eqiad::scb: Switch rdb1006 to redis::misc::master" [puppet] - 10https://gerrit.wikimedia.org/r/486256 (owner: 10Effie Mouzeli) [21:23:36] (03PS2) 10Effie Mouzeli: Revert "role::eqiad::scb: Switch rdb1006 to redis::misc::master" [puppet] - 10https://gerrit.wikimedia.org/r/486256 [21:25:07] (03PS1) 10Herron: lists: enforce domain or ip literal HELO check [puppet] - 10https://gerrit.wikimedia.org/r/490416 (https://phabricator.wikimedia.org/T215251) [21:25:09] (03PS1) 10Herron: lists: drop connection if remote tries to send HELO [puppet] - 10https://gerrit.wikimedia.org/r/490417 (https://phabricator.wikimedia.org/T215251) [21:27:19] (03PS2) 10Herron: lists: drop connection if remote tries to send HELO [puppet] - 10https://gerrit.wikimedia.org/r/490417 (https://phabricator.wikimedia.org/T215251) [21:29:52] (03CR) 10Framawiki: Add new throttle rule for T215839 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489819 (https://phabricator.wikimedia.org/T215839) (owner: 10Zoranzoki21) [21:31:03] !log Restarting nutcracker on scb100*.eqiad.wmnet [21:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:21] (03PS1) 10Ottomata: Add EventBus multi endpoint configuration and add eventgate-analytics endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490418 (https://phabricator.wikimedia.org/T211247) [21:38:01] (03CR) 10Ottomata: "This change is safe to go out now, since it still defines the $wgEventServiceUrl, but we might want to wait until is 100% deployed." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490418 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [21:41:38] (03CR) 10jerkins-bot: [V: 04-1] Add EventBus multi endpoint configuration and add eventgate-analytics endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490418 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [21:46:46] (03CR) 10Effie Mouzeli: [C: 03+1] "This is more simple! :D" [puppet] - 10https://gerrit.wikimedia.org/r/479142 (https://phabricator.wikimedia.org/T211512) (owner: 10Dzahn) [21:49:08] PROBLEM - Check systemd state on cloudvirt1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:52:09] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/14672/ LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/479142 (https://phabricator.wikimedia.org/T211512) (owner: 10Dzahn) [21:53:23] (03CR) 10Effie Mouzeli: [C: 03+1] profile::redis::multidc: Remove trusty support [puppet] - 10https://gerrit.wikimedia.org/r/489732 (owner: 10Muehlenhoff) [21:57:52] PROBLEM - Check systemd state on cloudvirt1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:06:57] (03PS1) 10Herron: lists: add 5 second smtp banner delay [puppet] - 10https://gerrit.wikimedia.org/r/490481 [22:07:53] (03PS2) 10Herron: lists: add 5 second smtp banner delay [puppet] - 10https://gerrit.wikimedia.org/r/490481 (https://phabricator.wikimedia.org/T215251) [22:07:59] (03CR) 10jerkins-bot: [V: 04-1] lists: add 5 second smtp banner delay [puppet] - 10https://gerrit.wikimedia.org/r/490481 (https://phabricator.wikimedia.org/T215251) (owner: 10Herron) [22:09:21] (03PS5) 10Dzahn: mediawiki/scap: do not install sql scripts on canary appservers [puppet] - 10https://gerrit.wikimedia.org/r/479142 (https://phabricator.wikimedia.org/T211512) [22:16:17] (03CR) 10Dzahn: [C: 03+2] "thanks! 
i did another run because mw1233 doesn't seem to exist anymore. but also noop on mw1267. https://puppet-compiler.wmflabs.org/compi" [puppet] - 10https://gerrit.wikimedia.org/r/479142 (https://phabricator.wikimedia.org/T211512) (owner: 10Dzahn) [22:18:03] (03CR) 10Thcipriani: [C: 03+1] "nice!" (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/485499 (https://phabricator.wikimedia.org/T207703) (owner: 10Giuseppe Lavagetto) [22:20:46] 10Operations, 10serviceops, 10Patch-For-Review: "sql" command fails with "sh: 1: mysql: not found" on mwdebug1002 - https://phabricator.wikimedia.org/T211512 (10Dzahn) 05Open→03Resolved a:03Dzahn @Krinkle The change above has been merged today. This removed the non-working sql / sqldump scripts from ca... [22:21:09] 10Operations, 10Mail, 10WMF-Legal: Tracking down gary@ and redirecting it to trustandsafety@ - https://phabricator.wikimedia.org/T210464 (10jrbs) >>! In T210464#4812325, @bcampbell wrote: > @Jalexander it looks like you requested OIT to rename trustandsafety@ to tsops@ on 8/1/18. I can confirm that trustands... [22:25:50] 10Operations, 10Parsoid, 10decommission, 10Patch-For-Review: decom ruthenium - https://phabricator.wikimedia.org/T216062 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['ruthenium.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/... 
[22:50:27] 10Operations, 10Parsoid, 10decommission, 10Patch-For-Review: decom ruthenium - https://phabricator.wikimedia.org/T216062 (10Dzahn) [22:51:39] 10Operations, 10Parsoid, 10decommission, 10Patch-For-Review: decom ruthenium - https://phabricator.wikimedia.org/T216062 (10Dzahn) [22:53:53] 10Operations, 10DC-Ops, 10Parsoid, 10decommission: decom ruthenium - https://phabricator.wikimedia.org/T216062 (10Dzahn) [22:54:34] 10Operations, 10HHVM, 10Wikimedia-production-error: mw1338 hhvm claiming intermittently about TC - https://phabricator.wikimedia.org/T216084 (10thcipriani) [22:56:38] 10Operations, 10HHVM, 10Wikimedia-production-error: mw1338 hhvm complaining intermittently about TC - https://phabricator.wikimedia.org/T216084 (10thcipriani) [22:58:46] 10Operations, 10DC-Ops, 10Parsoid, 10decommission: decom ruthenium - https://phabricator.wikimedia.org/T216062 (10Dzahn) [22:59:00] 10Operations, 10DC-Ops, 10Parsoid, 10decommission: decom ruthenium - https://phabricator.wikimedia.org/T216062 (10Dzahn) [22:59:36] 10Operations, 10DC-Ops, 10Parsoid, 10decommission: decom ruthenium - https://phabricator.wikimedia.org/T216062 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ruthenium.eqiad.wmnet'] ` and were **ALL** successful. [23:00:04] Niharika and bd808: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikimania scholarships app deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190213T2300). [23:00:29] o/ [23:01:41] bd808: I think you'll remember this better than I do. Is this still accurate? https://wikitech.wikimedia.org/wiki/Scholarships.wikimedia.org#How_is_it_deployed? [23:02:14] Niharika: maybe... I'm wrapping up a call [23:02:34] bd808: Okay, I'll poke around meanwhile. [23:03:03] Niharika: the server names are still accurate at least [23:03:19] mutante: Good to know! [23:03:38] Niharika: yeah, that should be right. 
We should do the db migration first though [23:04:07] i mean that it's installed on krypton and you can deploy from deploy1001. did not check the db server name [23:04:47] bd808: Is the process for that same/similar to what we do on vagrant? [23:05:33] (03CR) 10Cwhite: [C: 03+2] prometheus: do not change trusty hosts [puppet] - 10https://gerrit.wikimedia.org/r/490203 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [23:05:48] (03PS3) 10Cwhite: prometheus: do not change trusty hosts [puppet] - 10https://gerrit.wikimedia.org/r/490203 (https://phabricator.wikimedia.org/T213708) [23:07:20] Niharika: I have generally run them manually. I can do that again in ... ~5 minutes [23:07:56] bd808: Or tell me how to! We can wait too. I'm not sure I have rights to access db1048.eqiad.wmnet [23:10:14] PROBLEM - Check systemd state on cloudvirt1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:12:37] (03PS1) 10EBernhardson: Re-apply defaults removed in cirrus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490496 [23:13:14] (03PS2) 10Cwhite: hiera: upgrade prometheus-node-exporter to 0.17 in labs [puppet] - 10https://gerrit.wikimedia.org/r/489753 (https://phabricator.wikimedia.org/T213708) [23:14:31] (03PS1) 10Dzahn: netboot/partman/DCHP: remove ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/490497 (https://phabricator.wikimedia.org/T216062) [23:14:36] (03CR) 10EBernhardson: [C: 04-2] "-2 to not merge until elastic 6 is being deployed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490496 (owner: 10EBernhardson) [23:17:57] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Production Shell for julia.glen - https://phabricator.wikimedia.org/T215966 (10Gehel) 05Open→03Resolved a:03Gehel [23:18:21] Niharika: ok! I'm off my interview call. let's do this! [23:18:35] \o/ [23:18:44] I hope you got a chance to have lunch. 
[23:18:47] should we just join the hangout from the cal invite you sent me so I can screen share what I'm doing? [23:19:07] bd808: Sounds good. [23:20:16] (03CR) 10Cwhite: "With I1a7a22e183cbe5877f16013608768a027a2685e2 merged, any labs hosts running trusty should ignore the update." [puppet] - 10https://gerrit.wikimedia.org/r/489753 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [23:20:51] (03CR) 10Dzahn: [C: 03+2] netboot/partman/DCHP: remove ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/490497 (https://phabricator.wikimedia.org/T216062) (owner: 10Dzahn) [23:24:08] 10Operations, 10DC-Ops, 10Parsoid, 10decommission, 10Patch-For-Review: decom ruthenium - https://phabricator.wikimedia.org/T216062 (10Dzahn) a:05Dzahn→03RobH [23:26:57] (03CR) 10Dzahn: [C: 03+2] DNS: Remove mgmt DNS for baham [dns] - 10https://gerrit.wikimedia.org/r/490102 (https://phabricator.wikimedia.org/T199247) (owner: 10Papaul) [23:28:24] 10Operations: Mapping of servers to stakeholders - https://phabricator.wikimedia.org/T216088 (10colewhite) p:05Triage→03Normal [23:31:30] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata for esanders - https://phabricator.wikimedia.org/T215830 (10colewhite) p:05Triage→03Normal a:03colewhite [23:32:52] (03PS1) 10Cwhite: admin: add esanders to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/490500 (https://phabricator.wikimedia.org/T215830) [23:33:18] PROBLEM - Long running screen/tmux on snapshot1005 is CRITICAL: CRIT: Long running SCREEN process. (user: ariel PID: 57515, 1735415s 1728000s). [23:36:11] 10Operations, 10monitoring, 10Goal, 10Patch-For-Review: Upgrade production prometheus-node-exporter to >= 0.16 - https://phabricator.wikimedia.org/T213708 (10colewhite) a:03colewhite [23:40:04] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "i confirm Amir works as engineer for WMDE. 
Raz is he engineering manager per https://wikimedia.de/de/menschen/mitarbeitende addshore is a" [puppet] - 10https://gerrit.wikimedia.org/r/490403 (https://phabricator.wikimedia.org/T215938) (owner: 10Cwhite) [23:42:29] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "approved by Nuria, existing account, ticket looks good" [puppet] - 10https://gerrit.wikimedia.org/r/490500 (https://phabricator.wikimedia.org/T215830) (owner: 10Cwhite) [23:42:32] !log niharika29@deploy1001 Started deploy [scholarships/scholarships@1d89fe2]: Update scholarships app for 2019 cycle T215302 [23:42:35] !log niharika29@deploy1001 Finished deploy [scholarships/scholarships@1d89fe2]: Update scholarships app for 2019 cycle T215302 (duration: 00m 02s) [23:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:37] T215302: Website Revamp - https://phabricator.wikimedia.org/T215302 [23:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:15] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10Jrbranaa) > In any case, and with the risk of repeating myself, the service is under a code stewardship r... [23:55:28] mutante: are you around? Niharika and I need a bit of help with log files [23:55:51] bd808: yes, is it about krypton? [23:56:11] yeah, we need you to check the apache2 error log there for the scholarships vhost [23:56:38] we are getting a 500 error and no log data in logstash or on mwlog1001 :? [23:57:02] PHP Fatal error: Class 'PHPMailer' not found [23:57:19] ok! thanks [23:57:28] vendor/wikimedia/slimapp/src/Mailer.php on line 100, [23:58:09] (03CR) 10CRusnov: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/490416 (https://phabricator.wikimedia.org/T215251) (owner: 10Herron) [23:58:25] you are right that we should send these to logstash [23:58:57] as "misc" though, so wouldn't be mwlog