[00:00:04] twentyafterfour: (Dis)respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180809T0000). Please do the needful. [00:02:52] !log deploying phabricator release/2018-08-08/1 [00:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:05:14] !log finished phabricator upgrade [00:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:38] Why does PHP7 seem slower under quibble than hhvm? [00:14:15] !log reedy@deploy1001 Synchronized php-1.32.0-wmf.16/extensions/VisualEditor/: Bug fix (duration: 00m 58s) [00:14:18] MatmaRex: Done [00:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:26] Reedy: thanks! works as expected [00:15:30] on mw.org [00:16:07] cool [00:20:50] (03PS1) 10Dzahn: netbox: don't hardcode db_master, use active_server from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/451550 [00:23:15] (03PS2) 10Dzahn: netbox: don't hardcode db_master, use active_server from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/451550 [00:23:29] (03PS3) 10Dzahn: netbox: don't hardcode db_master, use active_server from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/451550 [00:25:04] (03CR) 10Dzahn: [C: 032] "noop for prod but let's me use it on VPS: http://puppet-compiler.wmflabs.org/12024/" [puppet] - 10https://gerrit.wikimedia.org/r/451550 (owner: 10Dzahn) [00:30:40] (03PS2) 10Dzahn: postgres::master: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/450316 [00:32:15] (03CR) 10Dzahn: "only labsdb1004 is using this. it's not a generic role for any postgres master because it has "labsadmin@labs" user hardcoded." [puppet] - 10https://gerrit.wikimedia.org/r/450316 (owner: 10Dzahn) [00:33:40] (03CR) 10Dzahn: [C: 032] postgres::master: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/450316 (owner: 10Dzahn) [00:35:39] (03CR) 10Dzahn: [C: 032] "noop on labsdb1004. these never do anything, it's still the same class" [puppet] - 10https://gerrit.wikimedia.org/r/450316 (owner: 10Dzahn) [00:38:00] (03CR) 10jenkins-bot: Add correct sitename for satwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450469 (https://phabricator.wikimedia.org/T198400) (owner: 10Urbanecm) [00:41:13] 10Operations, 10ops-eqiad, 10Patch-For-Review: bast1002 - hardware (memory) issue - https://phabricator.wikimedia.org/T201355 (10Dzahn) I see "racadm getsel" has been cleared and does not show an error anymore. I will re-add bast1002 to smokeping (https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/450870... [00:41:22] (03PS2) 10Dzahn: Revert "smokeping: comment out broken bast1002" [puppet] - 10https://gerrit.wikimedia.org/r/450870 [00:50:19] (03CR) 10Dzahn: [C: 032] "it has been up again for a while, RAM module had been moved to other slot" [puppet] - 10https://gerrit.wikimedia.org/r/450870 (owner: 10Dzahn) [00:51:50] (03CR) 10Dzahn: [C: 04-1] "shinken is using this and still on trusty" [puppet] - 10https://gerrit.wikimedia.org/r/448770 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [00:53:11] twentyafterfour: MatmaRex keeps being blocked on phab. 
But he's in Trusted Contributors [00:53:22] well, twice so far [00:53:46] i filed https://phabricator.wikimedia.org/T201573 (that got me banned), Reedy unbanned me, and then i commented on https://phabricator.wikimedia.org/T201472 (that got me banned again) [00:54:02] also, phabricator did not send me email for these two actions (i have it configured to send mails for my own actions). does it not send email to disabled accounts? [00:54:59] ooh, i've got a notification [00:55:03] "matmarex triggered vandalism countermeasures (Account Disabled) by editing T201573: Generalize logic for inserting a block level element into an empty paragraph (it should replace that paragraph, not insert before it)." [00:55:06] T201573: Generalize logic for inserting a block level element into an empty paragraph (it should replace that paragraph, not insert before it) - https://phabricator.wikimedia.org/T201573 [00:55:06] "matmarex triggered vandalism countermeasures (Account Disabled) by editing T201472: List insertion by typing '#', '*' is broken." [00:55:07] T201472: List insertion by typing '#', '*' is broken - https://phabricator.wikimedia.org/T201472 [00:55:25] that's kind of neat actually [00:56:07] hmm. did i mention too many tasks/users? that seems like it might do it [00:57:56] 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rename/reimage labnodepool1002.eqiad.wmnet as cloudservices1003.wikimedia.org - https://phabricator.wikimedia.org/T201439 (10Dzahn) There are Icinga alerts for DNS and Gridmaster: https://icinga.wikimedia.org/cgi-bin/icinga/sta... [00:58:53] ACKNOWLEDGEMENT - Auth DNS on cloudservices1003 is CRITICAL: CRITICAL - Plugin timed out while executing system call daniel_zahn https://phabricator.wikimedia.org/T201439#4486912 [00:58:53] ACKNOWLEDGEMENT - Check for gridmaster host resolution TCP on cloudservices1003 is CRITICAL: DNS CRITICAL - 0.014 seconds response time (No ANSWER SECTION found) daniel_zahn https://phabricator.wikimedia.org/T201439#4486912 [00:58:53] ACKNOWLEDGEMENT - Check for gridmaster host resolution UDP on cloudservices1003 is CRITICAL: CRITICAL - Plugin timed out while executing system call daniel_zahn https://phabricator.wikimedia.org/T201439#4486912 [00:59:13] ACKNOWLEDGEMENT - puppet last run on cloudservices1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[prometheus-pdns-exporter] daniel_zahn https://phabricator.wikimedia.org/T201473 [01:03:58] ACKNOWLEDGEMENT - Check systemd state on cloudservices1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T201439 [01:10:30] 10Operations, 10monitoring: "ensure legal html" footer monitoring turned CRIT - https://phabricator.wikimedia.org/T119456 (10Dzahn) There are 2 of them CRIT again since a while. also see T108081#4490351 Not sure if i should reopen this. [01:12:33] ACKNOWLEDGEMENT - Ensure legal html en.wp on en.wikipedia.org is CRITICAL: additional\sterms\smay\sapply\. By\susing\sthis\ssite,\syou\sagree\sto\sthe a\shref=(https:)?\/\/foundation\.wikimedia\.org\/wiki\/Terms_of_UseTerms\sof\sUse/a html not found daniel_zahn https://phabricator.wikimedia.org/T108081#4490351 [01:12:53] ACKNOWLEDGEMENT - Ensure legal html en.wb on en.wikibooks.org is CRITICAL: additional\sterms\smay\sapply\. 
By\susing\sthis\ssite,\syou\sagree\sto\sthe a\shref=(https:)?\/\/foundation\.wikimedia\.org\/wiki\/Terms_of_UseTerms\sof\sUse/a html not found daniel_zahn https://phabricator.wikimedia.org/T108081#4490351 [01:29:35] 10Operations, 10Scap: Wrong umask when deploying from screen - https://phabricator.wikimedia.org/T200690 (10Dzahn) a:05Dzahn>03None [01:39:24] !log baham - installing BIOS upgrade (2.4.2 for Dell R320) - server is on role(spare) and the last that did not get the upgrade on T162850 [01:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:39:34] T162850: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850 [01:44:35] 10Operations, 10Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850 (10Dzahn) [01:46:28] 10Operations: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850 (10Dzahn) [01:51:41] 10Operations: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850 (10Dzahn) 05Open>03Resolved baham is done: Installed version: 2.4.2 This resolves the ticket. [01:53:31] 10Operations: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850 (10Dzahn) only 6 R320s left today: acamar.wikimedia.org,achernar.wikimedia.org,baham.wikimedia.org,bast2001.wikimedia.org,heze.codfw.wmnet,labservices1002.wikimedia.org [01:55:42] MatmaRex: Reedy: I've raised the filter threshold. I don't know why the exemption for being in trusted-contributors didn't work [01:56:07] twentyafterfour: can you unban me again? i got disabled again after leaving another comment [01:56:19] (Reedy undisabled me twice, but he seems afk now) [01:57:56] twentyafterfour: also, the experience is pretty awful, if someone was actually a new contributor getting accidentally caught in the filter. it just logs you out with no notification or error message :/ [01:59:51] twentyafterfour: also, my third ban was apparently for this comment: https://phabricator.wikimedia.org/T201573#4490368 which can't possibly contain anything untoward, being just plain text. am i going to get banned every time i comment on a task now? [02:00:12] (does it remember somewhere that i should be banned and reapply it when i comment?) [02:00:43] (thanks for reenabling my account) [02:01:44] MatmaRex: I unbanned you [02:02:35] MatmaRex: I'm going to disable the filter, something must be wrong it should have exactly zero false positives [02:02:48] (it's supposed to require rapid editing to trigger it) [02:03:35] twentyafterfour: let me know if you need some testing from me or something. i appreciate the work on countervandalism stuff [02:03:36] also agreed that it's not very friendly, I intend to make it at least give an error message [02:04:31] twentyafterfour: also, on a positive note, i just noticed today that searching for tasks in a given project now also finds tasks in its subprojects, and it definitely used to not find them. great bugfix/feature! [02:05:46] doh! I found the bug... 
[02:06:09] MatmaRex: awesome, yeah that was an upstream fix recently, I think [02:06:36] !log restarted apache on phab1001 to hotfix antivandalism bug [02:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:38:06] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.15) (duration: 16m 05s) [02:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:48:26] PROBLEM - Apache HTTP on mw2252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:49:16] RECOVERY - Apache HTTP on mw2252 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.114 second response time [03:03:16] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:04:26] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [03:05:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [03:05:35] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [03:07:35] RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [03:10:16] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:12:58] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.16) (duration: 15m 23s) [03:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:14:35] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [03:15:35] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [03:23:32] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Thu Aug 9 03:23:32 UTC 2018 (duration 10m 34s) [03:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:26:26] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 851.99 seconds [03:39:35] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 207.85 seconds [03:48:56] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. [04:03:05] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. [04:22:05] PROBLEM - Check Varnish expiry mailbox lag on cp3032 is CRITICAL: CRITICAL: expiry mailbox lag is 385841639 [04:55:06] 10Operations, 10ops-eqiad, 10DBA: Disk #9 with errors on db1068 (s4 master) - https://phabricator.wikimedia.org/T201493 (10Marostegui) 05Open>03Resolved All good now thank you! ``` Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Primary-1, Secondary-... 
[04:58:52] (03PS1) 10Marostegui: db-codfw.php: Repool pc2004 and pc2005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451567 (https://phabricator.wikimedia.org/T201387) [05:00:37] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool pc2004 and pc2005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451567 (https://phabricator.wikimedia.org/T201387) (owner: 10Marostegui) [05:01:50] (03Merged) 10jenkins-bot: db-codfw.php: Repool pc2004 and pc2005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451567 (https://phabricator.wikimedia.org/T201387) (owner: 10Marostegui) [05:02:50] (03CR) 10jenkins-bot: db-codfw.php: Repool pc2004 and pc2005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451567 (https://phabricator.wikimedia.org/T201387) (owner: 10Marostegui) [05:03:11] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool pc2004 and pc2005 after BIOS upgrade - T201387 (duration: 01m 00s) [05:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:18] T201387: Upgrade pc2004 and pc2005 BIOS - https://phabricator.wikimedia.org/T201387 [05:21:16] PROBLEM - HHVM jobrunner on mw1303 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [05:22:16] RECOVERY - HHVM jobrunner on mw1303 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.003 second response time [05:44:17] marostegui: OK to deploy cxserver? Any deployment going on? [05:47:21] OK. I'll go ahead :) [05:48:39] !log kartik@deploy1001 Started deploy [cxserver/deploy@957ff6a]: Update cxserver to 27813b6 (T201085) [05:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:48:46] T201085: CX2: Images missing in source article - https://phabricator.wikimedia.org/T201085 [05:52:39] !log kartik@deploy1001 Finished deploy [cxserver/deploy@957ff6a]: Update cxserver to 27813b6 (T201085) (duration: 03m 59s) [05:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:56] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/enable-puppet] [06:29:41] 10Puppet: Suspicious Comments in Puppet Scripts - https://phabricator.wikimedia.org/T201576 (10Aklapper) [06:30:06] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ImageMagick-6/policy.xml] [06:44:15] (03CR) 10Muehlenhoff: [C: 031] "Nice, looks good to me." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/446887 (https://phabricator.wikimedia.org/T198649) (owner: 10Volans) [06:55:15] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [06:57:56] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:14:05] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2007.codfw.wmnet', 'elastic2008... 
[07:18:38] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2007.codfw.wmnet'] ``` The log... [07:19:25] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:19:49] 10Operations, 10ops-eqiad, 10Operations-Software-Development: rack/setup/install clustermgmt1001.eqiad.wmnet (new cumin master) - https://phabricator.wikimedia.org/T201346 (10MoritzMuehlenhoff) Let's simply call these cumin*, it's the primary service offered and the other bits on these hosts (like debdeploy)... [07:20:39] 10Operations, 10ops-eqiad, 10Operations-Software-Development: rack/setup/install clustermgmt1001.eqiad.wmnet (new cumin master) - https://phabricator.wikimedia.org/T201346 (10MoritzMuehlenhoff) >>! In T201346#4483500, @Volans wrote: > I think there was an agreement to install this a Stretch and perform this... [07:21:00] looking at esams_text ^ [07:21:10] PROBLEM - ElasticSearch health check for shards on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch http://10.2.1.30:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.2.1.30, port=9200): Read timed out. (read timeout=4) [07:21:49] gehel: you around? ^^^ [07:22:06] looking [07:22:12] RECOVERY - ElasticSearch health check for shards on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-codfw: status: yellow, number_of_nodes: 33, unassigned_shards: 374, number_of_pending_tasks: 33, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3144, task_max_waiting_in_queue_millis: 63983, cluster_name: production-search-codfw, relocating_shards: 0, active_shards_percent_as_numb [07:22:12] ctive_shards: 8989, initializing_shards: 62, number_of_data_nodes: 33, delayed_unassigned_shards: 271 [07:22:45] reimaging in progress [07:22:55] but those hosts were depool (checking) [07:23:45] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:25:04] yep they were depooled, why did we get a timeout? Probably cluster operation taking too long, T193654 should help [07:25:05] T193654: [epic] Run multiple elasticsearch clusters on same hardware - https://phabricator.wikimedia.org/T193654 [07:26:38] !log cp3032 mbox lagged: reboot w/ numa_networking [07:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:49] gehel: did you check the depooled state also on the LVSes? [07:29:07] just to exclude was not an issue with pybal not depooling it [07:41:29] volans: nope, only with confctl [07:41:48] volans: btw, I got your ping before I got the SMS, quite efficient! [07:41:55] lol [07:42:09] RECOVERY - Check Varnish expiry mailbox lag on cp3032 is OK: OK: expiry mailbox lag is 0 [07:42:21] volans: how do you check the state on the LVSes themselves? [07:42:59] 10Operations, 10ops-eqiad, 10monitoring: rack/setup/install monitor1001.wikimedia.org - https://phabricator.wikimedia.org/T201344 (10MoritzMuehlenhoff) icinga1001 seems fine. 
If we switch to a new alerting tool we'll certainly run it under a different name in parallel to Icinga for quite a while anyway. [07:43:55] gehel: ipvsadm -Ln ? [07:44:12] volans: ok, checking, thanks [07:44:17] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2007.codfw.wmnet'] ``` and were **ALL** successful. [07:51:42] (03PS1) 10Gehel: elasticsearch: shards check should not page. [puppet] - 10https://gerrit.wikimedia.org/r/451583 [07:54:31] (03PS1) 10Ema: Route 10 cache_misc sites to cache_text [dns] - 10https://gerrit.wikimedia.org/r/451585 (https://phabricator.wikimedia.org/T164609) [08:03:06] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2008.codfw.wmnet'] ``` The log... [08:03:22] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2008.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['elastic2008.codfw.wmnet... [08:03:51] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2008.codfw.wmnet'] ``` The log... [08:11:17] (03CR) 10Volans: [C: 031] "LGTM, although I didn't tested them modifying my /etc/hosts, but I checked the directors in puppet." [dns] - 10https://gerrit.wikimedia.org/r/451585 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [08:12:36] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. [08:13:37] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. 
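The depool check discussed above has two halves: conftool/etcd records what *should* be pooled, while the LVS host shows what pybal has actually programmed into the kernel. A minimal sketch of both checks, assuming the usual confctl selector syntax and using elastic2007 / the search.svc.codfw.wmnet VIP (10.2.1.30:9200 from the alert above) purely as an illustration:

```
# Config side: what does conftool think the pooled state is?
sudo confctl select 'name=elastic2007.codfw.wmnet' get

# LVS side: did pybal actually remove the realserver from the VIP?
# (run on the relevant lvs host; the depooled host's IP should be gone from the list)
sudo ipvsadm -Ln | grep -A 40 '10.2.1.30:9200'
```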
[08:21:46] (03PS12) 10Ema: trafficserver: initial module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/450204 (https://phabricator.wikimedia.org/T200178) [08:22:40] 10Operations, 10MediaWiki-extensions-Translate, 10Language-2018-July-September, 10Patch-For-Review, and 4 others: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293 (10Arrbee) a:03Nikerabbit [08:22:45] (03CR) 10Ema: [C: 032] trafficserver: initial module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/450204 (https://phabricator.wikimedia.org/T200178) (owner: 10Ema) [08:24:09] (03CR) 10Ema: [C: 032] Route 10 cache_misc sites to cache_text [dns] - 10https://gerrit.wikimedia.org/r/451585 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [08:28:00] 10Operations, 10Traffic, 10Patch-For-Review: Traffic Server packaging and initial puppetization - https://phabricator.wikimedia.org/T200178 (10ema) 05Open>03Resolved [08:28:03] 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ema) [08:29:31] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2008.codfw.wmnet'] ``` and were **ALL** successful. [08:36:01] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2010.codfw.wmnet', 'elastic2011... [08:37:41] !log rebooting bast2002 for kernel security update [08:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:35] (03PS12) 10Gehel: Add mjolnir kafka daemon to primary elasticsearch clusters [puppet] - 10https://gerrit.wikimedia.org/r/445254 (https://phabricator.wikimedia.org/T198490) (owner: 10EBernhardson) [08:45:52] (03PS1) 10Ema: Deploy ATS backend on cp2003 [puppet] - 10https://gerrit.wikimedia.org/r/451593 (https://phabricator.wikimedia.org/T199720) [08:50:35] \o/ [08:55:34] :) [08:55:34] (03PS2) 10Ema: Deploy ATS backend on cp2003 [puppet] - 10https://gerrit.wikimedia.org/r/451593 (https://phabricator.wikimedia.org/T199720) [08:57:00] (03CR) 10Ema: [C: 032] Deploy ATS backend on cp2003 [puppet] - 10https://gerrit.wikimedia.org/r/451593 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [08:57:51] !log rebooting bast2001 for kernel security update [08:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:23] (03PS1) 10Urbanecm: Transwiki import in zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451596 (https://phabricator.wikimedia.org/T201328) [08:59:37] 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ema) [08:59:47] 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ema) [09:02:03] !log rebooting bast4002 for kernel security update [09:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:46] 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 
(10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp2003.codfw.wmnet'] ``` The log can be found in `/var/log/wmf-aut... [09:03:13] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2011.codfw.wmnet', 'elastic2012.codfw.wmnet', 'elastic2010.codfw.wmnet'] ``` an... [09:06:47] (03PS4) 10Volans: Add cookbook entry point script [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) [09:07:38] (03CR) 10Volans: "Thanks a lot for the review! Most fixes uploaded, see replies inline." (0312 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:07:41] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational [09:08:09] 10Operations, 10ops-eqiad, 10Traffic, 10decommission, 10Patch-For-Review: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10Vgutierrez) [09:08:47] !log rebooting bast5001 for kernel security update [09:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:01] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. [09:11:07] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Vgutierrez) @Papaul in lvs2009 on board NICs need to be disabled in the BIOS (in lvs2010 they're already disabled): ```name=lvs2009 root@lvs2009:~# dmesg |grep tg3 [ 2... [09:23:32] !log reset radadm on iron, com2 was unresponsive [09:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:02] !log rebooting iron for kernel security update [09:24:03] *racadm [09:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:36] 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2003.codfw.wmnet'] ``` and were **ALL** successful. [09:24:51] 10Operations, 10ops-codfw, 10Traffic, 10decommission: Decommission acamar and achernar - https://phabricator.wikimedia.org/T198286 (10Vgutierrez) [09:25:11] volans: fixed in SAL [09:25:37] moritzm: there was no need, I was mosty trolling ;) [09:26:00] I'm easily trolled :-) [09:27:40] RECOVERY - Check systemd state on cloudcontrol1004 is OK: OK - running: The system is fully operational [09:29:20] (03PS1) 10Muehlenhoff: Enable microcode for bastions [puppet] - 10https://gerrit.wikimedia.org/r/451601 [09:30:16] (03PS2) 10Muehlenhoff: Enable microcode for bastions [puppet] - 10https://gerrit.wikimedia.org/r/451601 [09:38:36] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. [09:38:38] Is the EU MW train happening today? 
[09:38:41] (03CR) 10Muehlenhoff: [C: 032] Enable microcode for bastions [puppet] - 10https://gerrit.wikimedia.org/r/451601 (owner: 10Muehlenhoff) [09:39:07] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Vgutierrez) @ayounsi interface naming in lvs2009 and lvs2010: |current name| lvs2009|lvs2010| |nic1|enp59s0f0|enp59s0f0| |nic2|enp59s0f1d1|enp59s0f1d1| |nic3|enp175s0f0|e... [09:42:10] !log rebooting pollux for kernel security update [09:42:12] Meh, that means I need to cut my lunch short :S [09:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:43] 10Operations, 10Core-Platform-Team, 10Performance-Team, 10TechCom-RFC, and 4 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10mobrovac) [09:43:35] 10Operations, 10Core-Platform-Team, 10Performance-Team, 10TechCom-RFC, and 4 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10mobrovac) >>! In T201409#4489746, @Catrope wrote: > In addition to that, do we have a task for "every service should... [09:50:42] !log mobrovac@deploy1001 Started deploy [restbase/deploy@cb6b4b4]: Drop mobile-sections, feed and media end points from non-WPs - T201103 [09:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:48] T201103: Reconsider use of RESTBase k-r-v storage for mobileapps - https://phabricator.wikimedia.org/T201103 [09:50:53] (03PS5) 10Volans: Add cookbook entry point script [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) [09:50:58] (03CR) 10Volans: "See inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:54:27] PROBLEM - Restbase root url on restbase2001 is CRITICAL: connect to address 10.192.16.152 and port 7231: Connection refused [09:55:02] (03PS6) 10Volans: Add cookbook entry point script [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) [09:56:55] !log mobrovac@deploy1001 deploy aborted: Drop mobile-sections, feed and media end points from non-WPs - T201103 (duration: 06m 14s) [09:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:02] T201103: Reconsider use of RESTBase k-r-v storage for mobileapps - https://phabricator.wikimedia.org/T201103 [09:57:40] !log rebooting dubnium for kernel security update [09:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:44] !log mobrovac@deploy1001 Started deploy [restbase/deploy@cb6b4b4]: Drop mobile-sections, feed and media end points from non-WPs - T201103 [09:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:10] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@cb6b4b4]: Drop mobile-sections, feed and media end points from non-WPs - T201103 (duration: 04m 26s) [10:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:17] T201103: Reconsider use of RESTBase k-r-v storage for mobileapps - https://phabricator.wikimedia.org/T201103 [10:03:53] !log rebooting url downloaders for kernel security update (alsafi, actinium, aluminium, alcyone) [10:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:56] !log mobrovac@deploy1001 
Started deploy [restbase/deploy@ece750a]: Drop mobile-sections, feed and media end points from non-WPs, take #2 - T201103 [10:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:47] RECOVERY - Restbase root url on restbase2001 is OK: HTTP OK: HTTP/1.1 200 - 16052 bytes in 0.122 second response time [10:08:54] 10Operations, 10Core-Platform-Team, 10Performance-Team, 10TechCom-RFC, and 4 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10mobrovac) >>! In T201409#4484792, @Imarlier wrote: > I've been investigating the use of an [[ http://opentracing.io/... [10:14:34] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@ece750a]: Drop mobile-sections, feed and media end points from non-WPs, take #2 - T201103 (duration: 08m 38s) [10:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:43] T201103: Reconsider use of RESTBase k-r-v storage for mobileapps - https://phabricator.wikimedia.org/T201103 [10:14:46] !log mobrovac@deploy1001 Started deploy [restbase/deploy@ece750a]: Drop mobile-sections, feed and media end points from non-WPs, take #3 - T201103 [10:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:37] PROBLEM - Check systemd state on db2069 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:17:56] PROBLEM - MariaDB Slave SQL: x1 on db2069 is CRITICAL: CRITICAL slave_sql_state could not connect [10:18:30] PROBLEM - mysqld processes on db2069 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [10:18:31] PROBLEM - MariaDB Slave IO: x1 on db2069 is CRITICAL: CRITICAL slave_io_state could not connect [10:18:35] PROBLEM - MariaDB disk space on db2069 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error [10:18:35] PROBLEM - Disk space on db2069 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error [10:18:55] * volans looking [10:18:56] mmmmm [10:18:57] I believe that is a storage crash [10:19:02] it probably crashed [10:19:03] yes [10:19:06] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@ece750a]: Drop mobile-sections, feed and media end points from non-WPs, take #3 - T201103 (duration: 04m 20s) [10:19:08] Input/output error [10:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:13] !log mobrovac@deploy1001 Started deploy [restbase/deploy@ece750a]: Drop mobile-sections, feed and media end points from non-WPs, take #4 - T201103 [10:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:21] ok, I'll leave it to you then ;) [10:19:21] that was lucky it was codfw [10:19:42] oh, it is a replica [10:19:44] so no rush [10:19:49] I thought it was the master [10:20:18] marostegui: I can take care of it manuel [10:20:22] don't worruy [10:20:35] jynus: thanks I was about to leave :) [10:20:40] I will wait for the raid to be back [10:20:45] and otherwise restart it [10:21:02] it is not a strange issue precisely with this provider's boxes [10:21:42] !log handling db2069 crash, no user impact [10:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:11] is it a "new" batch? 
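When /srv throws Input/output error while / still seems readable (possibly only from page cache, as speculated a little further down), the usual first moves are the kernel log and the RAID controller status. A rough sketch, assuming an HP Smart Array controller with hpssacli installed (the "Drive Array Controller Failure" message quoted later points that way); Dell hosts would use megacli/perccli instead:

```
# Kernel log: I/O errors or the controller driver dropping the device
dmesg -T | grep -iE 'i/o error|hpsa|aborting' | tail -n 20

# Is /srv still mounted, and if so, read-only or gone entirely?
mount | grep /srv

# Controller and logical drive health (HP Smart Array)
sudo hpssacli ctrl all show status
sudo hpssacli ctrl slot=0 ld all show status
```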
[10:22:42] thank you for showing up, volans, BTW [10:23:14] np [10:23:43] (03CR) 10Gehel: Add cookbook entry point script (036 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [10:24:02] this is strange, / seems to be accessible, but not /srv [10:24:12] despite being the same storage medium [10:24:43] or maybe it is just the memory cache [10:25:16] PROBLEM - MariaDB Slave Lag: x1 on db2069 is CRITICAL: CRITICAL slave_sql_lag could not connect [10:25:44] !log disabling alerts for db2069 [10:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:59] let's start with what is more urgent [10:26:05] depooling [10:27:49] I think we are a bit low on codfw resources, one x1 with broken BBU, the other with storage issues [10:28:56] 10Operations, 10ops-codfw, 10DBA: db2069 storage crash - https://phabricator.wikimedia.org/T201603 (10jcrespo) [10:29:18] (03PS4) 10Jcrespo: Revert "mariadb: Depool db1100 and db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451268 [10:29:20] (03PS1) 10Jcrespo: mariadb: Depool db2069 due to crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451606 (https://phabricator.wikimedia.org/T201603) [10:31:04] !log rebooting archiva1001 for kernel security update [10:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:23] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@ece750a]: Drop mobile-sections, feed and media end points from non-WPs, take #4 - T201103 (duration: 12m 11s) [10:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:30] T201103: Reconsider use of RESTBase k-r-v storage for mobileapps - https://phabricator.wikimedia.org/T201103 [10:31:34] !log mobrovac@deploy1001 Started deploy [restbase/deploy@ece750a]: Drop mobile-sections, feed and media end points from non-WPs, take #5 - T201103 [10:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:48] (03PS2) 10Gehel: [WIP] extract reporting from BaseEventHandler [software/cumin] - 10https://gerrit.wikimedia.org/r/451080 [10:35:03] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1100 and db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451268 (owner: 10Jcrespo) [10:35:32] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2069 due to crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451606 (https://phabricator.wikimedia.org/T201603) (owner: 10Jcrespo) [10:36:11] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1100 and db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451268 (owner: 10Jcrespo) [10:36:25] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1100 and db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451268 (owner: 10Jcrespo) [10:36:36] (03CR) 10jerkins-bot: [V: 04-1] [WIP] extract reporting from BaseEventHandler [software/cumin] - 10https://gerrit.wikimedia.org/r/451080 (owner: 10Gehel) [10:36:49] (03Merged) 10jenkins-bot: mariadb: Depool db2069 due to crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451606 (https://phabricator.wikimedia.org/T201603) (owner: 10Jcrespo) [10:38:02] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@ece750a]: Drop mobile-sections, feed and media end points from non-WPs, take #5 - T201103 (duration: 06m 27s) [10:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:08] T201103: Reconsider use of RESTBase k-r-v storage for 
mobileapps - https://phabricator.wikimedia.org/T201103 [10:38:49] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2069 due to crash (duration: 01m 00s) [10:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:23] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1100 and db1123 fully (duration: 00m 59s) [10:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:35] PROBLEM - Host elastic2028 is DOWN: PING CRITICAL - Packet loss = 100% [10:50:16] ^ gehel, 2028 is part of the stretch reimage? [10:50:35] RECOVERY - Host elastic2028 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms [10:50:45] please see also my comments on other channel, there seems to be other issues [10:50:55] (03CR) 10jenkins-bot: mariadb: Depool db2069 due to crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451606 (https://phabricator.wikimedia.org/T201603) (owner: 10Jcrespo) [10:50:58] maybe network issues came back? [10:51:03] (03PS1) 10Vgutierrez: lvs2007-lvs2010 production DNS entries, all vlans [dns] - 10https://gerrit.wikimedia.org/r/451607 (https://phabricator.wikimedia.org/T196560) [10:52:05] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2030.codfw.wmnet', 'elastic2028.codfw.wmnet', 'elastic2029.codfw.wmnet'] ``` an... [10:57:55] !log truncating commons and others mobile tables - T201103 [10:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:01] T201103: Reconsider use of RESTBase k-r-v storage for mobileapps - https://phabricator.wikimedia.org/T201103 [10:59:07] "Drive Array Controller Failure (Slot 0)" [10:59:58] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2069 storage crash - https://phabricator.wikimedia.org/T201603 (10jcrespo) [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180809T1100). [11:00:04] tgr, Jhs, Urbanecm, and hoo: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:26] PROBLEM - Elasticsearch HTTPS on elastic2028 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2028.codfw.wmnet [11:00:36] o/ [11:01:00] I can SWAT today [11:01:08] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2069 storage crash - https://phabricator.wikimedia.org/T201603 (10jcrespo) [11:01:16] I'm here [11:01:17] tgr: go ahead with your patch while I review other patches [11:02:01] (03PS3) 10Gergő Tisza: Enable TemplateStyles everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451076 (https://phabricator.wikimedia.org/T199909) [11:02:08] (03CR) 10Gergő Tisza: [C: 032] Enable TemplateStyles everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451076 (https://phabricator.wikimedia.org/T199909) (owner: 10Gergő Tisza) [11:03:40] (03Merged) 10jenkins-bot: Enable TemplateStyles everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451076 (https://phabricator.wikimedia.org/T199909) (owner: 10Gergő Tisza) [11:04:04] here [11:05:55] (03CR) 10jenkins-bot: Enable TemplateStyles everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451076 (https://phabricator.wikimedia.org/T199909) (owner: 10Gergő Tisza) [11:06:42] hoo|away: you're a deployer, want to deploy your changes yourself? [11:07:30] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:451076|Enable TemplateStyles everywhere (T199909)]] (duration: 00m 58s) [11:07:30] Jhs, Urbanecm: please stand by, you are next, as soon as tgr is done [11:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:36] T199909: Deploy TemplateStyles everywhere on 2018-08-09 - https://phabricator.wikimedia.org/T199909 [11:07:41] ack [11:07:41] zeljkof: done [11:07:57] tgr: thanks! [11:08:12] Jhs: uh, your patch can not be deployed as-is [11:08:32] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2069 storage crash - https://phabricator.wikimedia.org/T201603 (10jcrespo) [11:09:07] PROBLEM - Elasticsearch HTTPS on elastic2029 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2029.codfw.wmnet [11:09:11] Jhs: there is a (recent) rule (will find it) that a patch must be deployed with one command, and your patch requires two commands [11:09:34] Jhs: it should be easy to split it in two patches, one changing the logos, one using the new logos [11:09:45] Jhs: let me know if you need help with that [11:11:16] (03CR) 10Zfilipin: [C: 04-1] "This patch can not be deployed as-is. 
There is a (recent) rule (will find it) that a patch must be deployed with one command, and your pat" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451541 (https://phabricator.wikimedia.org/T201562) (owner: 10Jon Harald Søby) [11:11:48] Urbanecm: you are next, while Jhs split's his patch [11:12:42] (03PS2) 10Zfilipin: Remove noratelimit from epcoordinator group on cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450035 (https://phabricator.wikimedia.org/T201010) (owner: 10Urbanecm) [11:13:59] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450035 (https://phabricator.wikimedia.org/T201010) (owner: 10Urbanecm) [11:14:12] (03PS1) 10Muehlenhoff: Enable microcode for a few more misc roles [puppet] - 10https://gerrit.wikimedia.org/r/451608 [11:15:02] (03CR) 10jerkins-bot: [V: 04-1] Enable microcode for a few more misc roles [puppet] - 10https://gerrit.wikimedia.org/r/451608 (owner: 10Muehlenhoff) [11:15:15] (03Merged) 10jenkins-bot: Remove noratelimit from epcoordinator group on cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450035 (https://phabricator.wikimedia.org/T201010) (owner: 10Urbanecm) [11:15:30] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2069 storage crash - https://phabricator.wikimedia.org/T201603 (10jcrespo) mysql start log seems clean except for the mysql database, which is a non issue: ```lines=10 Aug 09 11:10:35 db2069 systemd[1]: Starting MariaDB database server... Aug 09 11:1... [11:16:55] moritzm: thanks for the ping (sorry, was @lunch did not see), yes, those are reimages, no real issue [11:17:40] Urbanecm: 450035 is at mwdebug [11:17:45] ack [11:18:22] zeljkof, please deploy [11:18:31] Urbanecm: ok [11:19:37] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:450035|Remove noratelimit from epcoordinator group on cswiki (T201010)]] (duration: 00m 58s) [11:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:44] T201010: Remove noratelimit from EP coordinator group on cswiki - https://phabricator.wikimedia.org/T201010 [11:19:51] Urbanecm: 450035 deployed [11:19:55] Thanks [11:20:30] gehel: ack [11:20:36] (03CR) 10jenkins-bot: Remove noratelimit from epcoordinator group on cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450035 (https://phabricator.wikimedia.org/T201010) (owner: 10Urbanecm) [11:20:41] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410 (10Tgr) [11:20:49] (03PS2) 10Zfilipin: Transwiki import in zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451596 (https://phabricator.wikimedia.org/T201328) (owner: 10Urbanecm) [11:20:59] (03CR) 10Vgutierrez: [C: 031] "volans DNS validator doesn't show any additional error" [dns] - 10https://gerrit.wikimedia.org/r/451607 (https://phabricator.wikimedia.org/T196560) (owner: 10Vgutierrez) [11:21:02] zeljkof, i don't know how to do that. why does it need two commands? [11:21:54] Jhs, you probably modify two files. You cannot sync two files at once, because the command is called scap sync-file [11:21:59] *modified [11:22:10] Jhs: because files are in two folders, I can sync one folder at a time, tool limitation [11:22:10] Urbanecm: can you help Jhs with the patch? 
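Being "at mwdebug" means the change has been pushed only to a debug app server, so testers have to pin their own requests to that host before the fleet-wide sync. A minimal sketch with curl and the X-Wikimedia-Debug header, assuming mwdebug1002.eqiad.wmnet as the staged backend (in practice most testers use the WikimediaDebug browser extension, which sets the same header):

```
# Pin a request to the debug backend, e.g. to confirm the cswiki
# epcoordinator group no longer carries noratelimit
curl -s -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' \
  'https://cs.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=usergroups&format=json'
```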
[11:22:14] Will do [11:22:15] (03PS2) 10Muehlenhoff: Enable microcode for a few more misc roles [puppet] - 10https://gerrit.wikimedia.org/r/451608 [11:22:21] thx [11:22:45] Urbanecm: well, scap can actually sync multiple files in the same folder, but can not sync two root folders :/ [11:22:52] aha, so (image/binary) files use a different command than code merges? [11:22:53] I know...not fully precise [11:22:57] (03CR) 10jerkins-bot: [V: 04-1] Enable microcode for a few more misc roles [puppet] - 10https://gerrit.wikimedia.org/r/451608 (owner: 10Muehlenhoff) [11:23:01] No, same command. But you must issue it twice [11:23:33] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410 (10Tgr) 05Open>03Resolved [11:23:34] Jhs: it's just that the files are in two different root folders, and I can sync only one at a time [11:23:49] You must keep changes in static/images/project-logos in separate commit than wmf-config [11:23:54] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451596 (https://phabricator.wikimedia.org/T201328) (owner: 10Urbanecm) [11:24:01] ok, so I should have submitted the image changes in one patch and the Initiate-stuff in a different patch? [11:24:08] Yes [11:24:09] Jhs: yes [11:24:18] ok, got it [11:24:28] Please do it now, zeljkof will deploy the pathces in this SWAT, there should be time for it [11:24:38] (well, i hope so :)) [11:24:48] Jhs: there should be time in this window [11:25:02] Urbanecm, if you know how then that will probably be faster :) [11:25:12] Ok, doing the split process [11:25:22] (03Merged) 10jenkins-bot: Transwiki import in zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451596 (https://phabricator.wikimedia.org/T201328) (owner: 10Urbanecm) [11:25:24] (03PS3) 10Muehlenhoff: Enable microcode for a few more misc roles [puppet] - 10https://gerrit.wikimedia.org/r/451608 [11:25:29] my method of creating a new patch is to delete the folder i have, re-clone it then make the necessary changes… it takes some time :P [11:25:46] Jhs, do you want me to teach a better method? :D [11:26:00] (03CR) 10jerkins-bot: [V: 04-1] Enable microcode for a few more misc roles [puppet] - 10https://gerrit.wikimedia.org/r/451608 (owner: 10Muehlenhoff) [11:26:01] Urbanecm, sure, but not right now :P kinda busy with some other work atm :) [11:26:15] Later, when you and me have time :) [11:26:56] Urbanecm: 451596 at mwdebug [11:27:19] (03PS2) 10Urbanecm: Update logo for hewikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451541 (https://phabricator.wikimedia.org/T201562) (owner: 10Jon Harald Søby) [11:27:27] ^^ please review&merge PS2 ^^ [11:28:00] (03PS4) 10Muehlenhoff: Enable microcode for a few more misc roles [puppet] - 10https://gerrit.wikimedia.org/r/451608 [11:28:23] Hmm, sysop or steward can test. Can you deploy 451596 please? 
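The one-sync-per-patch rule zeljkof and Urbanecm describe comes straight from the tooling: each config patch is pushed with a single scap sync-file call, and a call takes one path (a file or a directory), so a change touching both static/ and wmf-config/ has to land as two patches. A sketch of the two runs for the split logo change, mirroring the SAL entries that appear later in this log (paths and messages as used there):

```
# Patch 1: ship the new logo files
scap sync-file static/images/project-logos/ 'SWAT: [[gerrit:451541|Update logo for hewikibooks (T201562)]]'

# Patch 2: switch the config over to them
scap sync-file wmf-config/InitialiseSettings.php 'SWAT: [[gerrit:451610|Use HD logos in hewikibooks (T201562)]]'
```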
[11:28:33] (03CR) 10jerkins-bot: [V: 04-1] Enable microcode for a few more misc roles [puppet] - 10https://gerrit.wikimedia.org/r/451608 (owner: 10Muehlenhoff) [11:28:34] Urbanecm: sure [11:29:02] (03PS1) 10Urbanecm: Use HD logos in hewikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451610 (https://phabricator.wikimedia.org/T201562) [11:29:11] zeljkof, splitting was done by me [11:29:14] zeljkof, and after they are merged, remember to run echo "https://en.wikipedia.org/static/images/project-logos/hewikibooks.png" | mwscript purgeList.php [11:29:41] Urbanecm: thanks! [11:29:43] yw [11:29:48] Jhs: sure, thanks for the reminder [11:30:02] Jhs, just formal correction, not merged, deployed :) [11:30:05] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:451596|Transwiki import in zhwikiversity (T201328)]] (duration: 00m 56s) [11:30:08] ah, right :) [11:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:12] T201328: Transwiki import in zhwikiversity - https://phabricator.wikimedia.org/T201328 [11:30:13] & thanks for the help Urbanecm [11:30:15] (merge = the change is in master branch, deploy = the change is on server) [11:30:16] Yw [11:30:21] Urbanecm: 451596 deployed [11:30:27] ack [11:30:36] !log resume rolling reboots of caches for numa_networking T193865 [11:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:42] T193865: Enable numa_networking on all caches - https://phabricator.wikimedia.org/T193865 [11:31:21] Urbanecm, Jhs: please add the second patch to the calendar (if it's ready), I can only see one so far [11:31:24] will do [11:31:58] Done [11:32:09] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451541 (https://phabricator.wikimedia.org/T201562) (owner: 10Jon Harald Søby) [11:32:32] Urbanecm: thanks! [11:33:25] (03Merged) 10jenkins-bot: Update logo for hewikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451541 (https://phabricator.wikimedia.org/T201562) (owner: 10Jon Harald Søby) [11:35:19] Jhs: 451541 is at mwdebug1002, let me know if you need help testing there [11:35:25] (03PS5) 10Muehlenhoff: Enable microcode for a few more misc roles [puppet] - 10https://gerrit.wikimedia.org/r/451608 [11:35:40] (03CR) 10jenkins-bot: Transwiki import in zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451596 (https://phabricator.wikimedia.org/T201328) (owner: 10Urbanecm) [11:35:41] (03CR) 10jenkins-bot: Update logo for hewikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451541 (https://phabricator.wikimedia.org/T201562) (owner: 10Jon Harald Søby) [11:35:58] (03CR) 10jerkins-bot: [V: 04-1] Enable microcode for a few more misc roles [puppet] - 10https://gerrit.wikimedia.org/r/451608 (owner: 10Muehlenhoff) [11:37:42] Jhs: do you need more time to test, can I deploy it? [11:38:46] RECOVERY - MariaDB Slave Lag: x1 on db2069 is OK: OK slave_sql_lag Replication lag: 0.36 seconds [11:40:38] ok, checked myself, files are there [11:40:40] deploying [11:42:09] !log zfilipin@deploy1001 Synchronized static/images/project-logos/: SWAT: [[gerrit:451541|Update logo for hewikibooks (T201562)]] (duration: 01m 00s) [11:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:16] T201562: Changing Hebrew WikiBooks logo - https://phabricator.wikimedia.org/T201562 [11:42:46] Jhs: 451541 is deployed, are you still around? 
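Swapping the file on disk is only half the job: the old logo can live on in the CDN, which is why Jhs asks for the purgeList.php run above; that maintenance script reads URLs from stdin and sends purges for them. A small sketch including a follow-up check, assuming mwscript's usual --wiki argument (any wiki works for a plain URL purge) and that the edge exposes Age/X-Cache headers:

```
# Purge the cached copy of the logo
echo 'https://en.wikipedia.org/static/images/project-logos/hewikibooks.png' \
  | mwscript purgeList.php --wiki=enwiki

# Spot-check: a freshly fetched object should show a low Age
curl -sI 'https://en.wikipedia.org/static/images/project-logos/hewikibooks.png' \
  | grep -iE '^(age|x-cache|last-modified):'
```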
[11:44:16] purged [11:45:39] (03CR) 10Zfilipin: "Purged: T201562#4491300" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451541 (https://phabricator.wikimedia.org/T201562) (owner: 10Jon Harald Søby) [11:47:30] (03PS2) 10Zfilipin: Use HD logos in hewikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451610 (https://phabricator.wikimedia.org/T201562) (owner: 10Urbanecm) [11:48:10] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451610 (https://phabricator.wikimedia.org/T201562) (owner: 10Urbanecm) [11:48:41] hoo: around for SWAT? [11:49:01] yes [11:49:29] (03Merged) 10jenkins-bot: Use HD logos in hewikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451610 (https://phabricator.wikimedia.org/T201562) (owner: 10Urbanecm) [11:50:04] hoo: want to deploy your commits yourself? (I'm finishing the last commit from Jhs) [11:50:26] !log rebooting netmon2001 for kernel security update [11:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:37] (03CR) 10jenkins-bot: Use HD logos in hewikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451610 (https://phabricator.wikimedia.org/T201562) (owner: 10Urbanecm) [11:50:48] Yes :) [11:51:01] hoo: great! I'll ping you in a minute or two [11:51:07] Thanks [11:52:12] Urbanecm: looks like Jhs is not longer around, can you help me test 451610? [11:52:25] Sure [11:52:32] Is it on mwdebug? [11:52:38] Urbanecm: I've pushed it to mwdebug, but the logo at https://he.wikibooks.org/wiki/%D7%A2%D7%9E%D7%95%D7%93_%D7%A8%D7%90%D7%A9%D7%99 still looks the same :/ [11:53:14] (03PS1) 10Vgutierrez: fix eqiad lvs cross-vlan A records [dns] - 10https://gerrit.wikimedia.org/r/451614 [11:53:38] (03CR) 10Alex Monk: [C: 04-1] "We can change the output format to pson if you wish to supply the patch. It will need to be a separate commit and this needs to be split u" [software/certcentral] - 10https://gerrit.wikimedia.org/r/451271 (owner: 10Vgutierrez) [11:53:53] Urbanecm: I'm not sure if I'm doing something wrong [11:54:04] Urbanecm: ah, maybe it's already the new logo? [11:54:14] since I've purged it previously [11:55:02] zeljkof, yes, it is [11:55:15] Please deploy the HD thing if it is not deployed (I have mwdebug enabled just in case :D) [11:56:05] Urbanecm: deploying, thanks, I rarely test and I forgot those details :D [11:56:27] That's why SWAT should never be a task for one :D [11:57:14] Urbanecm: it could, if I knew what I was doing :D [11:57:34] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:451610|Use HD logos in hewikibooks (T201562)]] (duration: 00m 55s) [11:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:41] T201562: Changing Hebrew WikiBooks logo - https://phabricator.wikimedia.org/T201562 [11:57:43] You can make a mistake then, at least in theory :D [11:57:47] Anyway, thank you :) [11:57:53] Jhs, Urbanecm: 451610 is deployed! [11:58:02] I've closed the task [11:58:02] hoo: you are next, swat is yours! 
[11:58:07] Thanks :) [11:58:18] (03CR) 10Hoo man: [C: 032] Enable RDF export for lexicographical data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450966 (https://phabricator.wikimedia.org/T201153) (owner: 10Lucas Werkmeister (WMDE)) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180809T1200) [12:02:00] (03PS2) 10Hoo man: Enable RDF export for lexicographical data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450966 (https://phabricator.wikimedia.org/T201153) (owner: 10Lucas Werkmeister (WMDE)) [12:02:12] (03CR) 10Hoo man: [C: 032] Enable RDF export for lexicographical data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450966 (https://phabricator.wikimedia.org/T201153) (owner: 10Lucas Werkmeister (WMDE)) [12:02:21] sorry, my pings on IRC aren't working. Thanks for checking & deploying, Urbanecm & zeljkof [12:03:04] (03PS9) 10Vgutierrez: Move get_certs out of CertCentral class [software/certcentral] - 10https://gerrit.wikimedia.org/r/451271 [12:03:44] (03Merged) 10jenkins-bot: Enable RDF export for lexicographical data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450966 (https://phabricator.wikimedia.org/T201153) (owner: 10Lucas Werkmeister (WMDE)) [12:03:59] (03CR) 10jerkins-bot: [V: 04-1] Move get_certs out of CertCentral class [software/certcentral] - 10https://gerrit.wikimedia.org/r/451271 (owner: 10Vgutierrez) [12:05:29] (03PS10) 10Vgutierrez: Move get_certs out of CertCentral class [software/certcentral] - 10https://gerrit.wikimedia.org/r/451271 [12:05:38] !log rearmed keyholder on netmon2001 [12:05:41] !log hoo@deploy1001 Synchronized wmf-config/Wikibase.php: Enable RDF export for lexicographical data (T201153) (duration: 00m 58s) [12:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:46] (03CR) 10jenkins-bot: Enable RDF export for lexicographical data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450966 (https://phabricator.wikimedia.org/T201153) (owner: 10Lucas Werkmeister (WMDE)) [12:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:48] T201153: re-enable RDF export for Lexemes - https://phabricator.wikimedia.org/T201153 [12:06:22] tgr available ? [12:06:39] Alaa: o/ [12:06:48] !log rebooting netmon1002 for kernel security update [12:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:58] works [12:07:30] tgr can you please try to fix this one https://phabricator.wikimedia.org/T201314 (the user has more than 50K edits and it's stuck since 6 August!) [12:07:54] it's on my list [12:08:46] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:09:00] (03CR) 10Alex Monk: Move get_certs out of CertCentral class (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/451271 (owner: 10Vgutierrez) [12:09:00] Aha. 
Thanks tgr ^^ [12:09:13] (03PS5) 10Hoo man: Do not leak local $wgWBShared… variables to the global scope [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444632 (owner: 10Thiemo Kreuz (WMDE)) [12:09:23] !log hoo@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable RDF export for lexicographical data (T201153) (duration: 00m 56s) [12:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:41] (03CR) 10Hoo man: [C: 032] Do not leak local $wgWBShared… variables to the global scope [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444632 (owner: 10Thiemo Kreuz (WMDE)) [12:10:26] !log rearmed keyholder on netmon1002 [12:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:07] (03Merged) 10jenkins-bot: Do not leak local $wgWBShared… variables to the global scope [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444632 (owner: 10Thiemo Kreuz (WMDE)) [12:12:28] (03CR) 10Vgutierrez: Move get_certs out of CertCentral class (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/451271 (owner: 10Vgutierrez) [12:12:47] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:13:09] (03PS1) 10Vgutierrez: Refactor CertCentral API [software/certcentral] - 10https://gerrit.wikimedia.org/r/451615 [12:13:15] (03CR) 10Alex Monk: [C: 032] Move get_certs out of CertCentral class [software/certcentral] - 10https://gerrit.wikimedia.org/r/451271 (owner: 10Vgutierrez) [12:13:38] !log rebooting netmon1003 (servermon.wikimedia.org) for kernel security update [12:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:19] (03CR) 10Vgutierrez: "text/yaml --> text/pson change will have its own commit" [software/certcentral] - 10https://gerrit.wikimedia.org/r/451615 (owner: 10Vgutierrez) [12:14:23] (03Merged) 10jenkins-bot: Move get_certs out of CertCentral class [software/certcentral] - 10https://gerrit.wikimedia.org/r/451271 (owner: 10Vgutierrez) [12:14:24] !log hoo@deploy1001 Synchronized wmf-config/Wikibase.php: Do not leak local $wgWBShared… variables to the global scope (duration: 00m 56s) [12:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:11] * hoo declares SWAT done [12:15:25] (03CR) 10jenkins-bot: Move get_certs out of CertCentral class [software/certcentral] - 10https://gerrit.wikimedia.org/r/451271 (owner: 10Vgutierrez) [12:15:37] (03PS2) 10Vgutierrez: Refactor CertCentral API [software/certcentral] - 10https://gerrit.wikimedia.org/r/451615 [12:18:35] Jhs: it's important we managed to get it done :D cc Urbanecm [12:18:53] I'm not watching this [12:18:55] SWAT is over now? 
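On the keyholder re-arming !logged above (netmon1002/netmon2001): after a reboot the keyholder agent comes back with its keys locked and has to be re-armed by hand before automated SSH jobs work again. A sketch, assuming the standard keyholder wrapper subcommands:

```
sudo keyholder status   # shows which identities are currently armed
sudo keyholder arm      # prompts for the key passphrase(s) and re-arms the agent
```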
[12:19:19] (03CR) 10Alex Monk: Refactor CertCentral API (033 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/451615 (owner: 10Vgutierrez) [12:19:28] yeah… and I already didn't finish in time :S [12:19:35] (03CR) 10Alex Monk: [C: 04-1] "PS1 comments" [software/certcentral] - 10https://gerrit.wikimedia.org/r/451615 (owner: 10Vgutierrez) [12:20:58] (03CR) 10jenkins-bot: Do not leak local $wgWBShared… variables to the global scope [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444632 (owner: 10Thiemo Kreuz (WMDE)) [12:22:39] !log rebooting ununpentium for kernel security update [12:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:33] (03PS3) 10Arturo Borrero Gonzalez: cloudvps: merge main/eqiad1 keystone services [puppet] - 10https://gerrit.wikimedia.org/r/451314 (https://phabricator.wikimedia.org/T201504) [12:23:44] 10Operations, 10Core-Platform-Team, 10Performance-Team, 10TechCom-RFC, and 4 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Imarlier) >>! In T201409#4491050, @mobrovac wrote: > > > Perfect -- to be clear, I wasn't making any objectio... [12:25:26] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:25:31] (03PS3) 10Vgutierrez: Refactor CertCentral API [software/certcentral] - 10https://gerrit.wikimedia.org/r/451615 [12:26:02] (03CR) 10Vgutierrez: Refactor CertCentral API (033 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/451615 (owner: 10Vgutierrez) [12:26:56] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (watching), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10Deskana) 05stalled>03Open a:03Mvolz [12:30:25] (03CR) 10Alex Monk: Refactor CertCentral API (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/451615 (owner: 10Vgutierrez) [12:31:00] (03CR) 10Alex Monk: [C: 032] Refactor CertCentral API [software/certcentral] - 10https://gerrit.wikimedia.org/r/451615 (owner: 10Vgutierrez) [12:32:04] (03Merged) 10jenkins-bot: Refactor CertCentral API [software/certcentral] - 10https://gerrit.wikimedia.org/r/451615 (owner: 10Vgutierrez) [12:33:04] (03CR) 10jenkins-bot: Refactor CertCentral API [software/certcentral] - 10https://gerrit.wikimedia.org/r/451615 (owner: 10Vgutierrez) [12:34:03] RECOVERY - Elasticsearch HTTPS on elastic2029 is OK: SSL OK - Certificate elastic2029.codfw.wmnet valid until 2023-08-08 12:32:49 +0000 (expires in 1824 days) [12:35:14] RECOVERY - Elasticsearch HTTPS on elastic2028 is OK: SSL OK - Certificate elastic2028.codfw.wmnet valid until 2023-08-08 12:31:52 +0000 (expires in 1824 days) [12:37:14] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2013.codfw.wmnet', 'elastic2014... [12:41:03] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) @BBlack Could use a hand from you/someone on your team. 
I've generated... [12:41:36] 10Operations, 10Data-Services, 10SRE-Access-Requests, 10Patch-For-Review: Access to dumps servers - https://phabricator.wikimedia.org/T201350 (10Imarlier) 05Open>03Resolved a:03Bstorm [12:45:55] (03PS7) 10Volans: Add cookbook entry point script [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) [12:45:57] (03CR) 10Volans: "See inline, implemented class approach." (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:48:19] (03PS3) 10Gehel: [WIP] extract reporting from BaseEventHandler [software/cumin] - 10https://gerrit.wikimedia.org/r/451080 [12:49:41] (03PS1) 10Ema: Deploy ATS backends on remaining codfw test hosts [puppet] - 10https://gerrit.wikimedia.org/r/451620 (https://phabricator.wikimedia.org/T199720) [12:50:59] (03CR) 10jerkins-bot: [V: 04-1] [WIP] extract reporting from BaseEventHandler [software/cumin] - 10https://gerrit.wikimedia.org/r/451080 (owner: 10Gehel) [12:51:53] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:52:16] (03PS1) 10Muehlenhoff: Enable microcode on DNS recursors [puppet] - 10https://gerrit.wikimedia.org/r/451622 [12:52:47] (03CR) 10Ema: [C: 032] Deploy ATS backends on remaining codfw test hosts [puppet] - 10https://gerrit.wikimedia.org/r/451620 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [12:53:03] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 34 ESP OK [12:53:14] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:54:04] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [12:56:49] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10BBlack) So a few things: 1) We'll have to hack in the rewrite manually in VCL, bu... [12:58:56] 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` ['cp2009.codfw.wmnet', 'cp2015.codfw.wmnet', 'cp2021.codfw.wmnet'] ``` T... [12:59:14] (03PS2) 10Muehlenhoff: Enable microcode on DNS recursors [puppet] - 10https://gerrit.wikimedia.org/r/451622 [13:00:02] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10ArielGlenn) >>! In T199252#4491444, @BBlack wrote: > So a few things: > 3)... 
[13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180809T1300) [13:01:03] (03CR) 10Ema: [C: 031] Enable microcode on DNS recursors [puppet] - 10https://gerrit.wikimedia.org/r/451622 (owner: 10Muehlenhoff) [13:02:17] (03CR) 10Muehlenhoff: [C: 032] Enable microcode on DNS recursors [puppet] - 10https://gerrit.wikimedia.org/r/451622 (owner: 10Muehlenhoff) [13:04:03] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) >>! In T199252#4491444, @BBlack wrote: > So a few things: > 1) We'll ha... [13:04:26] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active [13:04:36] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [13:04:59] (03PS1) 10Ema: Deploy ATS backends on eqiad test hosts [puppet] - 10https://gerrit.wikimedia.org/r/451623 (https://phabricator.wikimedia.org/T199720) [13:05:11] (03PS8) 10Volans: Add cookbook entry point script [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) [13:06:02] (03CR) 10Ema: [C: 032] Deploy ATS backends on eqiad test hosts [puppet] - 10https://gerrit.wikimedia.org/r/451623 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [13:08:13] 10Operations, 10Citoid, 10Services (watching), 10VisualEditor (Current work): Deploy translation-server-v2 - https://phabricator.wikimedia.org/T201611 (10Mvolz) p:05Triage>03High [13:08:52] 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp1071.eqiad.wmnet', 'cp1072.eqiad.wmnet'] ``` The log can be foun... [13:08:59] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (watching), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10Mvolz) [13:10:26] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [13:10:45] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:11:37] arturo: FYI a bunch of labstore related alarms flapping ^^^ [13:12:32] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10Milimetric) I have a proposal that, whether practical or not, may help us answer @awight's question. Whe... [13:12:39] mmm [13:12:46] why I didn't recv a SMS? [13:13:20] they might not be set to page :) [13:14:22] Aug 9 13:13:59 labstore1005 maintain-dbusers[102345]: pymysql.err.InternalError: (1290, 'The MariaDB server is running with the --read-only option so it cannot execute this statement') [13:15:00] (03CR) 10Matěj Suchánek: "The submitted patch was https://gerrit.wikimedia.org/r/440859, this one can be abandoned." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/440722 (https://phabricator.wikimedia.org/T197526) (owner: 10Urbanecm) [13:16:49] (03PS1) 10Ema: Move trafficserver::backend hiera settings from codfw to common [puppet] - 10https://gerrit.wikimedia.org/r/451626 (https://phabricator.wikimedia.org/T199720) [13:17:10] (03CR) 10Gehel: [C: 031] "Good enough for me. I'm slightly disappointed to see the cookbook's tree only represented as a string, and not as a proper tree, but it is" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:17:45] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 34 ESP OK [13:17:57] (03CR) 10Ema: [C: 032] Move trafficserver::backend hiera settings from codfw to common [puppet] - 10https://gerrit.wikimedia.org/r/451626 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [13:19:05] jynus: is m5-master in read-only mode? [13:20:08] it shouldn't [13:20:12] it would alert if not [13:21:07] (03PS2) 10BBlack: cacheproxy: reduce fq flow limit 1Gbps -> 256Mbps [puppet] - 10https://gerrit.wikimedia.org/r/451535 [13:21:24] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:21:29] arturo: read_only | OFF [13:21:49] are you confusing it with testlabswiki? [13:23:03] https://phabricator.wikimedia.org/T201082 [13:25:58] jynus: the only clue I have is that error message in labstore1005 [13:26:27] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:27:06] this is the complete traceback [13:27:09] https://www.irccloud.com/pastebin/W6sLnfGz/ [13:27:40] that is probably wikireplicas [13:27:45] will handle that later [13:27:59] (at a meeting) [13:28:10] ok, will go for lunch now as well [13:28:28] 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp1071.eqiad.wmnet', 'cp1072.eqiad.wmnet'] ``` and were **ALL** successful. [13:28:57] !log rebooting dns5001 for kernel security update/enabling microcode updates [13:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:17] (03PS3) 10BBlack: cacheproxy: reduce fq flow limit 1Gbps -> 256Mbps [puppet] - 10https://gerrit.wikimedia.org/r/451535 [13:34:59] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active [13:35:19] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [13:37:19] arturo: I think I know what it is [13:37:47] https://gerrit.wikimedia.org/r/451307 [13:37:56] note it was reviewed by cloud and dbas [13:38:02] but apparently had issues [13:38:12] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2013.codfw.wmnet', 'elastic2014.codfw.wmnet', 'elastic2015.codfw.wmnet'] ``` Of... [13:38:14] I think the best option is to add SUPER to maintain-users [13:38:26] I will send a patch [13:38:43] arturo: will you able to retry/restart the script? 
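To make the failure above concrete: maintain-dbusers (a Python/pymysql script) hit MySQL error 1290 because the server it writes accounts to was in read-only mode, and SUPER bypasses read_only, which is why the grant jynus proposes fixes it. A rough sketch of the manual steps involved; the host and account names are placeholders, not the real ones:

```
# is the server the script talks to read-only? (1 = yes, 0 = no)
mysql -h WIKIREPLICA_HOST -e 'SELECT @@global.read_only'

# the fix boils down to a grant roughly like this (account name is hypothetical)
mysql -h WIKIREPLICA_HOST -e "GRANT SUPER ON *.* TO 'maintainuser'@'%'"

# once the grants are in place, clear the failed unit and watch it come back
sudo systemctl restart maintain-dbusers
sudo journalctl -u maintain-dbusers -n 50 --no-pager
```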
[13:41:02] (03PS1) 10Ema: ATS clusters: set IPv6 addresses [puppet] - 10https://gerrit.wikimedia.org/r/451628 (https://phabricator.wikimedia.org/T199720) [13:41:42] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2013.codfw.wmnet', 'elastic2014... [13:42:54] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2013.codfw.wmnet'] ``` The log... [13:43:19] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:44:00] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [13:46:35] (03CR) 10Ema: [C: 032] ATS clusters: set IPv6 addresses [puppet] - 10https://gerrit.wikimedia.org/r/451628 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [13:47:06] (03PS5) 10Jcrespo: mariadb-backups: Start backing up s2-5 from the new eqiad backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/450929 (https://phabricator.wikimedia.org/T201392) [13:47:08] (03PS1) 10Jcrespo: wikireplicas: Add SUPER privileges to cloud admin accounts [puppet] - 10https://gerrit.wikimedia.org/r/451629 (https://phabricator.wikimedia.org/T172489) [13:48:10] !log rebooting dns5002 for kernel security update/enabling microcode updates [13:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:18] (03PS2) 10Jcrespo: wikireplicas: Add SUPER privileges to cloud admin accounts [puppet] - 10https://gerrit.wikimedia.org/r/451629 (https://phabricator.wikimedia.org/T172489) [13:48:53] (03CR) 10Jcrespo: "This is a followup to correct issues of I5b05cb2e149bd892af55ef54503714016392d2e3" [puppet] - 10https://gerrit.wikimedia.org/r/451629 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [13:51:35] (03PS3) 10Jcrespo: wikireplicas: Add SUPER privileges to cloud admin accounts [puppet] - 10https://gerrit.wikimedia.org/r/451629 (https://phabricator.wikimedia.org/T172489) [13:55:01] !log rebooting dns4001 for kernel security update/enabling microcode updates [13:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:23] 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ema) [13:56:58] 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ema) a:03ema [13:57:49] arturo: I have deployed the grant changes [14:00:24] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active [14:00:42] what I don't know is if those will create some account gaps or they are fixed automatically [14:00:43] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [14:06:17] (03CR) 10Jcrespo: "I have applied this in production- do account gaps have to be filled in, or are they automatically solved?" 
[puppet] - 10https://gerrit.wikimedia.org/r/451629 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [14:09:07] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2013.codfw.wmnet'] ``` and were **ALL** successful. [14:09:15] (03CR) 10Bstorm: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/451629 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [14:10:50] (03CR) 10Bstorm: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/451629 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [14:11:53] (03PS9) 10Volans: Add cookbook entry point script [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) [14:12:23] !log rebooting dns4002 for kernel security update/enabling microcode updates [14:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:33] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2014.codfw.wmnet', 'elastic2015... [14:13:05] 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp1073.eqiad.wmnet', 'cp1074.eqiad.wmnet'] ``` The log can be foun... [14:13:33] ema, gehel are you doing it on purpose to start reimages a the same time to find race conditions in the process? :-P [14:14:25] (03CR) 10Gehel: Add cookbook entry point script (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:15:07] volans: yep, there is something broken and I want to generate enough logs so that you can debug the issue :) [14:15:18] (03CR) 10Krinkle: [C: 04-1] "Sorry, but it needs to be the other way around. Currently this patch removes the reference, and is set as child of the parent commit that " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450906 (owner: 10Prtksxna) [14:15:20] volans: happy? [14:15:21] (03CR) 10Volans: Add cookbook entry point script (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:15:36] :) [14:17:02] (03CR) 10Krinkle: [C: 04-1] "You might know it already, but I usually use `git rebase origin/master -i` for this purpose; allows one to switch around the two commits a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450906 (owner: 10Prtksxna) [14:17:05] volans: we had to write a fair bit of code to do the coordination between ema and myself. We actually send reimage requests to a kafka topic, we have a golang script that pulls that topic waiting for 2 requests to be available, which then sends a message through redis to a scala daemon that logs into neodymium to start the reimage [14:17:24] rotfl [14:17:34] (03CR) 10Jcrespo: "So should I merge, or should we wait for more opinions, what do you think?" 
[puppet] - 10https://gerrit.wikimedia.org/r/451629 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [14:18:35] !log rebooting fermium (lists) to switch out of -rt kernel variant [14:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:47] 10Operations, 10SRE-Access-Requests: NDA access for Telecom Paristech Research Team - https://phabricator.wikimedia.org/T200800 (10Dzahn) [14:25:49] 10Operations, 10SRE-Access-Requests: analytics-privatedata-users access for Diego da Hora - https://phabricator.wikimedia.org/T201197 (10Dzahn) 05Open>03stalled [14:26:54] !log rebooting nescio for kernel security update/enabling microcode updates [14:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:56] !log rolling reboot of mx servers for kernel updates [14:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:39] 10Operations, 10SRE-Access-Requests: NDA access for Telecom Paristech Research Team - https://phabricator.wikimedia.org/T200800 (10Dzahn) [14:30:41] 10Operations, 10SRE-Access-Requests: analytics-privatedata-users access for Flavia Salutari - https://phabricator.wikimedia.org/T201199 (10Dzahn) 05Open>03stalled [14:30:49] 10Operations, 10SRE-Access-Requests: NDA access for Telecom Paristech Research Team - https://phabricator.wikimedia.org/T200800 (10Dzahn) [14:30:53] 10Operations, 10SRE-Access-Requests: analytics-privatedata-users access for Marc Jeanmougin - https://phabricator.wikimedia.org/T201198 (10Dzahn) 05Open>03stalled [14:34:29] !log rebooting maerlant for kernel security update/enabling microcode updates [14:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:06] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2015.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['elastic2015.codfw.wmnet... [14:39:45] 10Operations, 10ops-eqiad, 10Patch-For-Review: bast1002 - hardware (memory) issue - https://phabricator.wikimedia.org/T201355 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson I am resolving this, open it again if the problem returns [14:40:39] 10Operations, 10ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) HP is sending me a replacement battery...should be here sometime today or early tomorrow (8/10) [14:44:18] 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp1073.eqiad.wmnet', 'cp1074.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['cp1073.eqiad.wmnet', 'cp1074.eqi... 
[14:44:21] !log rebooting dns2001 for kernel security update/enabling microcode updates [14:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:06] (03CR) 10Bstorm: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/451629 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [14:48:25] (03PS4) 10Jcrespo: wikireplicas: Add SUPER privileges to cloud admin accounts [puppet] - 10https://gerrit.wikimedia.org/r/451629 (https://phabricator.wikimedia.org/T172489) [14:48:44] (03PS4) 10BBlack: cacheproxy: reduce fq flow limit 1Gbps -> 256Mbps [puppet] - 10https://gerrit.wikimedia.org/r/451535 [14:48:50] (03CR) 10Jcrespo: [C: 032] "If anyone else has issues with this, just comment and we can amend." [puppet] - 10https://gerrit.wikimedia.org/r/451629 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [14:49:19] (03PS5) 10BBlack: cacheproxy: reduce fq flow limit 1Gbps -> 256Mbps [puppet] - 10https://gerrit.wikimedia.org/r/451535 [14:49:38] (03CR) 10BBlack: [C: 032] cacheproxy: reduce fq flow limit 1Gbps -> 256Mbps [puppet] - 10https://gerrit.wikimedia.org/r/451535 (owner: 10BBlack) [14:50:56] !log all cpNNNN: max tcp per-flow rate reducing 1Gbps -> 256Mbps - https://gerrit.wikimedia.org/r/c/operations/puppet/+/451535 [14:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:04] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2015.codfw.wmnet'] ``` The log... [14:52:26] !log rebooting dns2002 for kernel security update/enabling microcode updates [14:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:47] 10Operations, 10Puppet, 10Release-Engineering-Team, 10puppet-compiler, 10Continuous-Integration-Config: Figure out a way to enable volunteers to use the puppet compiler - https://phabricator.wikimedia.org/T192532 (10Krenair) These are the people who fit @EddieGP's suggested criteria: P7440 [14:54:06] PROBLEM - Host elastic2015 is DOWN: PING CRITICAL - Packet loss = 100% [14:54:16] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: cp1080 uncorrectable DIMM error slot A5 - https://phabricator.wikimedia.org/T201174 (10Cmjohnson) @bblack The DIMM has been replaced with new, please resolve task once satisified Return Tracking USPS 9202 3946 5301 2439 4635 97 FEDEX 9611918 23... 
[14:56:55] RECOVERY - Host elastic2015 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms [14:57:23] !log mobrovac@deploy1001 Started deploy [citoid/deploy@92e5071]: Record format stats [14:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:15] PROBLEM - Check systemd state on elastic2015 is CRITICAL: Return code of 255 is out of bounds [14:59:16] PROBLEM - Elasticsearch HTTPS on elastic2015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [14:59:43] ouch, elastic2015 is being reimaged, I messed up dowtimes in icinga, please ignore [15:05:38] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.2067 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [15:08:14] 10Operations, 10Citoid, 10Services, 10Service-deployment-requests, and 2 others: Deploy translation-server-v2 - https://phabricator.wikimedia.org/T201611 (10mobrovac) [15:08:38] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1105 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [15:10:48] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1366 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [15:11:33] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1003 raid warning - https://phabricator.wikimedia.org/T200203 (10Cmjohnson) @Andrew and @robh I replaced the disk with a SSD. Let me know if it works [15:12:42] ACKNOWLEDGEMENT - HP RAID on labvirt1003 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, 1I:1:13, 2I:1:14, 2I:1:15, 2I:1:16, 2I:1:18 - Failed: 2I:1:17 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T201616 [15:12:44] 10Operations, 10ops-eqiad: Degraded RAID on labvirt1003 - https://phabricator.wikimedia.org/T201616 (10ops-monitoring-bot) [15:13:48] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1044 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [15:13:52] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2015.codfw.wmnet'] ``` and were **ALL** successful. [15:14:02] (03PS1) 10Muehlenhoff: Enable microcode on spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/451646 (https://phabricator.wikimedia.org/T127825) [15:14:31] !log mobrovac@deploy1001 Finished deploy [citoid/deploy@92e5071]: Record format stats (duration: 17m 07s) [15:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:48] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.04246 https://grafana.wikimedia.org/dashboard/db/logstash [15:21:30] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (watching), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10mobrovac) >>! In T197242#4441138, @faidon wrote: > Is there any progress and/or timeline for this? Thank... [15:21:49] logstash1008 recovered on it’s own. interestingly there was a gc spike in https://grafana.wikimedia.org/dashboard/db/jvm-overview-work-in-progress-gehel on the same host in similar time frame [15:23:09] herron: I'll have a look at the GC logs now that we have them... 
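Since gehel mentions looking at the logstash GC logs, a quick way to eyeball stop-the-world pause times from a JVM GC log is to pull out the wall-clock `real=` field of each collection. The log path on the logstash hosts is an assumption, and this presumes the usual `-XX:+PrintGCDetails`-style log format:

```
# longest recent GC pauses, in seconds (wall-clock time per collection)
grep -hoP 'real=\K[0-9.]+' /var/log/logstash/gc.log* | sort -rn | head
```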
[15:23:44] cool sounds good [15:24:08] I have some prometheus logstash metrics to set up as well, hopefully those will be useful as well [15:24:10] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (watching), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10mobrovac) [15:29:26] herron: looks like there is semi-regular high GC activity [15:30:33] * gehel is wondering if we should experiment with G1 [15:31:06] it definitely looks like the ingestion workload isn't smooth at all [15:31:29] but I can't correlate that with the insertion rate in grafana [15:31:43] might be on a specific ingester, which is more expensive than others? [15:31:52] or output? [15:32:01] * gehel is looking at those API feature logs [15:35:13] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1709 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [15:36:14] yeah it might be worth experimenting with G1 [15:36:30] fwiw I also bumped up the heap sizes, though whats made the most difference was the persistent queueing afaict [15:36:50] first from 256m to 512m, then again from 512m to 1g the next day [15:37:03] 10Operations, 10ops-codfw, 10ops-eqiad, 10netops: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519 (10Cmjohnson) [15:37:03] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:39:55] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Logstash packet loss - https://phabricator.wikimedia.org/T200960 (10Gehel) Looking at GC logs, it seems that we have a correlation between GC activity and packet loss (at least to some extend) Heap Occupancy After GC: {F24734444} Looking at GC logs ov... [15:40:04] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:44:15] 10Operations, 10ops-codfw, 10ops-eqiad, 10netops: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519 (10Cmjohnson) @ayounsi assigning this to you. Everything has been updated on the switch, verified what they were and disabled the ports. asw-b-eqiad.mgmt.eqiad.wmnet ge-4/0... [15:45:34] 10Operations, 10ops-eqiad: Degraded RAID on labvirt1003 - https://phabricator.wikimedia.org/T201616 (10Cmjohnson) [15:45:38] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1003 raid warning - https://phabricator.wikimedia.org/T200203 (10Cmjohnson) [15:46:07] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1003 raid warning - https://phabricator.wikimedia.org/T200203 (10Cmjohnson) 05duplicate>03Open [15:47:58] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2016.codfw.wmnet', 'elastic2017... 
[15:48:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom zinc/WMF3298 - https://phabricator.wikimedia.org/T191352 (10Cmjohnson) [15:48:05] 10Operations, 10ops-eqiad: Degraded RAID on labvirt1003 - https://phabricator.wikimedia.org/T201616 (10Cmjohnson) [15:52:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1003 raid warning - https://phabricator.wikimedia.org/T200203 (10Bstorm) Status is listed as "failed" at the moment on the web interface. I'll check if there is anything else to be found. [15:53:14] PROBLEM - Device not healthy -SMART- on labvirt1003 is CRITICAL: cluster=labvirt device=cciss,17 instance=labvirt1003:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labvirt1003&var-datasource=eqiad%2520prometheus%252Fops [15:54:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom zinc/WMF3298 - https://phabricator.wikimedia.org/T191352 (10Cmjohnson) [15:55:29] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1003 raid warning - https://phabricator.wikimedia.org/T200203 (10Bstorm) Yeah, hpssacli says similar. Doesn't seem to like that drive: ``` physicaldrive 2I:1:17 (port 2I:box 1:bay 17, Solid State SATA, 300.0 GB, Failed) ``` [15:55:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom vanadium/WMF3291 - https://phabricator.wikimedia.org/T191351 (10Cmjohnson) [15:56:14] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:56:14] 10Operations, 10ops-eqiad, 10DC-Ops: Decommission Vanadium - https://phabricator.wikimedia.org/T182015 (10Cmjohnson) 05Open>03Resolved [15:57:33] RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 34 ESP OK [15:58:14] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:00:04] godog, moritzm, and _joe_: #bothumor I � Unicode. All rise for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180809T1600). [16:00:05] Krenair: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:01:31] (03PS1) 10Cmjohnson: Remvong mgmt dns for decom host copper [dns] - 10https://gerrit.wikimedia.org/r/451652 (https://phabricator.wikimedia.org/T176957) [16:02:38] (03CR) 10Cmjohnson: [C: 032] Remvong mgmt dns for decom host copper [dns] - 10https://gerrit.wikimedia.org/r/451652 (https://phabricator.wikimedia.org/T176957) (owner: 10Cmjohnson) [16:02:38] (03PS2) 10Cmjohnson: Remvong mgmt dns for decom host copper [dns] - 10https://gerrit.wikimedia.org/r/451652 (https://phabricator.wikimedia.org/T176957) [16:02:40] (03CR) 10Cmjohnson: [V: 032 C: 032] Remvong mgmt dns for decom host copper [dns] - 10https://gerrit.wikimedia.org/r/451652 (https://phabricator.wikimedia.org/T176957) (owner: 10Cmjohnson) [16:04:37] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1003 raid warning - https://phabricator.wikimedia.org/T200203 (10Cmjohnson) it's a SATA disk and they have SAS disks. 
I will look around but I don't think I have a 2.5" spare SAS disk [16:05:37] 10Operations, 10ops-eqiad, 10Packaging, 10decommission, 10Patch-For-Review: Decommission host copper.eqiad.wmnet - https://phabricator.wikimedia.org/T176957 (10Cmjohnson) [16:07:14] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473 (10Cmjohnson) [16:07:19] 10Operations, 10ops-eqiad, 10Packaging, 10decommission, 10Patch-For-Review: Decommission host copper.eqiad.wmnet - https://phabricator.wikimedia.org/T176957 (10Cmjohnson) 05Open>03Resolved [16:08:23] moritzm, hey [16:10:52] huh [16:11:05] looks like none of the deployers showed up to the window [16:12:08] last time stuff was on fire and no one even looked at my changes [16:12:27] 10Operations, 10ops-eqiad, 10PoolCounter, 10decommission, 10Patch-For-Review: Decommision poolcounter1002 - https://phabricator.wikimedia.org/T193025 (10Cmjohnson) [16:12:33] 10Operations, 10ops-eqiad, 10PoolCounter, 10decommission, 10Patch-For-Review: Decommision poolcounter1002 - https://phabricator.wikimedia.org/T193025 (10Cmjohnson) 05Open>03Resolved [16:12:48] this time no one shows up? [16:12:52] greg-g, any idea what's going on? [16:13:21] puppetswat you mean? [16:13:25] yes [16:13:47] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2017.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['elastic2017.codfw.wmnet... [16:13:50] (03PS1) 10Ema: ATS: outbound TLS connections to the appservers [puppet] - 10https://gerrit.wikimedia.org/r/451654 (https://phabricator.wikimedia.org/T199720) [16:13:58] Well, there's still quite a bit of time left in the window [16:14:11] well also, I donno about the patchlist [16:14:17] true [16:14:27] you've got 6 in there with the header "I don't expect all of these to actually be merged but some review might be nice:" [16:14:43] but normally in swat windows people are there at the beginning of the window [16:14:50] I haven't done puppetswat in a whlie, but last I recall it was for merging trivial backlogged stuff, not detailed review of risky patches? 
[16:14:59] i'll merge one if it doesnt mean i have to merge the other 5 [16:15:18] (03PS2) 10Dzahn: toollabs::proxy: Fix logo_height [puppet] - 10https://gerrit.wikimedia.org/r/448999 (owner: 10Alex Monk) [16:15:32] bblack, they're not all risky [16:15:47] ok [16:16:11] (03CR) 10Dzahn: [C: 032] toollabs::proxy: Fix logo_height [puppet] - 10https://gerrit.wikimedia.org/r/448999 (owner: 10Alex Monk) [16:19:34] PROBLEM - Disk space on elastic1023 is CRITICAL: DISK CRITICAL - free space: /srv 52558 MB (10% inode=99%) [16:20:04] RECOVERY - HTTPS-policy on policy.wikimedia.org is OK: SSL OK - Certificate policy.wikimedia.org valid until 2018-11-07 15:18:05 +0000 (expires in 89 days) [16:22:06] like the exim one at the top, you can probably check with puppet-compiler whether that's effectively a no-op in prod [16:22:33] Krenair: doing that now actually [16:23:14] the next one adds some optional stuff within labs that you still have to turn on to change anything [16:23:56] the third moves a few strings (domain names) to hiera, and should be a no-op in prod, checkable with puppet compiler [16:25:07] 4 is slightly complicated to explain but concerns a file that I think exists only in labs puppetmasters, though it's old and I haven't checked it recently [16:25:28] 5 is fixing the height of a logo on a couple of toollabs error pages to not be squished, mutante took care of that [16:26:16] looking at the url-downloader thing [16:26:31] 6 is trying to make the urldownloader puppet stuff compatible with the squid package in stretch (squid3 -> squid in a bunch of places), I also tried at one point to ask people what distribution the prod hosts using that role ran [16:26:46] herron: thanks [16:27:44] these changes aren't quite 'update outdated list of deployment-prep hosts', but they're also not 'touch prod apache/varnish/etc. config' [16:28:09] I am perfectly happy for someone to go through the list and say each one is ineligible for puppet swat [16:28:21] if they really believe that [16:28:55] the one for cumin, would be nice if it had a +1 from volans, since i see he reviewed it before and had some comments [16:29:24] but what I would appreciate in general is clarity on who should review each thing (mutante has provided that to me in this case) [16:29:33] (03PS1) 10Ema: Lower TTL of cache_misc sites to 30s [dns] - 10https://gerrit.wikimedia.org/r/451656 (https://phabricator.wikimedia.org/T164609) [16:29:43] RECOVERY - Disk space on elastic1023 is OK: DISK OK [16:29:49] mutante, cool, I'll chat to him about that one then [16:30:59] Krenair: squid change compiler output: http://puppet-compiler.wmflabs.org/12029/alsafi.wikimedia.org/ seeing if we really expect that diff [16:31:29] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1003 raid warning - https://phabricator.wikimedia.org/T200203 (10Bstorm) Yeah, it seems likely we are going to be buying a disk @RobH [16:31:53] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 34 ESP OK [16:32:54] oh noes, whitespace [16:33:40] Krenair: indeed it is a noop in prod, but sadly I don’t have time at the moment to review in detail and be on standby after merge. Sorry about that. 
I made a note to myself to follow up on this [16:34:07] herron, that's great anyway, thank you [16:34:18] sure np, will be in touch later [16:34:45] 10Operations, 10netops, 10Wikimedia-Incident: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (10ayounsi) Suggested plan to have asw2-a-eqiad similar to codfw: {F24734668} Need to be added: fpc1-fpc3 fpc3-fpc4 fpc5-fpc6 Need to be removed: fpc2-fpc4 fpc6-fpc... [16:36:13] mutante, alright one sec [16:37:42] (03PS2) 10Ema: Lower TTL of cache_misc sites to 30s [dns] - 10https://gerrit.wikimedia.org/r/451656 (https://phabricator.wikimedia.org/T164609) [16:37:49] (03PS6) 10Alex Monk: url_downloader/squid3: Work on stretch [puppet] - 10https://gerrit.wikimedia.org/r/450486 [16:38:39] (03CR) 10BBlack: [C: 031] Lower TTL of cache_misc sites to 30s [dns] - 10https://gerrit.wikimedia.org/r/451656 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [16:38:50] jouncebot: next [16:38:51] In 0 hour(s) and 21 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180809T1700) [16:39:14] (03CR) 10Ema: [C: 032] Lower TTL of cache_misc sites to 30s [dns] - 10https://gerrit.wikimedia.org/r/451656 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [16:43:34] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. [16:44:34] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. [16:44:48] ^ ignore that (still) [16:44:54] we should probably file a task to fix it up [16:45:29] [doing] [16:45:36] bblack: is that because "total requests" is taking PURGE into account? [16:46:04] (03PS7) 10Alex Monk: url_downloader/squid3: Work on stretch [puppet] - 10https://gerrit.wikimedia.org/r/450486 [16:47:10] (03CR) 10Arturo Borrero Gonzalez: "Preliminary compilation test seems OK:" [puppet] - 10https://gerrit.wikimedia.org/r/451314 (https://phabricator.wikimedia.org/T201504) (owner: 10Arturo Borrero Gonzalez) [16:48:15] 10Operations, 10monitoring: False alarms on varnish-http-requests 70% GET drop in 30 min alert - https://phabricator.wikimedia.org/T201630 (10BBlack) p:05Triage>03Normal [16:48:50] (03PS1) 10Bstorm: labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) [16:49:07] (03PS8) 10Dzahn: url_downloader/squid3: Work on stretch [puppet] - 10https://gerrit.wikimedia.org/r/450486 (owner: 10Alex Monk) [16:49:07] bblack: yeah so the combo of site depooled+purge spikes probably [16:49:13] ema: I don't think so, I think it's looking at "rate-of-GET" vs same number 30 mins back, and the variance is just naturally-large when most real traffic is depooled [16:49:24] (03CR) 10jerkins-bot: [V: 04-1] labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) (owner: 10Bstorm) [16:49:27] (03CR) 10Dzahn: [C: 032] "noop now: https://puppet-compiler.wmflabs.org/compiler03/12032/" [puppet] - 10https://gerrit.wikimedia.org/r/450486 (owner: 10Alex Monk) [16:49:49] PURGE shouldn't match GET, I hope! 
:) [16:50:31] bblack: ah yes, the alert is not based on 'total requests', but on `varnish_requests:rate5m{method="GET" [...]` [16:51:48] (03PS1) 10Ema: Use cache_text instead of cache_misc for all sites [dns] - 10https://gerrit.wikimedia.org/r/451659 (https://phabricator.wikimedia.org/T164609) [16:52:19] (03PS2) 10Ema: Use cache_text instead of cache_misc for all misc sites [dns] - 10https://gerrit.wikimedia.org/r/451659 (https://phabricator.wikimedia.org/T164609) [16:53:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1003 raid warning - https://phabricator.wikimedia.org/T200203 (10Bstorm) Note: The old disk was a predictive failure. @Cmjohnson it could probably be put back for the time being just to prevent a more serious issue before we can get a repl... [16:54:07] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission mw2017 - https://phabricator.wikimedia.org/T187467 (10Papaul) [16:54:21] 10Operations, 10Traffic, 10monitoring: False alarms on varnish-http-requests 70% GET drop in 30 min alert - https://phabricator.wikimedia.org/T201630 (10ema) [16:55:47] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (10ema) [16:56:52] (03CR) 10EBernhardson: [C: 031] "While in theory i don't think there is anything else really equivalent, In practice i agree. I don't think I've ever seen this alert for a" [puppet] - 10https://gerrit.wikimedia.org/r/451583 (owner: 10Gehel) [16:57:48] (03PS2) 10Bstorm: labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) [16:58:03] (03CR) 10Volans: [C: 04-1] "A couple of things still to fix/modify, see inline." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [16:58:26] (03CR) 10jerkins-bot: [V: 04-1] labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) (owner: 10Bstorm) [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: Time to snap out of that daydream and deploy Services – Graphoid / Parsoid / Citoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180809T1700). [17:06:00] The toollabs change worked btw [17:06:06] logo on https://tools.wmflabs.org/.error/technicalissues/ is no longer squished [17:06:34] ORES is dropping a minor release today, T201518 [17:06:35] T201518: ORES deployment (Early August) - https://phabricator.wikimedia.org/T201518 [17:08:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1003 raid warning - https://phabricator.wikimedia.org/T200203 (10RobH) Ok, so we'll need to buy some 300GB SFF SAS disks, correct? I'll create a #procurement task and link to this. 
[17:09:04] (03PS2) 10Framawiki: Add media.farsnews.com to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450391 (https://phabricator.wikimedia.org/T200872) [17:09:32] (03PS1) 10Gergő Tisza: Add temporary debug logging to ThumbnailRenderJob [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451666 (https://phabricator.wikimedia.org/T201305) [17:10:15] (03PS3) 10Bstorm: labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) [17:10:53] (03CR) 10jerkins-bot: [V: 04-1] labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) (owner: 10Bstorm) [17:11:41] (03CR) 10Alex Monk: cumin: Allow Puppet DB backend to be used within Labs projects that use it (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [17:12:50] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission mw2017 - https://phabricator.wikimedia.org/T187467 (10Papaul) switch port information ge-3/0/16 ``` show interfaces ge-3/0/16 Physical interface: ge-3/0/16, Administratively down, Physical link is Down Interface index: 1187,... [17:13:07] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission mw2017 - https://phabricator.wikimedia.org/T187467 (10Papaul) [17:14:48] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@6581b28]: Update mobileapps to 616ffef (T191640) [17:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:54] T191640: Create an announcement card promoting the new multilingual feature on the Explore feed - https://phabricator.wikimedia.org/T191640 [17:16:38] (03PS4) 10Bstorm: labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) [17:17:13] (03CR) 10jerkins-bot: [V: 04-1] labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) (owner: 10Bstorm) [17:18:42] (03PS5) 10Bstorm: labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) [17:20:08] (03PS1) 10Papaul: DNS: Remove mgmt DNS for mw2017 [dns] - 10https://gerrit.wikimedia.org/r/451668 [17:20:11] !log awight@deploy1001 Started deploy [ores/deploy@1712122]: fawiki wp10; ORES update: T201518 [17:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:18] T201518: ORES deployment (Early August) - https://phabricator.wikimedia.org/T201518 [17:21:22] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@6581b28]: Update mobileapps to 616ffef (T191640) (duration: 06m 35s) [17:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:29] T191640: Create an announcement card promoting the new multilingual feature on the Explore feed - https://phabricator.wikimedia.org/T191640 [17:25:12] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) Suggestion: maybe just have these generate long-term to the generic sha... 
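Circling back to the varnish-http-requests false alarm discussed above: the check compares the current GET rate with the rate 30 minutes earlier, so a manual sanity check against Prometheus might look roughly like this. Only the recording-rule name comes from the chat; the Prometheus endpoint and label set are assumptions:

```
# ratio of the current GET rate to the rate 30 minutes ago;
# a value around 0.3 or lower corresponds to the "70% GET drop" condition
curl -sG 'http://prometheus.svc.eqiad.wmnet/ops/api/v1/query' \
  --data-urlencode 'query=sum(varnish_requests:rate5m{method="GET"}) / sum(varnish_requests:rate5m{method="GET"} offset 30m)'
```

With most real traffic depooled from a site, that ratio swings widely on its own, which matches bblack's explanation of why the alert fires spuriously.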
[17:25:37] (03CR) 10Dzahn: [C: 032] DNS: Remove mgmt DNS for mw2017 [dns] - 10https://gerrit.wikimedia.org/r/451668 (owner: 10Papaul) [17:26:33] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission mw2017 - https://phabricator.wikimedia.org/T187467 (10Dzahn) [17:26:54] !log mforns@deploy1001 Started deploy [analytics/refinery@9f4267c]: deploy refinery together with refinery-source v0.0.68 [17:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:52] (03CR) 10Mobrovac: [C: 031] Add temporary debug logging to ThumbnailRenderJob [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451666 (https://phabricator.wikimedia.org/T201305) (owner: 10Gergő Tisza) [17:28:25] ORES canary is verified working, continuing with our deployment… [17:31:54] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={GET,LIST,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:32:23] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={GET,LIST,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:32:34] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={GET,LIST,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:32:34] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:33:09] 10Operations, 10ops-eqiad, 10monitoring: rack/setup/install monitor1001.wikimedia.org - https://phabricator.wikimedia.org/T201344 (10Cmjohnson) okay, I am changing names to icinga1001 [17:33:14] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:33:17] 10Operations, 10ops-eqiad, 10monitoring: rack/setup/install icinga1001.wikimedia.org - https://phabricator.wikimedia.org/T201344 (10Cmjohnson) [17:33:24] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:34:14] PROBLEM - Device not healthy -SMART- on cp2009 is CRITICAL: cluster=misc device={sdc,sdd,sde} instance=cp2009:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp2009&var-datasource=codfw%2520prometheus%252Fops [17:35:17] !log mforns@deploy1001 Finished deploy [analytics/refinery@9f4267c]: deploy refinery together with refinery-source v0.0.68 (duration: 08m 22s) [17:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:39] herron: just the kubernetes-api checks that were considered unrelated.. right [17:35:44] argone/chlorine/neon/etcd/... ? [17:36:05] k8s right [17:36:06] ok [17:36:37] yes afaik [17:37:02] and I don't think cp2009 is even supposed to have devices sdc,sdd,sde... looking [17:37:44] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:38:04] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:38:43] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:39:23] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:39:33] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:39:33] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:39:47] awight: did you just finish ORES deploy? [17:43:22] (03PS1) 10Dzahn: squid3: use separate files for logrotate snippets, squid vs squid3 [puppet] - 10https://gerrit.wikimedia.org/r/451671 [17:44:05] ACKNOWLEDGEMENT - Device not healthy -SMART- on cp2009 is CRITICAL: cluster=misc device={sdc,sdd,sde} instance=cp2009:9100 job=node site=codfw Brandon Black ATS test host, non-critical but strange, will ping ema https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp2009&var-datasource=codfw%2520prometheus%252Fops [17:44:38] (03PS13) 10Gehel: Add mjolnir kafka daemon to primary elasticsearch clusters [puppet] - 10https://gerrit.wikimedia.org/r/445254 (https://phabricator.wikimedia.org/T198490) (owner: 10EBernhardson) [17:44:57] jouncebot: next [17:44:57] In 0 hour(s) and 15 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180809T1800) [17:45:27] (03PS1) 10Volans: wmf-auto-reimage: fix get message from cumin [puppet] - 10https://gerrit.wikimedia.org/r/451673 [17:45:35] mutante: Uh oh, no it’s still chugging along, at 70% fetched. [17:45:44] * awight squints at graphs [17:46:16] (03CR) 10Gehel: "LGTM, trivial enough" [puppet] - 10https://gerrit.wikimedia.org/r/451673 (owner: 10Volans) [17:46:40] mutante: You’re probably not going to get in my way, if you’re asking cos you’re waiting to do something…? [17:46:41] (03CR) 10Volans: [C: 032] wmf-auto-reimage: fix get message from cumin [puppet] - 10https://gerrit.wikimedia.org/r/451673 (owner: 10Volans) [17:47:03] PROBLEM - Check Varnish expiry mailbox lag on cp1076 is CRITICAL: CRITICAL: expiry mailbox lag is 2073256 [17:48:21] (03CR) 10Muehlenhoff: [C: 032] Enable microcode on spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/451646 (https://phabricator.wikimedia.org/T127825) (owner: 10Muehlenhoff) [17:48:22] awight: no, it wasn't that, i wast just curious if the time matches those kubernetes alerts.. even though i realized it's not on kubernetes yet [17:48:30] (03PS2) 10Muehlenhoff: Enable microcode on spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/451646 (https://phabricator.wikimedia.org/T127825) [17:48:41] it started right after the deploy was announced [17:49:04] (03CR) 10Muehlenhoff: [V: 032 C: 032] Enable microcode on spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/451646 (https://phabricator.wikimedia.org/T127825) (owner: 10Muehlenhoff) [17:51:29] mutante: ah interesting. I could have been soaking bandwidth, this is a yuge and IO-limited git pull, maybe 5GB each onto 18 nodes. But it’s still in the fetch stage, so that would have been the *only* possible side-effect. [17:51:51] (5GB is worst-case, git caching is hopefully reducing that by some large degree...) 
[17:52:09] !log awight@deploy1001 Finished deploy [ores/deploy@1712122]: fawiki wp10; ORES update: T201518 (duration: 31m 58s) [17:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:16] T201518: ORES deployment (Early August) - https://phabricator.wikimedia.org/T201518 [17:53:10] 10Operations, 10TCB-Team, 10wikidiff2, 10WMDE-QWERTY-Sprint-2018-07-17, 10WMDE-QWERTY-Sprint-2018-07-31: Update wikidiff2 library on the WMF production cluster to v1.7.2 - https://phabricator.wikimedia.org/T199801 (10MoritzMuehlenhoff) >>! In T199801#4471991, @WMDE-Fisch wrote: > Just as a heads-up: We'r... [17:53:18] mutante: FYI ^ in case you see anything else. Cheers! [17:53:37] (03PS14) 10Gehel: Add mjolnir kafka daemon to primary elasticsearch clusters [puppet] - 10https://gerrit.wikimedia.org/r/445254 (https://phabricator.wikimedia.org/T198490) (owner: 10EBernhardson) [17:53:43] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1048_v4, cp1048_v6, cp1049_v4, cp1049_v6, cp1050_v4, cp1050_v6, cp1062_v4, cp1062_v6, cp1063_v4, cp1063_v6, cp1064_v4, cp1064_v6, cp1071_v4, cp1071_v6, cp1072_v4, cp1072_v6, cp1073_v4, cp1073_v6, cp1074_v4, cp1074_v6, cp1099_v4, cp1099_v6 [17:54:03] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1048_v4, cp1048_v6, cp1049_v4, cp1049_v6, cp1050_v4, cp1050_v6, cp1062_v4, cp1062_v6, cp1063_v4, cp1063_v6, cp1064_v4, cp1064_v6, cp1071_v4, cp1071_v6, cp1072_v4, cp1072_v6, cp1073_v4, cp1073_v6, cp1074_v4, cp1074_v6, cp1099_v4, cp1099_v6 [17:54:04] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1052_v4, cp1052_v6, cp1053_v4, cp1053_v6, cp1054_v4, cp1054_v6, cp1055_v4, cp1055_v6, cp1065_v4, cp1065_v6, cp1066_v4, cp1066_v6, cp1067_v4, cp1067_v6, cp1068_v4, cp1068_v6 [17:54:04] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1048_v4, cp1048_v6, cp1049_v4, cp1049_v6, cp1050_v4, cp1050_v6, cp1062_v4, cp1062_v6, cp1063_v4, cp1063_v6, cp1064_v4, cp1064_v6, cp1071_v4, cp1071_v6, cp1072_v4, cp1072_v6, cp1073_v4, cp1073_v6, cp1074_v4, cp1074_v6, cp1099_v4, cp1099_v6 [17:54:14] ah, downtime expiries [17:54:14] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1048_v4, cp1048_v6, cp1049_v4, cp1049_v6, cp1050_v4, cp1050_v6, cp1062_v4, cp1062_v6, cp1063_v4, cp1063_v6, cp1064_v4, cp1064_v6, cp1071_v4, cp1071_v6, cp1072_v4, cp1072_v6, cp1073_v4, cp1073_v6, cp1074_v4, cp1074_v6, cp1099_v4, cp1099_v6 [17:54:15] !log installing gnupg security updates on trusty (Debian already fixed) [17:54:18] sorry for the spam, ipsec is awful [17:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:24] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1048_v4, cp1048_v6, cp1049_v4, cp1049_v6, cp1050_v4, cp1050_v6, cp1062_v4, cp1062_v6, cp1063_v4, cp1063_v6, cp1064_v4, cp1064_v6, cp1071_v4, cp1071_v6, cp1072_v4, cp1072_v6, cp1073_v4, cp1073_v6, cp1074_v4, cp1074_v6, cp1099_v4, cp1099_v6 [17:54:26] will re-downtime these [17:54:31] (03CR) 10Gehel: [C: 032] Add mjolnir kafka daemon to primary elasticsearch clusters [puppet] - 10https://gerrit.wikimedia.org/r/445254 (https://phabricator.wikimedia.org/T198490) (owner: 10EBernhardson) [17:54:34] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1052_v4, cp1052_v6, cp1053_v4, cp1053_v6, cp1054_v4, cp1054_v6, cp1055_v4, cp1055_v6, cp1065_v4, cp1065_v6, cp1066_v4, cp1066_v6, cp1067_v4, cp1067_v6, 
cp1068_v4, cp1068_v6 [17:54:34] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp1052_v4, cp1052_v6, cp1053_v4, cp1053_v6, cp1054_v4, cp1054_v6, cp1055_v4, cp1055_v6, cp1065_v4, cp1065_v6, cp1066_v4, cp1066_v6, cp1067_v4, cp1067_v6, cp1068_v4, cp1068_v6 [17:56:08] re-downtimed through mid-monday [17:58:03] PROBLEM - Check systemd state on relforge1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:58:51] !log connecting asw2-a5-eqiad to asw2-a6-eqiad - T201145 [17:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:21] T201145: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 [17:59:44] PROBLEM - Check systemd state on elastic2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:00:12] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180809T1800). [18:00:12] framawiki and tgr: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:23] o/ [18:00:24] PROBLEM - Check systemd state on relforge1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:00:33] PROBLEM - Check systemd state on elastic1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:00:34] PROBLEM - Check systemd state on elastic1033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:00:34] PROBLEM - Check systemd state on elastic1047 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:01:26] PROBLEM - Check systemd state on elastic1045 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:01:36] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:01:46] PROBLEM - Check systemd state on elastic1050 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:02:05] PROBLEM - Check systemd state on elastic1043 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:02:06] PROBLEM - Check systemd state on elastic2007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:02:42] (03PS1) 10BBlack: cp1080: fix macaddr typo [puppet] - 10https://gerrit.wikimedia.org/r/451677 [18:02:45] PROBLEM - Check systemd state on elastic1026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:02:53] (03CR) 10BBlack: [V: 032 C: 032] cp1080: fix macaddr typo [puppet] - 10https://gerrit.wikimedia.org/r/451677 (owner: 10BBlack) [18:02:55] PROBLEM - Check systemd state on elastic2035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:03:05] elastic is me, config issue with latest merge, silencing, no impact yet [18:03:10] gehel: seems that mjolnir-kafka-bulk-daemon.service is dead [18:03:12] thx [18:03:15] RECOVERY - Check systemd state on relforge1002 is OK: OK - running: The system is fully operational [18:03:16] ah got it [18:03:26] yeah, bad puppet merge :/ [18:03:36] relforge? [18:03:46] PROBLEM - Check systemd state on elastic1048 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:03:46] relforge as well, same puppet merge [18:03:55] PROBLEM - Check systemd state on elastic1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:03:58] sorry for the spam [18:04:06] PROBLEM - Check systemd state on elastic2011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:04:15] PROBLEM - Check systemd state on elastic1025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:04:16] PROBLEM - Check systemd state on elastic1017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:04:26] Anybody for the current swat window ? [18:04:50] I can SWAT. [18:05:32] Wait what, the keys have changed. [18:05:36] PROBLEM - Check systemd state on elastic2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:05:41] * Niharika looks for email [18:06:10] Niharika bast1002 has been reinstalled if that's what you were referring to [18:06:35] volans: I'm getting host key changed on deployment.eqiad.wmnet [18:06:46] Which is separate from bast1002, right? [18:06:59] yes but you connect through the bastion [18:07:11] Ah, right. I forgot that part. [18:07:15] PROBLEM - Check systemd state on elastic1031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:07:16] PROBLEM - Check systemd state on elastic2033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:07:19] (03PS1) 10BBlack: Revert "cp1080: remove from conftool/hieradata lists" [puppet] - 10https://gerrit.wikimedia.org/r/451678 (https://phabricator.wikimedia.org/T201174) [18:07:25] Niharika: depends which bastion do you have configured ofc [18:07:26] PROBLEM - Check systemd state on elastic1040 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:07:35] PROBLEM - Check systemd state on elastic2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:07:56] Niharika: the deployment host is deploy1001, dunno when was last time you logged to it, is recent but not brand new :) [18:08:14] Niharika: here's the new one https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/bast1002.wikimedia.org [18:08:16] PROBLEM - Check systemd state on elastic1020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:08:34] Niharika: if it matches that it's fine.. it had to be reinstalled [18:08:46] PROBLEM - Check systemd state on elastic1035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:09:38] volans: Yep but using deployment.eqiad.wmnet reroutes to deploy1001 (or whichever is current) AFAIK. I used it a couple weeks ago. [18:09:44] mutante: Thanks, that worked. [18:10:05] PROBLEM - Check systemd state on elastic1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
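The host-key warning Niharika hit above is the expected fallout of the bast1002 reinstall: the old key cached locally no longer matches. The usual fix is to drop the stale entry and re-verify the new fingerprint against the wikitech page mutante linked, for example:

  ssh-keygen -R bast1002.wikimedia.org
  # reconnect, then compare the offered fingerprint against
  # https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/bast1002.wikimedia.org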
[18:10:26] PROBLEM - Check systemd state on elastic1028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:10:26] PROBLEM - Check systemd state on elastic2024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:10:33] (03PS1) 10Cmjohnson: Adding mgmt dns for several new servers [dns] - 10https://gerrit.wikimedia.org/r/451680 (https://phabricator.wikimedia.org/T201343) [18:10:45] ^ systemd on elastic* those are all fine, i mucked up something and am preping a deploy to a service [18:10:56] (03CR) 10jerkins-bot: [V: 04-1] Adding mgmt dns for several new servers [dns] - 10https://gerrit.wikimedia.org/r/451680 (https://phabricator.wikimedia.org/T201343) (owner: 10Cmjohnson) [18:11:28] (03CR) 10Niharika29: [C: 032] "SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450391 (https://phabricator.wikimedia.org/T200872) (owner: 10Framawiki) [18:11:46] PROBLEM - Check systemd state on elastic2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:11:46] PROBLEM - Check systemd state on elastic2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:12:05] tgr: You around? [18:12:12] Niharika: here [18:12:16] PROBLEM - Check systemd state on elastic1046 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:12:24] Ack. [18:12:31] Niharika: note that I'll not be able verify that the patch works as expected (except that Commons still loads) :) [18:12:35] PROBLEM - Check systemd state on elastic2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:12:35] PROBLEM - Check systemd state on elastic2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:12:36] PROBLEM - Check systemd state on elastic1052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:12:46] PROBLEM - Check systemd state on elastic1029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:12:50] framawiki: Got it. I'll sync it directly. [18:13:00] (03Merged) 10jenkins-bot: Add media.farsnews.com to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450391 (https://phabricator.wikimedia.org/T200872) (owner: 10Framawiki) [18:13:01] framawiki: Thanks for the patch. :) [18:13:30] (03CR) 10jenkins-bot: Add media.farsnews.com to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450391 (https://phabricator.wikimedia.org/T200872) (owner: 10Framawiki) [18:13:55] PROBLEM - Check systemd state on elastic1042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:13:56] PROBLEM - Check systemd state on elastic1032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:14:05] PROBLEM - Check systemd state on elastic2026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:14:45] PROBLEM - Check systemd state on elastic2034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:15:36] PROBLEM - Check systemd state on elastic1034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:15:52] !log niharika29@deploy1001 sync-file aborted: (no justification provided) (duration: 00m 01s) [18:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:16] PROBLEM - Check systemd state on elastic1037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:16:26] PROBLEM - Check systemd state on elastic2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:16:45] PROBLEM - Check systemd state on elastic1044 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:17:15] PROBLEM - Check systemd state on elastic2025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:17:20] !log niharika29@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add media.farsnews.com to wgCopyUploadDomains T200872 (duration: 01m 00s) [18:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:26] PROBLEM - Check systemd state on elastic1049 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:17:26] T200872: Please add farsnews.com to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T200872 [18:17:45] PROBLEM - Check systemd state on elastic2027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:17:55] PROBLEM - Check systemd state on elastic1023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:17:56] PROBLEM - Check systemd state on elastic2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:18:25] thanks Niharika ! [18:18:31] framawiki: You're welcome! [18:18:55] PROBLEM - Check systemd state on elastic1027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:18:55] PROBLEM - Check systemd state on elastic2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:18:56] PROBLEM - Check systemd state on elastic2010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:19:39] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@22a50af]: temp fix while waiting for jenkins [18:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:06] PROBLEM - Check systemd state on elastic1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:20:56] PROBLEM - Check systemd state on elastic1038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:21:00] 10Operations, 10monitoring, 10Patch-For-Review: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible - https://phabricator.wikimedia.org/T170150 (10Mvolz) I've tried to log-in with my LDAP credentials and couldn't. I've tried every username/email/password combo, I also tried the rese... [18:21:06] PROBLEM - Check systemd state on elastic2028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:21:06] PROBLEM - Check systemd state on elastic2023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:23:06] PROBLEM - Check systemd state on elastic2022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
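The aborted sync ("no justification provided") followed by the successful one above reflects scap's requirement that every sync carry a log message. From the deployment host the invocation looks roughly like the following; the file and message are taken from the log, while the staging path is the conventional one and may differ:

  cd /srv/mediawiki-staging
  scap sync-file wmf-config/InitialiseSettings.php 'Add media.farsnews.com to wgCopyUploadDomains T200872'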
[18:23:14] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@22a50af]: temp fix while waiting for jenkins (duration: 03m 35s) [18:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:26] PROBLEM - Check systemd state on elastic2032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:23:35] PROBLEM - Check systemd state on elastic1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:23:36] PROBLEM - Check systemd state on elastic2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:23:45] PROBLEM - Check systemd state on elastic2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:24:25] PROBLEM - Check systemd state on elastic2029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:24:35] PROBLEM - Check systemd state on elastic2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:24:35] PROBLEM - Check systemd state on elastic2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:24:36] PROBLEM - Check systemd state on elastic2036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:24:56] PROBLEM - Check systemd state on elastic1041 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:25:15] PROBLEM - Check systemd state on elastic2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:25:55] PROBLEM - Check systemd state on elastic1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:26:13] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@39be980]: Repair daemons failing to load logging config [18:26:16] PROBLEM - Check systemd state on elastic1019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:05] PROBLEM - Check systemd state on elastic2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:28:15] PROBLEM - Check systemd state on elastic2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:28:16] PROBLEM - Check systemd state on elastic1051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:29:06] RECOVERY - Check systemd state on elastic1026 is OK: OK - running: The system is fully operational [18:29:06] RECOVERY - Check systemd state on elastic1048 is OK: OK - running: The system is fully operational [18:29:46] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@39be980]: Repair daemons failing to load logging config (duration: 03m 33s) [18:29:49] acknowledgment spam coming, sorry... [18:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:00] Gah. Zuul is VERY backlogged. [18:30:06] RECOVERY - Check systemd state on elastic2001 is OK: OK - running: The system is fully operational [18:30:06] ACKNOWLEDGEMENT - Check systemd state on elastic2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Gehel waiting for fix to daemon [18:30:06] ACKNOWLEDGEMENT - Check systemd state on elastic2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel waiting for fix to daemon [18:30:06] ACKNOWLEDGEMENT - Check systemd state on elastic2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel waiting for fix to daemon [18:30:06] ACKNOWLEDGEMENT - Check systemd state on elastic2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel waiting for fix to daemon [18:30:06] ACKNOWLEDGEMENT - Check systemd state on elastic2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel waiting for fix to daemon [18:30:06] ACKNOWLEDGEMENT - Check systemd state on elastic2007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel waiting for fix to daemon [18:30:33] and recovery spam in a few minutes as well, apologies to all for the noise [18:32:14] RECOVERY - Check systemd state on elastic1023 is OK: OK - running: The system is fully operational [18:32:23] RECOVERY - Check systemd state on elastic1042 is OK: OK - running: The system is fully operational [18:32:23] RECOVERY - Check systemd state on elastic1032 is OK: OK - running: The system is fully operational [18:32:23] RECOVERY - Check systemd state on elastic1050 is OK: OK - running: The system is fully operational [18:32:24] RECOVERY - Check systemd state on elastic2003 is OK: OK - running: The system is fully operational [18:32:24] RECOVERY - Check systemd state on elastic2028 is OK: OK - running: The system is fully operational [18:32:24] RECOVERY - Check systemd state on elastic1030 is OK: OK - running: The system is fully operational [18:32:24] RECOVERY - Check systemd state on elastic2021 is OK: OK - running: The system is fully operational [18:32:25] RECOVERY - Check systemd state on elastic2019 is OK: OK - running: The system is fully operational [18:32:25] RECOVERY - Check systemd state on elastic1051 is OK: OK - running: The system is fully operational [18:32:26] RECOVERY - Check systemd state on elastic2022 is OK: OK - running: The system is fully operational [18:32:26] RECOVERY - Check systemd state on elastic2023 is OK: OK - running: The system is fully operational [18:32:27] RECOVERY - Check systemd state on elastic2035 is OK: OK - running: The system is fully operational [18:32:43] RECOVERY - Check systemd state on elastic1031 is OK: OK - running: The system is fully operational [18:32:43] RECOVERY - Check systemd state on elastic2032 is OK: OK - running: The system is fully operational [18:32:43] RECOVERY - Check systemd state on elastic1046 is OK: OK - running: The system is fully operational [18:32:44] RECOVERY - Check systemd state on elastic1037 is OK: OK - running: The system is fully operational [18:32:44] RECOVERY - Check systemd state on elastic2007 is OK: OK - running: The system is fully operational [18:32:44] RECOVERY - Check systemd state on elastic2009 is OK: OK - running: The system is fully operational [18:32:44] RECOVERY - Check systemd state on elastic2004 is OK: OK - running: The system is fully operational [18:32:45] RECOVERY - Check systemd state on elastic1036 is OK: OK - running: The system is fully operational [18:32:45] RECOVERY - Check systemd state on elastic1049 is OK: OK - running: The system is fully operational [18:32:46] RECOVERY - Check systemd state on elastic2033 is OK: OK - running: The 
system is fully operational [18:32:53] RECOVERY - Check systemd state on elastic2036 is OK: OK - running: The system is fully operational [18:32:53] RECOVERY - Check systemd state on elastic1028 is OK: OK - running: The system is fully operational [18:32:53] RECOVERY - Check systemd state on elastic2016 is OK: OK - running: The system is fully operational [18:32:54] RECOVERY - Check systemd state on elastic2031 is OK: OK - running: The system is fully operational [18:32:54] RECOVERY - Check systemd state on elastic1040 is OK: OK - running: The system is fully operational [18:32:54] RECOVERY - Check systemd state on elastic1034 is OK: OK - running: The system is fully operational [18:33:03] RECOVERY - Check systemd state on elastic2005 is OK: OK - running: The system is fully operational [18:33:03] RECOVERY - Check systemd state on elastic1045 is OK: OK - running: The system is fully operational [18:33:03] RECOVERY - Check systemd state on elastic2024 is OK: OK - running: The system is fully operational [18:33:03] RECOVERY - Check systemd state on elastic2002 is OK: OK - running: The system is fully operational [18:33:03] RECOVERY - Check systemd state on elastic1022 is OK: OK - running: The system is fully operational [18:33:04] RECOVERY - Check systemd state on elastic2012 is OK: OK - running: The system is fully operational [18:33:04] RECOVERY - Check systemd state on elastic2008 is OK: OK - running: The system is fully operational [18:33:04] RECOVERY - Check systemd state on elastic1044 is OK: OK - running: The system is fully operational [18:33:05] RECOVERY - Check systemd state on elastic1052 is OK: OK - running: The system is fully operational [18:33:05] RECOVERY - Check systemd state on elastic2027 is OK: OK - running: The system is fully operational [18:33:06] RECOVERY - Check systemd state on elastic1041 is OK: OK - running: The system is fully operational [18:33:06] RECOVERY - Check systemd state on elastic2020 is OK: OK - running: The system is fully operational [18:33:13] RECOVERY - Check systemd state on elastic2034 is OK: OK - running: The system is fully operational [18:33:13] RECOVERY - Check systemd state on elastic1024 is OK: OK - running: The system is fully operational [18:33:14] RECOVERY - Check systemd state on elastic1027 is OK: OK - running: The system is fully operational [18:33:14] RECOVERY - Check systemd state on elastic1029 is OK: OK - running: The system is fully operational [18:33:14] RECOVERY - Check systemd state on elastic1038 is OK: OK - running: The system is fully operational [18:33:14] RECOVERY - Check systemd state on elastic2030 is OK: OK - running: The system is fully operational [18:33:14] RECOVERY - Check systemd state on elastic2013 is OK: OK - running: The system is fully operational [18:33:15] RECOVERY - Check systemd state on elastic2014 is OK: OK - running: The system is fully operational [18:33:15] RECOVERY - Check systemd state on elastic2010 is OK: OK - running: The system is fully operational [18:33:16] RECOVERY - Check systemd state on elastic1033 is OK: OK - running: The system is fully operational [18:35:13] RECOVERY - Check systemd state on relforge1001 is OK: OK - running: The system is fully operational [18:35:53] RECOVERY - Check systemd state on elastic1043 is OK: OK - running: The system is fully operational [18:40:33] greg-g: What's the recommended way to handle a patch which has been +2d in SWAT but won't be merged for another hour and a half because Zuul is overloaded? 
I can swat it when it merges but that technically falls in the train window. [18:42:03] Niharika: If it's ultra-urgent you can log into Jenkins, kill all the in-flight jobs until your ones run, and then re-trigger the failed merges. [18:42:12] Niharika: But that's… pretty aggressive. [18:42:54] RECOVERY - Check systemd state on elastic2029 is OK: OK - running: The system is fully operational [18:43:11] James_F: It's not urgent. I'm wondering if I'm allowed to swat during train or should I wait until train ends? And what if the swat requester has left by then? [18:43:29] jouncebot: now [18:43:29] For the next 0 hour(s) and 16 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180809T1800) [18:44:13] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0 [18:44:43] RECOVERY - Check systemd state on elastic1019 is OK: OK - running: The system is fully operational [18:44:43] RECOVERY - Check systemd state on elastic2025 is OK: OK - running: The system is fully operational [18:45:54] tgr: It seems like the patch failed in the gate pipeline build: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/451663/ Although I can't figure out the error. [18:46:24] RECOVERY - Check systemd state on elastic2011 is OK: OK - running: The system is fully operational [18:46:43] is morning swat over? [18:48:09] (03PS3) 10BBlack: Use cache_text instead of cache_misc for all misc sites [dns] - 10https://gerrit.wikimedia.org/r/451659 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [18:48:31] Niharika: just force it through maybe? it's a rather harmless patch [18:48:53] (03CR) 10BBlack: [C: 032] Use cache_text instead of cache_misc for all misc sites [dns] - 10https://gerrit.wikimedia.org/r/451659 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [18:49:23] !log remaining cache_misc sites will move to cache_text shortly - https://gerrit.wikimedia.org/r/451659 [18:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:56] tgr: I'll appreciate another set of eyes on it. James_F could you look over https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/451663/ once? [18:50:03] RECOVERY - Check systemd state on elastic2026 is OK: OK - running: The system is fully operational [18:50:15] bblack: It's not over yet. 
[18:50:32] ok [18:51:44] RECOVERY - Check systemd state on elastic1039 is OK: OK - running: The system is fully operational [18:52:20] (03PS2) 10Dzahn: squid3: use separate files for logrotate snippets, squid vs squid3 [puppet] - 10https://gerrit.wikimedia.org/r/451671 [18:53:03] filed the npm issue as https://phabricator.wikimedia.org/T201638 [18:55:37] (03CR) 10Niharika29: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451666 (https://phabricator.wikimedia.org/T201305) (owner: 10Gergő Tisza) [18:57:06] (03CR) 10Dzahn: [C: 032] squid3: use separate files for logrotate snippets, squid vs squid3 [puppet] - 10https://gerrit.wikimedia.org/r/451671 (owner: 10Dzahn) [18:58:11] !log niharika29@deploy1001 Synchronized php-1.32.0-wmf.16/includes/jobqueue/jobs/ThumbnailRenderJob.php: Add temporary debug logging T201305 (duration: 01m 11s) [18:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:18] T201305: ThumbnailRender jobs not including the width in the fetch URL - https://phabricator.wikimedia.org/T201305 [18:58:37] (03PS2) 10Niharika29: Add temporary debug logging to ThumbnailRenderJob [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451666 (https://phabricator.wikimedia.org/T201305) (owner: 10Gergő Tisza) [18:58:47] (03CR) 10Niharika29: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451666 (https://phabricator.wikimedia.org/T201305) (owner: 10Gergő Tisza) [19:00:05] twentyafterfour: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Americas version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180809T1900). [19:00:15] (03Merged) 10jenkins-bot: Add temporary debug logging to ThumbnailRenderJob [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451666 (https://phabricator.wikimedia.org/T201305) (owner: 10Gergő Tisza) [19:00:29] (03CR) 10jenkins-bot: Add temporary debug logging to ThumbnailRenderJob [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451666 (https://phabricator.wikimedia.org/T201305) (owner: 10Gergő Tisza) [19:00:53] (03CR) 10Dzahn: [C: 032] "affects url-downloader hosts (alsafi, alcyone,...) compiler noop, doesn't show changes to files http://puppet-compiler.wmflabs.org/1203" [puppet] - 10https://gerrit.wikimedia.org/r/451671 (owner: 10Dzahn) [19:01:32] (03CR) 10Dzahn: [C: 032] "followed up with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/451671/" [puppet] - 10https://gerrit.wikimedia.org/r/450486 (owner: 10Alex Monk) [19:02:18] !log niharika29@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add temporary debug logging to ThumbnailRenderJob T201305 (duration: 00m 57s) [19:02:19] tgr: Done. [19:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:26] SWAT over. [19:02:36] !log logstash1008:~# systemctl restart logstash T200960 [19:02:39] thanks Niharika! [19:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:43] T200960: Logstash packet loss - https://phabricator.wikimedia.org/T200960 [19:05:44] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.02663 https://grafana.wikimedia.org/dashboard/db/logstash [19:09:38] !log Preparing to deploy 1.32.0-wmf.16 to group2 wikis. 
[19:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:39] (03PS1) 1020after4: all wikis to 1.32.0-wmf.16 refs T191062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451693 [19:10:47] (03CR) 1020after4: [C: 032] all wikis to 1.32.0-wmf.16 refs T191062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451693 (owner: 1020after4) [19:11:22] Niharika: sorry, was in an interview, did you get an answer? [19:12:06] Niharika: ah, well, in this case just let twentyafterfour know that something is still in the pipeline, we can be semi-flexible on our start time [19:12:10] brb, lunch duty [19:12:36] (03Merged) 10jenkins-bot: all wikis to 1.32.0-wmf.16 refs T191062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451693 (owner: 1020after4) [19:15:16] (03CR) 10jenkins-bot: all wikis to 1.32.0-wmf.16 refs T191062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451693 (owner: 1020after4) [19:15:25] Niharika: if you ever need to SWAT during the train window definitely just ask and it shouldn't be a problem. There isn't necessarily a conflict we just don't want to step on eachothers' toes [19:16:24] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.16 refs T191062 [19:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:31] T191062: 1.32.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T191062 [19:20:24] (03PS1) 10BBlack: Revert TTLs back to 600 for misc->text moves [dns] - 10https://gerrit.wikimedia.org/r/451695 (https://phabricator.wikimedia.org/T164609) [19:21:57] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10ayounsi) Switch ports descriptions updated. [19:31:18] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Logstash packet loss - https://phabricator.wikimedia.org/T200960 (10herron) Looking at https://grafana.wikimedia.org/dashboard/db/logstash-herron-wip timing indeed does seem to line up between UDP loss on logstash1008 and GC activity on the same instanc... 
[19:33:34] (03CR) 10BBlack: [C: 032] Revert "cp1080: remove from conftool/hieradata lists" [puppet] - 10https://gerrit.wikimedia.org/r/451678 (https://phabricator.wikimedia.org/T201174) (owner: 10BBlack) [19:33:43] (03PS2) 10BBlack: Revert "cp1080: remove from conftool/hieradata lists" [puppet] - 10https://gerrit.wikimedia.org/r/451678 (https://phabricator.wikimedia.org/T201174) [19:33:47] (03CR) 10BBlack: [V: 032 C: 032] Revert "cp1080: remove from conftool/hieradata lists" [puppet] - 10https://gerrit.wikimedia.org/r/451678 (https://phabricator.wikimedia.org/T201174) (owner: 10BBlack) [19:34:46] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission mw2017 - https://phabricator.wikimedia.org/T187467 (10Papaul) 05Open>03Resolved [19:38:48] (03PS1) 10Herron: admin: update jmorgan ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/451697 (https://phabricator.wikimedia.org/T201185) [19:41:02] (03CR) 10Herron: [C: 032] admin: update jmorgan ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/451697 (https://phabricator.wikimedia.org/T201185) (owner: 10Herron) [19:42:23] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1080_v4, cp1080_v6 [19:43:43] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1080_v4, cp1080_v6 [19:46:14] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1080_v4, cp1080_v6 [19:46:35] bleh [19:46:39] there will be several of those [19:47:15] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1080_v4, cp1080_v6 [19:47:37] (03PS1) 10Zhuyifei1999: [WIP] Quarry: Move the install into a venv and upgrade to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/451698 (https://phabricator.wikimedia.org/T192698) [19:48:17] Thanks greg-g and twentyafterfour. [19:49:24] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1080_v4, cp1080_v6 [19:50:23] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 36 ESP OK [19:50:23] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 36 ESP OK [19:50:33] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 36 ESP OK [19:50:33] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 36 ESP OK [19:50:53] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 64 ESP OK [19:51:23] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [19:51:44] PROBLEM - Check systemd state on elastic2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:51:53] PROBLEM - Check systemd state on elastic2035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:03] PROBLEM - Check systemd state on elastic2025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:03] PROBLEM - Check systemd state on elastic2032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:04] PROBLEM - Check systemd state on elastic2007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:04] PROBLEM - Check systemd state on elastic2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:04] PROBLEM - Check systemd state on elastic2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[19:52:13] PROBLEM - Check systemd state on elastic2033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:13] PROBLEM - Check systemd state on elastic2029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:14] elastic isn't me! [19:52:14] PROBLEM - Check systemd state on elastic2026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:14] ^ thats me again testing new service...checking [19:52:14] PROBLEM - Check systemd state on elastic2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:24] PROBLEM - Check systemd state on elastic2024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:24] PROBLEM - Check systemd state on elastic2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:24] PROBLEM - Check systemd state on elastic2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:33] PROBLEM - Check systemd state on elastic2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:33] PROBLEM - Check systemd state on elastic2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:33] PROBLEM - Check systemd state on elastic2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:33] PROBLEM - Check systemd state on elastic2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:34] PROBLEM - Check systemd state on elastic2027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:34] PROBLEM - Check systemd state on elastic2034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:35] PROBLEM - Check systemd state on elastic2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:43] PROBLEM - Check systemd state on elastic2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:44] PROBLEM - Check systemd state on elastic2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:44] PROBLEM - Check systemd state on elastic2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:44] PROBLEM - Check systemd state on elastic2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:44] PROBLEM - Check systemd state on elastic2010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:44] PROBLEM - Check systemd state on elastic2011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:53] PROBLEM - Check systemd state on elastic2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:53] PROBLEM - Check systemd state on elastic2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:54] PROBLEM - Check systemd state on elastic2028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[19:52:54] PROBLEM - Check systemd state on elastic2023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:54] PROBLEM - Check systemd state on elastic2022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:54] PROBLEM - Check systemd state on elastic2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:52:59] interesting only codfw didn't like it [19:53:08] ssh-keygen -f "/home/ebernhardson/.ssh/known_hosts" -R "elastic2006.codfw.wmnet" [19:53:23] PROBLEM - Check systemd state on elastic2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:53:54] sigh, mirrormaker re-encoded the snappy and snappy isn't installed [19:54:01] s/the/to/ [19:54:14] PROBLEM - Check systemd state on elastic2036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:00:54] (03PS15) 10Alex Monk: cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 [20:04:20] (03PS1) 10EBernhardson: Install libsnappy to mjolnir daemon hosts [puppet] - 10https://gerrit.wikimedia.org/r/451700 [20:06:13] (03PS2) 10Zhuyifei1999: [WIP] Quarry: Move the install into a venv and upgrade to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/451698 (https://phabricator.wikimedia.org/T192698) [20:06:38] (03PS1) 10BBlack: remove numa_networking: on for pending reboots [puppet] - 10https://gerrit.wikimedia.org/r/451701 [20:06:48] (03CR) 10Ottomata: [C: 032] Install libsnappy to mjolnir daemon hosts [puppet] - 10https://gerrit.wikimedia.org/r/451700 (owner: 10EBernhardson) [20:06:57] (03CR) 10BBlack: [V: 032 C: 032] remove numa_networking: on for pending reboots [puppet] - 10https://gerrit.wikimedia.org/r/451701 (owner: 10BBlack) [20:07:10] that was fast [20:07:18] (03PS2) 10BBlack: remove numa_networking: on for pending reboots [puppet] - 10https://gerrit.wikimedia.org/r/451701 [20:07:21] (03CR) 10BBlack: [V: 032 C: 032] remove numa_networking: on for pending reboots [puppet] - 10https://gerrit.wikimedia.org/r/451701 (owner: 10BBlack) [20:07:59] waiting on 2x changes [20:08:12] aye just pinged you in mwsec [20:08:14] can all go? [20:08:16] i'm ok to merge [20:08:21] herron's looks harmless [20:08:21] so yes [20:08:23] merging all [20:08:24] plz merge thanks [20:08:58] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [20:08:59] RECOVERY - Check systemd state on elastic2008 is OK: OK - running: The system is fully operational [20:09:09] PROBLEM - Check systemd state on elastic2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:09:54] thanks! 
[20:10:08] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 36 ESP OK [20:10:09] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 36 ESP OK [20:10:09] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 64 ESP OK [20:10:18] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 36 ESP OK [20:10:19] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 36 ESP OK [20:10:28] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 64 ESP OK [20:10:29] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 64 ESP OK [20:10:29] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 64 ESP OK [20:10:38] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 36 ESP OK [20:10:39] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 64 ESP OK [20:10:48] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 64 ESP OK [20:10:49] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 32 ESP OK [20:10:59] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 32 ESP OK [20:11:09] RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 32 ESP OK [20:11:09] PROBLEM - puppet last run on elastic1032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[libsnappy1v5] [20:11:48] sigh ... i should have remembered we are in the middle of migration between jessie and stretch, so the package name has to vary i bet... [20:11:59] PROBLEM - Check systemd state on elastic2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:12:36] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Jmorgan production ssh revokation/replacement (due to key in use in production and cloud) - https://phabricator.wikimedia.org/T201185 (10herron) 05Open>03Resolved @Capt_Swing the updated ssh keys have been added and will be deployed automatically a... [20:12:37] ebernhardson: seems libsnappy1 on jessie 1.1.2-3 [20:13:09] RECOVERY - Check systemd state on elastic2006 is OK: OK - running: The system is fully operational [20:13:28] RECOVERY - Check systemd state on elastic2025 is OK: OK - running: The system is fully operational [20:14:39] PROBLEM - puppet last run on elastic2034 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[libsnappy1v5] [20:14:59] PROBLEM - puppet last run on elastic1037 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[libsnappy1v5] [20:15:09] PROBLEM - puppet last run on elastic1034 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[libsnappy1v5] [20:16:08] PROBLEM - puppet last run on elastic1023 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[libsnappy1v5] [20:16:10] (03PS1) 10BBlack: Revert "remove numa_networking: on for pending reboots" [puppet] - 10https://gerrit.wikimedia.org/r/451708 [20:16:27] (03CR) 10BBlack: [V: 032 C: 032] Revert "remove numa_networking: on for pending reboots" [puppet] - 10https://gerrit.wikimedia.org/r/451708 (owner: 10BBlack) [20:16:28] 10Operations, 10LDAP-Access-Requests: Add Lea Voget (WMDE) & Bmueller to the WMDE LDAP group - https://phabricator.wikimedia.org/T199967 (10herron) 05Open>03stalled [20:16:29] PROBLEM - Check systemd state on elastic2025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
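As ebernhardson and moritzm note above, the snappy runtime library is packaged as libsnappy1 on jessie but libsnappy1v5 on stretch, so during the migration the Puppet resource has to pick the name per release. A minimal sketch of that pattern (illustrative only, not the actual content of the patch that follows):

  $libsnappy = $facts['os']['distro']['codename'] ? {
      'jessie' => 'libsnappy1',
      default  => 'libsnappy1v5',
  }
  package { $libsnappy:
      ensure => present,
  }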
[20:16:39] PROBLEM - puppet last run on elastic1042 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[libsnappy1v5] [20:16:48] PROBLEM - puppet last run on elastic2031 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[libsnappy1v5] [20:18:39] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp1080.eqiad.wmnet [20:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:59] RECOVERY - Check systemd state on elastic2005 is OK: OK - running: The system is fully operational [20:19:04] (03PS1) 10Ottomata: mjolnir - Vary libsnappy package on debian version [puppet] - 10https://gerrit.wikimedia.org/r/451709 [20:19:11] ebernhardson: ^ [20:19:18] PROBLEM - puppet last run on elastic1044 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[libsnappy1v5] [20:19:19] PROBLEM - puppet last run on elastic2023 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[libsnappy1v5] [20:19:28] RECOVERY - Check systemd state on elastic2022 is OK: OK - running: The system is fully operational [20:19:36] (03CR) 10EBernhardson: [C: 031] "matches package names i see in jessie/stretch" [puppet] - 10https://gerrit.wikimedia.org/r/451709 (owner: 10Ottomata) [20:19:49] RECOVERY - Check systemd state on elastic2016 is OK: OK - running: The system is fully operational [20:20:00] (03CR) 10Ottomata: [C: 032] mjolnir - Vary libsnappy package on debian version [puppet] - 10https://gerrit.wikimedia.org/r/451709 (owner: 10Ottomata) [20:20:07] (03PS2) 10Ottomata: mjolnir - Vary libsnappy package on debian version [puppet] - 10https://gerrit.wikimedia.org/r/451709 [20:20:08] PROBLEM - puppet last run on elastic1049 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[libsnappy1v5] [20:20:09] (03CR) 10Ottomata: [V: 032 C: 032] mjolnir - Vary libsnappy package on debian version [puppet] - 10https://gerrit.wikimedia.org/r/451709 (owner: 10Ottomata) [20:20:29] PROBLEM - puppet last run on elastic1038 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[libsnappy1v5] [20:20:38] RECOVERY - Check systemd state on elastic2032 is OK: OK - running: The system is fully operational [20:20:39] RECOVERY - Check systemd state on elastic2009 is OK: OK - running: The system is fully operational [20:20:39] RECOVERY - Check systemd state on elastic2004 is OK: OK - running: The system is fully operational [20:20:49] RECOVERY - Check systemd state on elastic2029 is OK: OK - running: The system is fully operational [20:21:07] ottomata: i want to merge this on eventbus (included on kafka) it looks like i'm touching firewall, but it's "nothing".. 
just about code refactor: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/450317/1/modules/role/manifests/eventbus/eventbus.pp [20:21:48] RECOVERY - Check systemd state on elastic2036 is OK: OK - running: The system is fully operational [20:22:09] RECOVERY - Check systemd state on elastic2019 is OK: OK - running: The system is fully operational [20:22:32] (03PS1) 10BBlack: varnish-backend-restart: wipe nvme files too [puppet] - 10https://gerrit.wikimedia.org/r/451711 [20:22:49] PROBLEM - puppet last run on elastic1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[libsnappy1v5] [20:23:02] (03CR) 10Ottomata: [C: 031] eventbus: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/450317 (owner: 10Dzahn) [20:23:04] mutante: +1 [20:23:19] thanks :) [20:23:22] I think the main train deployed finished up quickly some time ago, right? [20:23:29] RECOVERY - Check systemd state on elastic2013 is OK: OK - running: The system is fully operational [20:23:29] PROBLEM - puppet last run on elastic2022 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[libsnappy1v5] [20:23:40] (03PS2) 10Dzahn: eventbus: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/450317 [20:23:48] RECOVERY - Check systemd state on elastic2007 is OK: OK - running: The system is fully operational [20:23:52] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@8c6a7f8]: add python snappy library [20:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:18] RECOVERY - puppet last run on elastic1044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:24:19] PROBLEM - puppet last run on elastic1036 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[libsnappy1v5] [20:24:39] RECOVERY - Check systemd state on elastic2021 is OK: OK - running: The system is fully operational [20:24:54] (03CR) 10BBlack: [C: 032] varnish-backend-restart: wipe nvme files too [puppet] - 10https://gerrit.wikimedia.org/r/451711 (owner: 10BBlack) [20:25:09] RECOVERY - Check systemd state on elastic2017 is OK: OK - running: The system is fully operational [20:25:09] RECOVERY - Check systemd state on elastic2020 is OK: OK - running: The system is fully operational [20:25:18] RECOVERY - Check systemd state on elastic2034 is OK: OK - running: The system is fully operational [20:25:18] RECOVERY - Check systemd state on elastic2023 is OK: OK - running: The system is fully operational [20:25:39] RECOVERY - Check systemd state on elastic2035 is OK: OK - running: The system is fully operational [20:25:44] twentyafterfour: train is done with or still ongoing? [20:25:49] RECOVERY - Check systemd state on elastic2033 is OK: OK - running: The system is fully operational [20:25:59] RECOVERY - Check systemd state on elastic2031 is OK: OK - running: The system is fully operational [20:26:01] bblack: done. it was uneventful today [20:26:08] RECOVERY - Check systemd state on elastic2024 is OK: OK - running: The system is fully operational [20:26:18] PROBLEM - puppet last run on elastic1041 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[libsnappy1v5] [20:26:19] PROBLEM - puppet last run on elastic2032 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[libsnappy1v5] [20:26:34] twentyafterfour: ok thanks! [20:26:38] np [20:26:48] PROBLEM - cassandra-a service on restbase1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [20:26:48] RECOVERY - puppet last run on elastic2031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:27:09] RECOVERY - Check systemd state on elastic2002 is OK: OK - running: The system is fully operational [20:27:18] RECOVERY - Check systemd state on elastic2012 is OK: OK - running: The system is fully operational [20:27:18] RECOVERY - Check systemd state on elastic2008 is OK: OK - running: The system is fully operational [20:27:18] RECOVERY - Check systemd state on elastic2027 is OK: OK - running: The system is fully operational [20:27:28] RECOVERY - Check systemd state on elastic2018 is OK: OK - running: The system is fully operational [20:27:28] RECOVERY - Check systemd state on elastic2001 is OK: OK - running: The system is fully operational [20:27:29] RECOVERY - Check systemd state on elastic2030 is OK: OK - running: The system is fully operational [20:27:38] RECOVERY - Check systemd state on elastic2010 is OK: OK - running: The system is fully operational [20:27:38] RECOVERY - Check systemd state on elastic2011 is OK: OK - running: The system is fully operational [20:27:48] RECOVERY - Check systemd state on elastic2025 is OK: OK - running: The system is fully operational [20:27:49] RECOVERY - puppet last run on elastic1030 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [20:28:08] RECOVERY - Check systemd state on elastic2028 is OK: OK - running: The system is fully operational [20:28:08] RECOVERY - Check systemd state on elastic2003 is OK: OK - running: The system is fully operational [20:28:28] RECOVERY - puppet last run on elastic2022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:29:19] RECOVERY - puppet last run on elastic1036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:29:19] RECOVERY - puppet last run on elastic2023 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:29:38] RECOVERY - puppet last run on elastic2034 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:30:09] RECOVERY - puppet last run on elastic1034 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:30:49] RECOVERY - cassandra-a service on restbase1016 is OK: OK - cassandra-a is active [20:31:09] RECOVERY - puppet last run on elastic1041 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [20:31:09] RECOVERY - puppet last run on elastic1032 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [20:31:18] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@8c6a7f8]: add python snappy library (duration: 07m 25s) [20:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:36] so! 
if nobody has anything else pressing [20:31:53] jouncebot: next [20:31:53] In 2 hour(s) and 28 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180809T2300) [20:32:38] we have some network changes to make in eqiad to try to correct the issues that cropped up yesterday (not a complete fix, but at least a saner state to sail through the weekend) [20:32:38] RECOVERY - Check systemd state on elastic2015 is OK: OK - running: The system is fully operational [20:32:39] RECOVERY - Check systemd state on elastic2014 is OK: OK - running: The system is fully operational [20:33:06] if nothing else is critical, probably best just to stop all other deploys/commits/work while this happens to make failures clearer and more-isolated [20:33:08] RECOVERY - Check systemd state on elastic2026 is OK: OK - running: The system is fully operational [20:34:08] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:35:17] since those k8s latencies have been on and off for a while, I'm going to assume we can continue ignoring them [20:35:33] (03CR) 10Dzahn: [C: 032] eventbus: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/450317 (owner: 10Dzahn) [20:35:40] (03PS3) 10Dzahn: eventbus: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/450317 [20:36:08] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:36:39] (03PS1) 10Mforns: Add cron job to create and rotate EventLogging salts [puppet] - 10https://gerrit.wikimedia.org/r/451780 (https://phabricator.wikimedia.org/T199899) [20:37:09] RECOVERY - Check Varnish expiry mailbox lag on cp1076 is OK: OK: expiry mailbox lag is 0 [20:40:32] https://phabricator.wikimedia.org/T201145#4492126 [20:40:37] XioNoX: we can just coordinate here for better vis [20:40:47] so going to start with disabling the links, watching logs, alerts, multicast, then enabling the other links and monitoring the same [20:41:00] ok [20:41:34] log the timestamps too pls so we can confirm which things cause which (if any!) later [20:41:39] RECOVERY - puppet last run on elastic1042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:41:39] yeah [20:41:53] the main question is, if it gets unstable during the transition, should we try to finish it all, or rollback directly? 
[20:43:01] I think if there looks to be minor problems after the 2x disables, we can try to push through with the enables for fixing [20:43:07] rgr [20:43:07] that would be my best guess anyways [20:43:40] yeah [20:44:12] !log disable fpc3-fpc5 and fpc5-fpc7 - T201145 [20:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:19] T201145: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 [20:45:08] RECOVERY - puppet last run on elastic1049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:45:08] RECOVERY - puppet last run on elastic1037 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:45:42] So far logs look alright [20:45:51] nothing crazy elsewhere that I see, yet [20:46:09] RECOVERY - puppet last run on elastic1023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:46:21] (03Abandoned) 10MarcoAurelio: Increase password policies for 'steward' to max [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440834 (https://phabricator.wikimedia.org/T197577) (owner: 10MarcoAurelio) [20:49:07] so start with bringing up 1-3? [20:49:23] yep [20:49:57] !log enable fpc1-fpc3 - T201145 [20:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:04] T201145: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 [20:50:29] RECOVERY - puppet last run on elastic1038 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [20:50:49] fyi, Telia is repairing our wavelength fiber somewhere on the East Coast right now [20:51:19] RECOVERY - puppet last run on elastic2032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:51:20] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (watching), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10Krenair) Hopefully this is the right place for my questions (sorry if not): So I'd like to get rid of th... [20:51:22] logs still quiet [20:52:01] yup! [20:52:03] mutante: yeah, the backup eqiad-codfw is down again, no big deal so far [20:52:52] let's aim for 5 minutes between changes as long as things stay stable. it's just long enough that if we have to go back retroactively, it's easy to tell effects apart [20:52:57] so :54? [20:54:01] sounds good! [20:54:55] !log enable fpc3-fpc4 - T201145 [20:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:39] done and confirmed up [20:56:14] stashbot is getting lazy [20:56:15] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. [20:56:51] er, rancid is going to make some noise in the logs between :00 :05 [20:57:11] ok [20:57:14] bblack: I see FPC3 disconnect/reconnect logs [20:57:20] (empty FPC) [20:57:49] repeated you mean, or expected ones from the link changes? [20:58:55] only saw it once so far, but it was after the link change logs, not expected [20:59:40] ok [20:59:50] Aug 9 20:56:01 asw2-a-eqiad fpc3 BULKGET: Master socket closed [20:59:52] ^ that stuff? [20:59:54] yeah [21:00:01] (03CR) 10Alex Monk: "The hosts have been shut off for a week, let's go?" 
[puppet] - 10https://gerrit.wikimedia.org/r/450079 (https://phabricator.wikimedia.org/T192561) (owner: 10Thcipriani) [21:00:16] I'd say keep trucking for now, things still seem sane [21:00:31] yeah, actually that's the same ones that have been showing up every ~10min [21:01:14] !log enable fpc5-fpc6 - T201145 [21:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:21] T201145: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 [21:01:43] and confirmed link up [21:02:09] manually seeing if the show virtual-chassis link lists match what we expect heh [21:03:29] there's an fpc4-fpc6 connection that's on the old diagram and not the new, maybe just left out by accident [21:03:32] it's live [21:04:08] er, yeah [21:04:36] will fait for :06 to disable [21:04:39] wait* [21:04:41] Message from syslogd@asw2-a-eqiad at Aug 9 21:04:13 ... [21:04:41] asw2-a-eqiad fpc4 CMLC: Going disconnected; Routing engine chassis socket closed abruptly [21:04:47] er asw2-a-eqiad fpc4 CMLC: Going disconnected; Routing engine chassis socket closed abruptly [21:05:12] I can disable fpc4-fpc6 now [21:05:28] PROBLEM - Host cp1075 is DOWN: PING CRITICAL - Packet loss = 100% [21:05:39] PROBLEM - Host cp1076 is DOWN: PING CRITICAL - Packet loss = 100% [21:05:44] ok, so disabling fpc4-fpc6 is what brings us all the way to codfw's config? [21:05:49] (was just missing in cmds list?) [21:06:00] obviously, we have some row A issues now, but let's try it if that's all that's left [21:06:01] correct [21:06:14] !log disable fpc4-fpc6 - T201145 [21:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:36] done [21:06:58] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-syslog-tcp_10514: Servers logstash1007.eqiad.wmnet are marked down but pooled: ores_8081: Servers ores1001.eqiad.wmnet are marked down but pooled: logstash-log4j_4560: Servers logstash1007.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1003.eqiad.wmnet are marked down but pooled: logstash-json-tcp_11514: Servers logs [21:06:58] t are marked down but pooled: kibana_80: Servers logstash1007.eqiad.wmnet are marked down but pooled [21:06:59] somewhere around :04/:05 there, there was a 503 spike, presumably from cp107[56] going offline, or related [21:08:01] they may have crashed ethernet drivers like before, not sure yet, will look slightly later [21:08:17] only those two alerting so far? [21:08:29] yeah, and they're in the same rack/phy-switch, but different clusters [21:09:00] there's some wdqs alert too, but I can't be sure it's not a secondary fallout yet [21:09:27] kibana and ores too [21:09:34] the logs are relatively quiet, only those two that have been there since before: [21:09:34] Aug 9 21:08:36 asw2-a-eqiad /kernel: tcp_timer_keep: Dropping socket connection due to keepalivetimer expiration, idle/intvl/cnt: 1000/1000/50 [21:09:34] Aug 9 21:08:36 asw2-a-eqiad /kernel: tcp_timer_keep: fport/lport: 58316/7000 faddr/laddr: 128.0.0.19/128.0.0.1 [21:09:54] PROBLEM - LVS HTTP IPv4 on kibana.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:10:17] multicast is quiet [21:10:25] this just paged [21:10:45] I can ping it from bast [21:10:58] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 30 connecting: cp1075_v4, cp1075_v6 [21:11:01] but curl hangs [21:11:04] most likely kibana/ores/etc .svc. 
alerting is all related to lvs1016 issues [21:11:08] PROBLEM - IPsec on cp5006 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1076_v4, cp1076_v6 [21:11:18] PROBLEM - IPsec on cp4028 is CRITICAL: Strongswan CRITICAL - ok: 30 connecting: cp1075_v4, cp1075_v6 [21:11:18] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1076_v4, cp1076_v6 [21:11:19] PROBLEM - IPsec on cp5010 is CRITICAL: Strongswan CRITICAL - ok: 30 connecting: cp1075_v4, cp1075_v6 [21:11:20] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1003.eqiad.wmnet, ores1001.eqiad.wmnet]) [21:11:28] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1076_v4, cp1076_v6 [21:11:28] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1076_v4, cp1076_v6 [21:11:28] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1076_v4, cp1076_v6 [21:11:28] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 30 connecting: cp1075_v4, cp1075_v6 [21:11:29] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1076_v4, cp1076_v6 [21:11:38] PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1076_v4, cp1076_v6 [21:11:38] PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1076_v4, cp1076_v6 [21:11:38] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1076_v4, cp1076_v6 [21:11:39] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1076_v4, cp1076_v6 [21:11:39] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1076_v4, cp1076_v6 [21:11:39] PROBLEM - IPsec on cp5011 is CRITICAL: Strongswan CRITICAL - ok: 30 connecting: cp1075_v4, cp1075_v6 [21:11:48] PROBLEM - IPsec on cp4032 is CRITICAL: Strongswan CRITICAL - ok: 30 connecting: cp1075_v4, cp1075_v6 [21:11:48] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1076_v4, cp1076_v6 [21:11:48] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1076_v4, cp1076_v6 [21:11:48] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1076_v4, cp1076_v6 [21:11:54] RECOVERY - LVS HTTP IPv4 on kibana.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 3511 bytes in 0.004 second response time [21:11:54] PROBLEM - IPsec on cp4031 is CRITICAL: Strongswan CRITICAL - ok: 30 connecting: cp1075_v4, cp1075_v6 [21:11:55] PROBLEM - IPsec on cp4029 is CRITICAL: Strongswan CRITICAL - ok: 30 connecting: cp1075_v4, cp1075_v6 [21:11:55] PROBLEM - IPsec on cp4027 is CRITICAL: Strongswan CRITICAL - ok: 30 connecting: cp1075_v4, cp1075_v6 [21:12:09] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1076_v4, cp1076_v6 [21:12:18] PROBLEM - IPsec on cp5008 is CRITICAL: Strongswan CRITICAL - ok: 30 connecting: cp1075_v4, cp1075_v6 [21:12:18] PROBLEM - IPsec on cp5012 is CRITICAL: Strongswan CRITICAL - ok: 30 connecting: cp1075_v4, cp1075_v6 [21:12:19] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1076_v4, cp1076_v6 [21:12:19] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 30 connecting: cp1075_v4, cp1075_v6 [21:12:19] PROBLEM - IPsec on cp5009 is CRITICAL: Strongswan CRITICAL - ok: 30 connecting: cp1075_v4, cp1075_v6 [21:12:20] PROBLEM - IPsec on cp5007 is CRITICAL: Strongswan CRITICAL - ok: 
30 connecting: cp1075_v4, cp1075_v6 [21:12:27] cp1075 still has ethernet link, but still unreachable [21:12:28] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1076_v4, cp1076_v6 [21:12:28] PROBLEM - IPsec on cp4030 is CRITICAL: Strongswan CRITICAL - ok: 30 connecting: cp1075_v4, cp1075_v6 [21:12:28] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1076_v4, cp1076_v6 [21:12:28] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1075_v4, cp1075_v6 [21:12:30] so it's not a driver crash [21:12:38] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1076_v4, cp1076_v6 [21:12:44] shows no lldp neighbor though [21:12:48] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1075_v4, cp1075_v6 [21:12:48] PROBLEM - IPsec on cp5005 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1076_v4, cp1076_v6 [21:12:48] PROBLEM - IPsec on cp5002 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1076_v4, cp1076_v6 [21:12:49] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1075_v4, cp1075_v6 [21:12:49] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1075_v4, cp1075_v6 [21:12:59] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1075_v4, cp1075_v6 [21:13:08] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1075_v4, cp1075_v6 [21:13:08] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1076_v4, cp1076_v6 [21:13:08] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1076_v4, cp1076_v6 [21:13:08] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1075_v4, cp1075_v6 [21:13:10] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1075_v4, cp1075_v6 [21:13:27] looking [21:14:01] bblack: could all those alerts related to cp1075 or not? [21:14:11] all the ipsec are [21:14:14] PROBLEM - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:14:18] or it has to be several hosts failing? [21:14:19] I'm not sure about lvs1016/kibana/ores, likely separate [21:14:30] which switch is cp1075+6 on? [21:14:36] or which FPC I mean [21:14:58] I don't even see them in the interface list anymore, whichever FPC it is, it's kinda gone? [21:14:58] bblack: both A4 fpc4 [21:15:11] asw2-a-eqiad fpc4 CMLC: Going disconnected; Routing engine chassis socket closed abruptly ... [21:15:23] right [21:15:24] PROBLEM - LVS HTTP IPv4 on kibana.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:15:29] so FPC4 is borked right now [21:15:50] can't even show xe-4/* interface completions on the CLI [21:16:07] bblack: so those 3 hosts are on fpc4: [21:16:07] xe-4/0/7 lvs1016:enp4s0f1 {#3917} [21:16:07] xe-4/0/11 cp1075 [21:16:07] xe-4/0/13 cp1076 [21:16:10] at least cp107[56] is that, possibly the lvs1016/ores/kibana stuff is too [21:16:27] so that's the only real core problem right now [21:16:36] how do we get fpc4 back to sanity and/or what's confusing it? [21:16:56] it's the same issue we had with fpc5, which caused that task to be open [21:17:01] right [21:17:02] bblack: is lvs1016 the active or the passive one? if active, could we switch to the other one? 
[21:17:18] volans: it's active, and we could [21:17:23] RECOVERY - LVS HTTP IPv4 on kibana.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 3657 bytes in 0.004 second response time [21:17:38] I don't think we ceck for interfaces health other than the primary one [21:17:43] check* [21:17:48] volans: 1006 would take over, if you stop pybal on 1016 [21:17:58] but it just recovered maybe [21:18:31] XioNoX: so the last step we took, was the extra undocumented one killing the fpc4-6 link to bring it into full sync with codfw layout [21:18:40] and nothing was failing until then [21:18:46] we could try turning that one back on [21:18:47] it was failing right before [21:18:50] oh? [21:19:05] right [21:19:42] fpc4 disconnect on 21:04:13, last link disabled at :07 [21:20:02] ok [21:20:04] so we can try to disable the one before [21:20:13] PROBLEM - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.32 and port 80: No route to host [21:20:19] disable fpc5-fpc6 [21:20:42] ok [21:20:44] try it now [21:21:14] RECOVERY - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.020 second response time [21:21:26] I'm going to stop pybal on 1016 to minimize damage [21:21:35] we can keep track of fpc4 state easily enough without it [21:21:45] !log stop pybal on lvs1016 (should move services to lvs1006) [21:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:50] !log disable fpc5-fpc6 - T201145 [21:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:57] T201145: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 [21:22:23] got another one right after: asw2-a-eqiad fpc4 CMLC: Going disconnected; Routing engine chassis socket closed abruptly [21:22:24] RECOVERY - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on 10.2.2.36 port 10514 [21:22:26] yeah [21:22:38] the various .svc.eqiad.wmnet should recover now from the pybal stop, but unrelated to switches [21:22:55] cp107[56] are the canaries for fpc starting to work right [21:23:03] and again [21:23:07] (the clusters can live without those 2x cps for now and be fine) [21:23:30] hey [21:23:52] hi! [21:24:03] * mark reading [21:24:25] basically things went kinda ok until sometime around the last couple of link-enables [21:24:49] then FPC4 started flapping on RE connectivity and we lost the servers hooked up to it (cp107[56], and one of lvs1016's critical ports) [21:24:55] logs are quiet ish, but still no fpc4 [21:24:55] right [21:24:58] PROBLEM - pybal on lvs1016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [21:25:02] the paging fallouts are from the lvs1016 part, I just failed that over to lvs1006 to fixup for now [21:25:04] so is fpc4 flapping now? [21:25:16] mark: it's curently down [21:25:24] it's periodically doing this to switch sessions: [21:25:25] Message from syslogd@asw2-a-eqiad at Aug 9 21:22:13 ... [21:25:25] asw2-a-eqiad fpc4 CMLC: Going disconnected; Routing engine chassis socket closed abruptly [21:25:32] and we can't reach any ports on it [21:25:47] have we tried rebooting that one? 
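Aside on the pybal stop logged above: lvs1016 is the active balancer for these services and, as noted, lvs1006 takes over once pybal on lvs1016 stops. A rough console sketch of that step and the usual sanity checks; the systemd unit name and the route-withdrawal detail are assumptions, not verified from this log.

    lvs1016$ sudo systemctl stop pybal      # assumed unit name; the process is /usr/sbin/pybal per the icinga check
    lvs1016$ pgrep -af /usr/sbin/pybal || echo "pybal is stopped"
    lvs1006$ sudo ipvsadm -Ln | head -20                     # the standby's kernel IPVS table still lists the services
    lvs1006$ sudo journalctl -u pybal --since '15 min ago'   # and its pybal keeps health-checking and pooling backends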
[21:25:51] or any one like it before [21:25:59] just to reset its state in case it helps [21:26:01] I tried rebooting fpc5 previously [21:26:05] ok [21:26:17] (wen it was fpc5 having an issue) [21:26:18] I think right now we're basically-healthy in terms of real services, even with FPC4 dead, so long as I keep lvs1016 pybal disabled [21:26:19] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 0 connections established with conf1001.eqiad.wmnet:2379 (min=42) [21:26:28] I can try to reboot fpc4 [21:26:51] Aug 9 21:26:14 asw2-a-eqiad l2ald[3275]: L2ALD_DUPLICATE_CONNECTIONS: Duplicate connection for peer DPC-4 (Ident: 4). Disconnecting... [21:26:53] XioNoX: if you're going to try that, I'd put the links back in the codfw-like state first [21:26:56] duplicate hm [21:27:08] (as in, undo any undos we did at the end there) [21:27:32] ok [21:27:41] mark: 4 is connected to 3 and 7 [21:27:53] XioNoX: which means just renable fpc5-fpc6 I think? [21:28:10] bblack: correct [21:28:35] I can try to disable fpc4-fpc3 link (leaf to leaf), maybe it's what is confusing the VC [21:30:29] 4-3 was the last thing we did before the first wierd syslogs showed up anyways [21:30:32] worth a shot! [21:31:29] try disable 4-3, see if that makes fpc4 behave. [21:31:41] yeah [21:31:43] !log disable fpc3-fpc4 - T201145 [21:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:49] T201145: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 [21:31:52] done [21:31:59] RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 36 ESP OK [21:31:59] RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 36 ESP OK [21:31:59] RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 36 ESP OK [21:31:59] RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 36 ESP OK [21:31:59] RECOVERY - IPsec on cp5002 is OK: Strongswan OK - 36 ESP OK [21:31:59] RECOVERY - IPsec on cp4032 is OK: Strongswan OK - 32 ESP OK [21:32:00] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 36 ESP OK [21:32:00] RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 36 ESP OK [21:32:01] RECOVERY - IPsec on cp5011 is OK: Strongswan OK - 32 ESP OK [21:32:03] oh look at that [21:32:05] eh [21:32:08] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 36 ESP OK [21:32:08] RECOVERY - IPsec on cp4031 is OK: Strongswan OK - 32 ESP OK [21:32:08] RECOVERY - Host cp1076 is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [21:32:09] RECOVERY - Host cp1075 is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [21:32:09] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 52 ESP OK [21:32:09] RECOVERY - IPsec on cp4029 is OK: Strongswan OK - 32 ESP OK [21:32:09] RECOVERY - IPsec on cp4027 is OK: Strongswan OK - 32 ESP OK [21:32:09] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 52 ESP OK [21:32:10] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 64 ESP OK [21:32:10] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 64 ESP OK [21:32:11] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 52 ESP OK [21:32:11] at least it's quick [21:32:12] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10RobH) [21:32:18] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 32 ESP OK [21:32:18] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 52 ESP OK [21:32:19] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 36 ESP OK [21:32:28] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 64 ESP OK [21:32:29] RECOVERY - IPsec on cp5012 is OK: 
Strongswan OK - 32 ESP OK [21:32:29] RECOVERY - IPsec on cp5008 is OK: Strongswan OK - 32 ESP OK [21:32:29] RECOVERY - IPsec on cp5006 is OK: Strongswan OK - 36 ESP OK [21:32:29] RECOVERY - IPsec on cp4028 is OK: Strongswan OK - 32 ESP OK [21:32:29] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 52 ESP OK [21:32:29] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 32 ESP OK [21:32:32] try turning 5-6 back on, so at least we're only one link off from the codfw config? [21:32:38] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 36 ESP OK [21:32:38] RECOVERY - IPsec on cp4030 is OK: Strongswan OK - 32 ESP OK [21:32:38] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 36 ESP OK [21:32:38] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 36 ESP OK [21:32:40] RECOVERY - IPsec on cp5009 is OK: Strongswan OK - 32 ESP OK [21:32:40] RECOVERY - IPsec on cp5007 is OK: Strongswan OK - 32 ESP OK [21:32:40] RECOVERY - IPsec on cp5010 is OK: Strongswan OK - 32 ESP OK [21:32:48] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 64 ESP OK [21:32:48] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 36 ESP OK [21:32:48] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 36 ESP OK [21:32:48] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 36 ESP OK [21:32:48] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 32 ESP OK [21:32:49] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 36 ESP OK [21:32:49] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 52 ESP OK [21:32:54] bblack: yep [21:32:58] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 52 ESP OK [21:32:58] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 52 ESP OK [21:32:58] RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 36 ESP OK [21:32:58] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 36 ESP OK [21:33:26] not sure if going to be an issue but saw: [21:33:26] Message from syslogd@asw2-a-eqiad at Aug 9 21:31:10 ... [21:33:26] asw2-a-eqiad fpc4 CMLC: Going disconnected; Routing engine chassis socket closed abruptly [21:33:26] Message from syslogd@asw2-a-eqiad at Aug 9 21:32:16 ... [21:33:26] asw2-a-eqiad fpc4 PFEMAN: Shutting down in 5 seconds, PFEMAN Resync aborted! No peer info on reconnect or master rebooted? [21:33:28] PROBLEM - Host cp1076 is DOWN: PING CRITICAL - Packet loss = 100% [21:33:29] PROBLEM - Host cp1075 is DOWN: PING CRITICAL - Packet loss = 100% [21:33:31] fpc4 is still gone [21:33:35] lol [21:33:43] it was back momentarily at least [21:33:45] back now [21:34:06] it rebooted itself? or just restarted some daemons? [21:34:25] Current time: 2018-08-09 21:34:17 UTC [21:34:25] System booted: 2018-02-12 21:12:21 UTC (25w3d 00:21 ago) [21:34:25] Last configured: 2018-08-09 21:33:51 UTC (00:00:26 ago) by root [21:34:25] 9:34PM up 178 days, 22 mins, 0 users, load averages: 0.35, 0.19, 0.08 [21:34:29] that's member 4 [21:34:31] fpc4 [21:34:40] ok [21:34:53] the xe-4 interfaces are back again [21:34:55] fpc4 is back [21:34:57] (03PS1) 10RobH: decom of hydrogen/chromium [puppet] - 10https://gerrit.wikimedia.org/r/451789 (https://phabricator.wikimedia.org/T201522) [21:35:00] yeah, for how long, suspens [21:35:08] RECOVERY - Host cp1075 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [21:35:09] RECOVERY - Host cp1076 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [21:36:23] my current vote would be let's see if it will stabilize for >5m. 
if it does, then let's try turning on fpc5-6 [21:36:49] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10Halfak) I talked to @mark today. Here's what I understood from the conversation: 1. All of the followi... [21:37:17] asw-a-eqiad is only connected to asw2's 2/7 (spine) right? [21:37:31] bblack: correct [21:37:43] (03CR) 10RobH: [C: 032] decom of hydrogen/chromium [puppet] - 10https://gerrit.wikimedia.org/r/451789 (https://phabricator.wikimedia.org/T201522) (owner: 10RobH) [21:38:12] (03PS1) 10RobH: decom of hydrogen and chromium [dns] - 10https://gerrit.wikimedia.org/r/451790 (https://phabricator.wikimedia.org/T201522) [21:38:36] (03CR) 10jerkins-bot: [V: 04-1] decom of hydrogen and chromium [dns] - 10https://gerrit.wikimedia.org/r/451790 (https://phabricator.wikimedia.org/T201522) (owner: 10RobH) [21:38:51] fpc5-fpc6 is the same in term of location as fcp3-fpc4 [21:39:13] (03CR) 10Krinkle: [C: 031] "Yeah, same here. All good :)" [puppet] - 10https://gerrit.wikimedia.org/r/449496 (https://phabricator.wikimedia.org/T200705) (owner: 10Imarlier) [21:39:25] damn thats a lot of refeernces [21:39:45] references? [21:40:05] (03PS2) 10RobH: decom of hydrogen and chromium [dns] - 10https://gerrit.wikimedia.org/r/451790 (https://phabricator.wikimedia.org/T201522) [21:40:08] a cname to ntp [21:40:16] and in wikimedia.org so its aliased to dozens of other zonefiles [21:40:17] heh [21:40:24] XioNoX: and a bunch of others really, it's basically the same as any of the other leaf<->leaf [21:40:26] so i ripped out the A entry of a cname reference [21:40:30] yeah [21:40:37] sorry, i was commenting on my patchset ;] [21:40:46] well, I take that back [21:40:51] well, leaf-leaf between 2 spins [21:40:59] some leaf<->leaf are between leaves on the same spine switch, some are cross two spines [21:40:59] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:41:05] (03CR) 10RobH: [C: 032] decom of hydrogen and chromium [dns] - 10https://gerrit.wikimedia.org/r/451790 (https://phabricator.wikimedia.org/T201522) (owner: 10RobH) [21:41:19] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:41:19] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:41:24] 1-3 and 6-8 are single-spine [21:41:40] 5-6, 3-4, and 1-8 are cross-spine [21:41:41] what is that alert? 
[21:41:53] fpc4 is still up [21:41:53] it may be latent [21:41:56] ok [21:42:09] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [21:42:31] hmmm [21:42:35] we do have some fresh 503 though [21:42:49] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [21:42:55] let's cut that link again [21:42:58] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [21:42:59] and sacrifice fpc4 [21:43:18] the alerts make it sound worse than it is [21:43:28] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [21:43:29] let's wait just a sec, it's not much 503 yet and hard to tell why [21:43:32] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10RobH) [21:43:47] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) This tentative list is great news from my perspective, and I would have an easy time following it... [21:43:51] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10RobH) a:03Cmjohnson [21:44:08] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:44:09] is there a source/dest we can try to see if we have packet loss? [21:44:19] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:44:32] large spike in traffic on cr1-eqiad:ae1 [21:44:39] multicast? :) [21:44:42] no [21:44:45] multicast is quiet [21:45:11] where do you see the spike? [21:45:13] the 503s are here, it wasn't very large or for very long: https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?orgId=1&var-site=All&var-cache_type=All&var-status_type=5&from=now-1h&to=now [21:45:20] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:45:32] it was brief I think [21:45:33] https://librenms.wikimedia.org/device/device=160/tab=port/port=14308/ [21:46:00] ok [21:47:16] so, right now, the connection between fpc1<->fpc8 is the only thing keeping the two spines' sets from being isolated from each other, with 5-6 and 3-4 gone [21:47:38] and is thus also carrying any/all traffic going between the two sides [21:48:16] fpc3 flapped a minute ago [21:48:43] ok, that might be the source of the 503s and/or the traffic spike, if fpc3 hosts were awol briefly [21:48:59] oh, there are no fpc3 hosts heh [21:49:16] (03PS1) 10Jon Harald Søby: Add mobile wordmark for pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451791 (https://phabricator.wikimedia.org/T200152) [21:49:17] wasn't fpc3 empty yesterday? [21:49:33] yeah it's empty now, going by port descr [21:49:39] (fyi my dns change was fixing already broken ntp otherwise im respecting the topic!) [21:49:42] ge-3/0/32 down down cloudservices1004 [21:49:50] ge-3/0/35 down down wmf7433-spare [21:50:09] on asw2-a? [21:50:19] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [21:50:28] (03PS1) 10Jon Harald Søby: Set mobile wordmark for pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451792 (https://phabricator.wikimedia.org/T200152) [21:50:32] yes [21:50:38] oh I do see those now [21:51:09] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [21:52:11] we can try to enable fpc5-fpc6 [21:52:16] I'm trying to come to some kind of reasonable theory [21:52:34] as to why 3<->4 is any different from 5<->6 and/or 1<->8 [21:53:08] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [21:53:48] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [21:53:55] XioNoX: ok try 5-6? [21:54:00] also maybe related or not, but in codfw fpc2 is backup, fpc7 master, in eqiad it's the opposite [21:54:10] not sure if it's possible to fail it over though [21:54:21] it's also pretty symmetrical, the layout [21:54:31] so [21:54:43] why would we continue trying adding more links? [21:54:50] we've already tried the codfw layout and it broke, right? [21:55:40] yeah, it started breaking even before getting to the 100% codfw [21:55:46] yes, but the buts would be: it may be that it could work, but there's some temporary state in the way. and (b) there's only a single 10g link holding the two halves of this switch stack together currently. 
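Aside: after each link disable/enable above, member and link state is confirmed from the switch itself. A short sketch of the read-only Junos virtual-chassis checks involved (standard CLI command names, run in operational mode on asw2-a-eqiad; output shapes vary by release):

    > show virtual-chassis status               # member IDs, roles (master/backup/linecard), presence
    > show virtual-chassis vc-port              # per-member VCP links and their Up/Down state
    > show virtual-chassis vc-port statistics   # per-VCP octet/packet counters (sample output is quoted further down)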
[21:55:47] indeed [21:56:03] 40G, but yeah [21:56:04] 40G [21:56:11] err right [21:56:22] yeah [21:56:45] it seems like rotating that diagram differently would be illuminating (not trying to treat fpc4 as being part of the spine row, and laying out all the lines symmetrically) [21:56:47] even if we can get it to work, i'm not sure what that will tell us [21:57:01] agreed, fpc4 is not a spine [21:57:12] because there's actually a slight asymmetry to it, even when you try to straighten it out [21:59:12] with 5-6 and 3-4 down, it's symetical [21:59:42] well yeah :) [22:00:11] because there's only one link between the halves, and on both sides it's from the same kind of leaf (one that's connceted sideways to another local leaf) [22:01:14] the 3-4 and 5-6 links are unique in that they connect across the halves, but they go from a well-connected leaf on one side (has a sideways connection to another local leaf) to a less-connected leaf on the other (just has connection to its spine) [22:01:59] maybe whether or not the codfw-config works depends on what order the links initially come online in when it's first all hooking up heh [22:02:11] yes, that is making this very scary [22:02:50] anyways, right now we think everything is reachable, and we just have less-than-ideal interswitch link redundancy, right? [22:02:56] yes [22:03:03] for a handful of hosts :P [22:03:10] granted, cp hosts... [22:03:23] all the new caches (which we just migrated eqiad to) are in the new stacks, yeah :) [22:03:47] yeah so one thing i'm considering [22:03:59] maybe we could get a single spare QFX connected to the old stack [22:04:18] no VCF whatsoever, just a single switch [22:04:23] hook up 10G hosts that need to be there [22:04:31] and in the mean time we experiment with the new switch stack [22:04:47] way less risky? [22:04:50] well right now, all the uplinks to the routers are on the new stack too, we'd need to move those back to the old [22:04:58] yes, that's easy though [22:05:06] not all uplinks [22:05:06] just one [22:05:08] at his point I don't know what experimentation we can/should do (labs or not) [22:05:34] XioNoX: migrate to a proper VCF [22:05:41] I will say, the POV I'm coming around to right now, is that link-redundacy is way less important risk-wise than angering the juniper gods [22:05:50] indeed. [22:05:52] yeah [22:05:52] if we can make things more stable removing redundant links, it's good [22:06:09] so it's probably stable enough now for a few days [22:06:14] and we could take a few days to migrate off hosts [22:06:18] and then it's an empty stack [22:06:20] and we can do whatever [22:06:26] yeah [22:06:33] and this is just one row of course [22:06:39] i don't know why it's not an issue for the other(s), or codfw [22:06:41] that's scary to think about [22:06:51] row B is having issues that could be related [22:06:56] almost certainly [22:07:05] why not the others? [22:07:06] row C has 1 less member, so maybe that's why? [22:07:08] different software? [22:07:16] could be [22:07:27] maybe it's because 3*QFX vs 2*QFX [22:07:42] 2x QFX where? [22:07:48] that sounds likely [22:07:55] codfw/eqiad-D [22:08:25] ok [22:08:26] is codfw still my original layout? [22:08:33] oh just row D [22:08:41] so for the immediate term, we're all in agreement just freeze it here for the weekend? [22:08:46] yeah [22:08:53] XioNoX: yeah to what? 
[22:08:56] I'm going to turn lvs1016 pybal back on too [22:09:05] yeah, to freeze [22:09:22] and yeah to I haven't changed codfw layout, so most likely your original [22:09:29] !log re-enabling pybal on lvs1016 [22:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:41] XioNoX: codfw row A looks original [22:09:44] any that changed? [22:09:47] RECOVERY - pybal on lvs1016 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [22:10:07] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy [22:10:36] because if not, i don't understand why we're calling this experiment "the codfw layout" [22:10:36] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal [22:10:37] mark: I based my diagram on codfw-A, haven't look at the other codfw but I would assume they're the same (I haven't change them) [22:10:51] what diagram now? [22:10:59] codfw row A has 2x QFX, so it's already different [22:11:26] mark: this diagram https://phabricator.wikimedia.org/T201145#4492126 [22:11:26] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 42 connections established with conf1001.eqiad.wmnet:2379 (min=42) [22:11:38] ok [22:11:51] yes, the top one looks original [22:11:56] except the QFX [22:11:56] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:11:57] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:11:58] and the disabled links [22:12:06] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [22:12:13] (03PS6) 10Bstorm: labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) [22:12:36] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:12:50] yeah, codfw has 2*QFX and eqiad has 3*QFX, we focused on the link topology hoping that duplicating codfw-A to eqiad-A would solve the issue [22:13:11] hi i just got [22:13:11] Request from 2a00:23c4:ad14:9700:e558:3376:e9bd:2006 via cp1075 cp1075, Varnish XID 564265214 [22:13:12] Error: 503, Backend fetch failed at Thu, 09 Aug 2018 22:13:00 GMT [22:13:21] (03CR) 10jerkins-bot: [V: 04-1] labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) (owner: 10Bstorm) [22:13:25] yeah [22:13:34] I see a few 503s, I don't know if it's transient [22:14:08] Aug 9 22:12:53 asw2-a-eqiad fpc3 BULKGET: Master socket closed [22:14:48] ok, so maybe we're not out of the woods yet [22:15:05] stable-er, but still a bit lossy? [22:15:24] fpc3 doesn't have anything so that might be why we thought it was good [22:15:34] indeed [22:15:37] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:15:46] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [22:15:48] so other than reenabling pybal [22:15:50] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=2&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5&from=now-1h&to=now [22:15:56] we haven't changed anything? [22:16:06] & yeah but the on/off 503 spikes there, are even before turning pybal back on [22:16:22] seems to just be coming and going in a pattern, probably small loss... [22:16:26] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [22:16:43] it's on the order of ~0.1% when it peaks so far [22:16:47] (of reqs, failing) [22:16:50] do you want me to go back and remove any links? [22:17:16] cmjohnson1: we want you to go back and put everything on the old switches again [22:17:19] but not today ;) [22:17:22] cmjohnson1: I don't think so [22:17:24] (hopefully) [22:17:37] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:17:42] okay [22:18:06] that FPC3 disconnect happen every 8min or so [22:18:34] so in our "stable" situation we ended up in last night [22:18:37] we also had fpc3 flapping, right? [22:18:44] but no side effects from that? [22:18:49] correct [22:18:49] slightly different topology though [22:19:30] those small bursts of 503s, they're all tracing on oxygen as coming from cp1075 unable to reach services [22:19:37] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:19:49] bblack: do you have an example of service? [22:19:52] so I think asw2-a fpc4 or whole-stack in general, still has some minor issues [22:19:57] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:20:04] XioNoX: it's non-specific, the problem is cp1075, not the other things [22:20:22] yeah, want to check if it's packet loss, or nothing at all, etc.. 
[22:20:37] or same issue as row B [22:20:57] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:21:01] try cp1075 to appservers.svc [22:21:34] I'm pinging there now, just a simple ping, but not seeing anything wrong [22:21:46] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [22:22:18] there's definitely an on/off pattern to it: [22:22:20] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=2&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5&from=now-1h&to=now [22:22:37] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:22:42] the peaks are 5m apart [22:22:56] :04 :09 :14 :19 [22:23:00] next will be :24 [22:23:59] 21:47:16 [22:23:59] 21:55:48 [22:23:59] 22:04:20 [22:23:59] 22:12:53 [22:23:59] 22:21:25 [22:23:59] those are fpc3 disconnect [22:24:14] so I guess not related [22:24:19] so when fpc3 disconnects, probably it disturbs the rest of the asw2 fabric briefly [22:24:20] well [22:24:25] ? [22:24:29] look at when the 503s start, not when the peak is [22:24:41] well it's so short, it's hard to see that in the graph [22:24:44] so 22:12:53 more or less matches up with 22:13 [22:24:50] yeah [22:25:37] so fpc3 has like nothing on it right [22:25:44] can we cut the link between 1 and 3 [22:25:49] one option would be to eliminate fpc3 [22:25:50] yeah [22:25:55] the first non-zero samples (only 1min res) are all 1-min earlier than I said before [22:26:06] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:26:10] so :03 :08 :13 :18 :23 [22:26:48] the 2 servers on fpc3 are descriptions only. Nothing is setup for them yet. [22:26:57] ok [22:27:21] cmjohnson1: so, we don't want to be adding anything new onto the new switches for now [22:27:26] no new servers onto them [22:27:40] ok [22:27:42] for row A and B at least [22:27:46] if you're going to break 1-3, it seems like 6-8 is basically the same thing [22:27:48] they're unstable [22:28:09] it would get us to all single-connected stuff other than the 1-8 link that bridges the two fpc islands [22:28:27] so what might be different [22:28:34] is that fpc3 was flapping yesterday [22:28:39] but it may have been inconsequential for traffic path [22:28:45] right [22:28:46] it never was on a preferred path [22:28:47] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:28:48] and now it may be? [22:29:11] it could be getting used for some of fpc1's traffic I guess [22:29:39] although i can't imagine why that would be smart, but the switch might do it heh [22:29:49] i don't really see it either but who knows [22:29:53] yeah, doesn't make sens, but at this point... 
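Aside: the five fpc3 disconnect timestamps quoted above recur every 8m32s-8m33s, i.e. a steady roughly 8.5 minute cycle. A throwaway check with GNU date:

    ts=(21:47:16 21:55:48 22:04:20 22:12:53 22:21:25)
    for i in 1 2 3 4; do
      echo $(( $(date -ud "${ts[i]}" +%s) - $(date -ud "${ts[i-1]}" +%s) ))
    done
    # prints: 512 512 513 512   (seconds between consecutive disconnects)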
[22:29:58] fpc1 only has dns1001 [22:30:06] and... fpc8 [22:30:12] right, that [22:30:13] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:30:13] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:30:14] heh [22:30:31] break 3-1 and see? [22:30:38] let's try [22:30:41] I have the command ready [22:30:51] maybe we can only see one problem at a time, and after we break 3-1 we'll see a problem on either 6 or 8 and need to break 6-8 :) [22:30:55] telling users to stop browsing wikipedia every 5 mins seems nonideal anyway ;p [22:31:09] all agreed to break 3-1? [22:31:14] !log disable fpc1-fpc3 - T201145 [22:31:15] yeah [22:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:21] T201145: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 [22:31:33] ok [22:31:35] now we wait... [22:31:52] the 503 rate is pretty tiny, but obviously it will be seen more by logged-in than anons [22:33:33] (03PS7) 10Bstorm: labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) [22:35:02] the last one (503 spike) before the disable was much smaller and truncated [22:35:13] and the final one, the disable was right in the middle of it but it was tiny so far [22:35:31] will take a few mins to see if it goes away completely [22:36:20] unrelated, but logs say xntpd[4859]: NTP Server 208.80.154.50 is Unreachable [22:36:21] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [22:36:28] that's the chromium/hydrogen being decommed [22:36:31] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [22:36:41] I thought we fixed the network devices first [22:36:42] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [22:37:18] bblack: I think I skipped unrelated changes on that switch [22:37:21] ok [22:37:24] (re NTP) [22:37:42] it's not a big deal, we have many NTP configured [22:38:11] looks pretty good so far [22:38:51] so annoying that we don't have graphs of VCF links [22:38:59] (or do we these days?) [22:39:08] internally you mean? 
[22:39:09] would be so helpful to be able to see which ones actually get used, how much [22:39:10] yes [22:39:29] once you configure them as VCP they drop from SNMP [22:39:49] yeah, you can see paths via the command line, but nothing else [22:40:09] `show virtual-chassis vc-path ?` [22:40:34] i think you can get counters on the cli as well [22:41:07] fpc2: [22:41:07] -------------------------------------------------------------------------- [22:41:07] Interface Input Octets/Packets Output Octets/Packets Input Output [22:41:07] Util Util [22:41:07] vcp-255/0/48 7951027096828 / 22634785019 4944434658063 / 21690703873 0 0 [22:41:22] show virtual-chassis vc-port statistics [22:42:23] yeah, it's also exposed via snmp: https://apps.juniper.net/mib-explorer/search.jsp#object=jnxVirtualChassisPortTable&product=Junos%20OS&release=14.1x53-D30 [22:42:23] D30: separate scap3 and mediawiki sections in docs/scripts.rst - https://phabricator.wikimedia.org/D30 [22:42:38] https://phabricator.wikimedia.org/T201097 slightly related [22:42:45] but it's not in librenms [22:42:52] ha, that would be good to have [22:43:19] so aside from our other topology/best-practices issues (re spines-to-all-leafs, and no leaf<->leaf), I was noticing this one in particular: [22:43:40] BEST PRACTICE In a QFX5100 VCF, we recommend using the following QFX5100 switches as spine devices: [...] QFX5100-96S or QFX5100-48S, in deployments where devices are connecting to the VCF using 1-Gbps Ethernet interfaces only on the leaf devices. [22:44:09] well the [...] alternative I shouldn't have left out being: [22:44:10] QFX5100-24Q switches, in deployments where devices are connecting to the VCF using the 10-Gbps Ethernet interfaces on the leaf devices, or using a mix of 10-Gbps and 1-Gbps Ethernet interfaces on the leaf devices. [22:44:13] i also remember that you shouldn't make EX4300s spines in a mix with QFX [22:44:23] right, only QFX can be spines basically [22:44:32] or EX4300s in an all-EX4300 network [22:44:41] but seriously why would that matter in a proper implementation :S [22:44:51] but they're basically saying if we want QFX leaves with 10G ports, the spines should be QFX5100-24Q, not the -48S we're using [22:44:57] I'm wondering if that's just recommendations for performance, and not a blocker [22:45:13] it's not like they're explaining why of course [22:45:19] so this may be making a difference, for the rows that have a third QFX as a 10G leaf [22:45:47] (that was under Spine Devices section in https://www.juniper.net/documentation/en_US/junos/topics/concept/vcf-components.html ) [22:47:12] yeah, at this point the only difference between stable and not is either 2 vs 3 QFX per row (codfw/eqiad-D), or "5 vs. 4 EX" (eqiad C) [22:47:14] so basically: they do say a QFX5100 can be a leaf in a QFX5100-spined VCF. It's just they "recommend" if you're going to have QFX5100 leaves with 10G ports on them, we should have been using the -24Q variant for the spines [22:47:20] but it can also be unrelated [22:47:48] it seems more like that recommendation would just be about having the right bandwidth capacity in various places, though [22:47:52] but you never know [22:48:07] Aug 9 22:47:03 asw2-a-eqiad fpc3 BULKGET: Master socket closed [22:48:08] ... [22:48:15] any spike of 503s?
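Related to the missing VCP graphs above: until jnxVirtualChassisPortTable data is pulled into LibreNMS (T201097), one stopgap is to scrape `show virtual-chassis vc-port statistics` periodically and graph the counters. A minimal sketch that parses the output format pasted above into per-port counters; it assumes the layout stays as in that paste (the fpc2 / vcp-255/0/48 sample is copied from it).

```python
import re

# Sample copied from the "show virtual-chassis vc-port statistics" paste above.
RAW = """
fpc2:
--------------------------------------------------------------------------
Interface  Input Octets/Packets         Output Octets/Packets        Input Output
                                                                     Util  Util
vcp-255/0/48  7951027096828 / 22634785019  4944434658063 / 21690703873  0  0
"""

ROW = re.compile(r"^(vcp-\S+)\s+(\d+)\s*/\s*(\d+)\s+(\d+)\s*/\s*(\d+)\s+(\d+)\s+(\d+)$")

def parse_vc_port_stats(text):
    """Yield one dict of counters per vcp-* row, tagged with its FPC."""
    fpc = None
    for line in text.splitlines():
        if line.startswith("fpc"):
            fpc = line.rstrip(":")
            continue
        m = ROW.match(line.strip())
        if m:
            yield {"fpc": fpc, "port": m.group(1),
                   "in_octets": int(m.group(2)), "in_pkts": int(m.group(3)),
                   "out_octets": int(m.group(4)), "out_pkts": int(m.group(5)),
                   "in_util": int(m.group(6)), "out_util": int(m.group(7))}

for row in parse_vc_port_stats(RAW):
    print(row)
```

Fed into Graphite or Prometheus from a periodic job, this would give rough per-VCP utilisation graphs until the SNMP table is actually collected.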
[22:48:26] nope [22:48:42] maybe wait a min to see them though, if it just started again at :47 [22:48:42] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [22:48:48] maybe fpc3 needs a boot? [22:48:57] well fpc3 was flapping yesterday too [22:49:02] but we had no other issues [22:49:22] and at this point, i'll take it :P [22:49:32] leave it like that if no other issues ? [22:49:37] i would say so [22:49:39] hopefully! [22:49:45] and plan for monday [22:49:52] let's get stuff off of that stack [22:49:59] yeah [22:50:08] and perhaps B as well, dunno [22:50:19] B would be a mess [22:50:36] it may be tricky getting 10G ports for the 16x new eqiad caches, staying in the racks they're in [22:50:46] yes [22:50:51] but only a few are in row A, right? [22:50:52] we can free up most of the old caches' 10G ports, but the old caches weren't distributed evenly in rows like these are [22:50:57] there are 18 servers on asw2-a-eqiad [22:50:59] there's 4x per row [22:51:00] and maybe we can just run a few fibers [22:51:04] 4 fibers, that's nothing [22:51:07] ok [22:51:10] hopefully we have enough [22:51:13] 8 10G + 10 1G [22:51:22] if we wait for monday [22:51:26] so if we rack a QFX in the middle of the row [22:51:31] or something [22:51:37] and then run some temp stuff to it [22:51:37] I can put in the decom for all the other 10G cp10xx that are going away, and it will free up a bunch of ports too for options [22:51:43] it's like 18 ports freeing up [22:51:50] right that works too [22:52:54] so also, we were talking about the longer-term of building this switch stack more like juniper's supported topology and 40G port capacities for all the leaf->spine uplinks [22:53:10] either that [22:53:13] or we can go oldskool VC again [22:53:15] like eqiad was [22:53:19] and one novel option that came to mind, was to only use 40G uplinks for the "extra" leaf QFX that has 10G ports on it, and use 10G uplinks for the EXs [22:53:39] need to study the requirements/limitations of old VC [22:53:39] it may work yeah [22:53:46] XioNoX: rings only [22:54:12] it's not an optimal topology [22:54:13] but otoh [22:54:16] it has served us well since 2011 [22:54:19] is VC still supported, with modern software revs and the QFXes etc? [22:54:21] yes [22:54:23] ok [22:54:28] but you need to reboot the stack for it [22:54:29] :D [22:54:36] which we can do once we migrate stuff off it [22:55:32] since we have only 8 racks/switches it's not too bad [22:55:34] VCF becomes more important with more devices obviously [22:55:53] I wonder now [22:56:49] we could split rows too...
have 2 and 7 still be our uplinks/"spines", but basically just do two smaller VC or VCF of 4 switches (1-2-3-4, 5-6-7-8), and then interconnect the two VC[F]s with a 40G non-VC link [22:56:50] it may actually not be worse [22:57:09] and avoid having to construct an 8-switch topology at all [22:57:25] but I guess VCF at least requires 2x QFXs [22:57:31] maybe VC wouldn't [22:57:49] and I don't know if 2 smaller rings with an interconnect between their QFXs would actually be better than an 8-member ring [22:58:57] we need to list all the options and think about them properly, especially if we need to retrofit other rows [22:59:05] yeah [22:59:28] asw2-a can be a proper playground [22:59:36] and hope row B doesn't blow up [22:59:37] we could try going 90s style and turn off all the magic software, connect everything to everything and run spanning tree :) [22:59:46] (not serious!) [22:59:47] it's the one I'm most worried about [22:59:59] VC is actually not worse [23:00:00] https://drive.google.com/drive/u/0/folders/0By9f9UqxCyCQWkNVRF9SVzR4ZlU [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180809T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:00:17] jouncebot is wrong, I have a patch (actually two) [23:00:31] worst case is 2 hops away from an uplink [23:00:40] deploy or no? ^ [23:00:48] greg-g: you're clear I think [23:00:58] thanks, wasn't watching scrollback closely [23:01:36] mark: yeah, it means the QFX in the middle has to go through EX switches though [23:01:44] not a blocker, but not ideal [23:01:51] PROBLEM - Disk space on elastic1026 is CRITICAL: DISK CRITICAL - free space: /srv 51047 MB (10% inode=99%) [23:01:54] they do 40G... [23:02:28] but this is why eqiad has all the 10G switches on one end of the row, right :) [23:02:42] in the old stacks [23:02:44] yeah :) [23:02:45] what was it, row D? [23:03:04] I have to step out, have my laptop/phone with me, back in 30min or so [23:03:08] thanks for the help! [23:03:11] i am going to bed [23:03:18] see you tomorrow :) [23:03:22] (hopefully :) [23:03:25] good night! [23:04:22] MaxSem, are you around? you were awake this time yesterday :) [23:05:34] also in a meeting, but can try multitasking [23:06:32] (03CR) 10MaxSem: [C: 032] Add mobile wordmark for pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451791 (https://phabricator.wikimedia.org/T200152) (owner: 10Jon Harald Søby) [23:06:34] MaxSem, thx.
:) it's a very small one [23:07:57] (03Merged) 10jenkins-bot: Add mobile wordmark for pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451791 (https://phabricator.wikimedia.org/T200152) (owner: 10Jon Harald Søby) [23:09:30] (03CR) 10MaxSem: [C: 032] Set mobile wordmark for pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451792 (https://phabricator.wikimedia.org/T200152) (owner: 10Jon Harald Søby) [23:10:12] PROBLEM - cassandra-a service on restbase1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [23:10:22] PROBLEM - cassandra-a SSL 10.64.0.32:7001 on restbase1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [23:10:52] RECOVERY - Disk space on elastic1026 is OK: DISK OK [23:10:52] (03Merged) 10jenkins-bot: Set mobile wordmark for pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451792 (https://phabricator.wikimedia.org/T200152) (owner: 10Jon Harald Søby) [23:12:45] !log maxsem@deploy1001 Synchronized static/images/mobile/copyright/wikivoyage-wordmark-ps.svg: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/451791/ (duration: 00m 52s) [23:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:21] RECOVERY - cassandra-a service on restbase1016 is OK: OK - cassandra-a is active [23:13:43] Jhs: pulled on mwdebug1002, please test [23:14:31] RECOVERY - cassandra-a SSL 10.64.0.32:7001 on restbase1016 is OK: SSL OK - Certificate restbase1016-a valid until 2020-06-24 13:01:14 +0000 (expires in 684 days) [23:15:06] checking [23:16:23] MaxSem, don't see the intended effect (change logo on https://ps.m.wikivoyage.org/ ) [23:16:47] MaxSem, BUT i don't think the config is wrong (so no need to revert), I just think there's something funky that I've overlooked with that Minerva skin thing [23:17:50] it seems to use "wikipedia-english" (instead of enwiki) and "wikipedia-devanagari", which are not used anywhere else in InitialiseSettings.php [23:18:21] so if it's OK with you, we can leave it there with the patch intact, and I'll try to figure out what needs to be done differently for tomorrow? [23:18:30] but of course, reverting is not a big deal either. up to you [23:19:08] (03CR) 10jenkins-bot: Add mobile wordmark for pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451791 (https://phabricator.wikimedia.org/T200152) (owner: 10Jon Harald Søby) [23:19:10] (03CR) 10jenkins-bot: Set mobile wordmark for pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451792 (https://phabricator.wikimedia.org/T200152) (owner: 10Jon Harald Søby) [23:20:37] Jhs: I see ويکيسفر [23:22:26] huh, for me it's -en.png [23:22:33] might be a caching issue? [23:22:57] didja check on the right host? [23:24:02] yeah, I even checked both 1001 and 1002 [23:24:20] try hard-refreshing a couple times [23:25:17] yes! now it's working [23:25:32] wee [23:25:47] Ctrl+F5 did it, i thought Ctrl+R was the same (it usually works) [23:27:06] !log maxsem@deploy1001 Synchronized wmf-config: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/451792/ (duration: 00m 51s) [23:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:16] Jhs: please test ^ [23:27:36] the .png it refers to in src= doesn't exist though, but the .svg in srcset= does. it shows correctly, at least in my browser. is that a problem?
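One way to reproduce the src=/srcset= observation above from a shell instead of a browser is to check both asset URLs directly. A small sketch, assuming the wordmarks are served under /static/ on the wiki domain and that the missing raster file would be named wikivoyage-wordmark-ps.png (the .svg path is the one synced above):

```python
import requests

BASE = "https://ps.m.wikivoyage.org/static/images/mobile/copyright/"
for name in ("wikivoyage-wordmark-ps.svg", "wikivoyage-wordmark-ps.png"):
    r = requests.head(BASE + name, allow_redirects=True, timeout=10)
    print(name, r.status_code)  # expectation: 200 for the .svg, 404 for the .png
```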
[23:28:04] looks correct, yes (Y) [23:28:44] so it needs a png [23:28:46] reverting [23:29:42] thx [23:30:12] (03PS1) 10MaxSem: Revert "Set mobile wordmark for pswikivoyage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451810 [23:30:17] (03CR) 10MaxSem: [C: 032] Revert "Set mobile wordmark for pswikivoyage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451810 (owner: 10MaxSem) [23:30:18] i think that's an issue with a couple more of the logos in that folder [23:30:36] i'll make PNGs of them and add them in another patch [23:30:41] but let's do that tomorrow :) no rush [23:31:52] (03Merged) 10jenkins-bot: Revert "Set mobile wordmark for pswikivoyage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451810 (owner: 10MaxSem) [23:33:09] !log maxsem@deploy1001 Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/451810/ (duration: 00m 51s) [23:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:33] Jhs: reverted the config change but the svg is still there [23:33:38] great (Y) [23:34:32] (03CR) 10jenkins-bot: Revert "Set mobile wordmark for pswikivoyage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451810 (owner: 10MaxSem) [23:44:17] (03PS1) 10Jon Harald Søby: Add missing PNG files in mobile logo folder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451813 [23:46:40] (03PS1) 10Volans: Add dnsdisc module to manipulate DNS Discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/451814 (https://phabricator.wikimedia.org/T199079) [23:47:34] (03CR) 10jerkins-bot: [V: 04-1] Add dnsdisc module to manipulate DNS Discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/451814 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [23:49:16] (03PS2) 10Volans: Add dnsdisc module to manipulate DNS Discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/451814 (https://phabricator.wikimedia.org/T199079)
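For the follow-up of generating the missing PNGs ("i'll make PNGs of them and add them in another patch"), a small local conversion pass over the mobile logo folder is one option. A sketch only: cairosvg is just one possible tool, and the output width is a guess rather than anything the Minerva skin requires.

```python
import glob
import os
import cairosvg  # one possible converter; ImageMagick or librsvg would do too

# Write a raster fallback next to each mobile wordmark SVG that lacks one,
# roughly what the "Add missing PNG files in mobile logo folder" patch does.
for svg in glob.glob("static/images/mobile/copyright/*-wordmark-*.svg"):
    png = svg[:-4] + ".png"
    if os.path.exists(png):
        continue
    cairosvg.svg2png(url=svg, write_to=png, output_width=120)  # width is a guess
    print("wrote", png)
```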