[00:00:55] user{} and group{} in modules/scap/manifests/l10nupdate.pp has it hardcoded as 10002 [00:00:57] for uid and gid [00:01:18] Wonder why it didn't set on naos [00:02:17] RECOVERY - cassandra-b service on restbase1018 is OK: OK - cassandra-b is active [00:02:17] RECOVERY - cassandra-c service on restbase1018 is OK: OK - cassandra-c is active [00:02:22] user and group both set the gid [00:02:26] but user does not set uid [00:02:33] the gid was probably right [00:02:40] and didnt have to be fixed [00:03:09] adding uid might break beta though, because then UID conflicts with LDAP users potentially [00:03:21] 06Operations, 10Scap: Decide on /var/lib vs /home as locations of homedir for l10nupdate - https://phabricator.wikimedia.org/T163288#3192421 (10demon) [00:03:23] Ahhhhh [00:03:24] Ok [00:03:34] l10nupdate doesn't run on beta iirc [00:03:47] Since we do full scaps every ~15m [00:04:53] ok.. lets add " uid => 10002" to the user{} then [00:05:04] something tells me there was a reason for this though [00:05:17] PROBLEM - cassandra-b service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [00:05:17] PROBLEM - cassandra-c service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [00:05:25] it seems too obvious for not doing this in such a long time.. but you never know [00:05:49] let's just test it somewhere [00:05:56] (03PS1) 10Chad: l10nupdate: Ensure a default uid of 10002 [puppet] - 10https://gerrit.wikimedia.org/r/348884 [00:06:10] Should do it, but yeah lets test on beta first [00:06:44] yea, seems good. and i gotta look at tegmen first too [00:17:06] mutante: Cherry picked to beta, we'll see if it breaks shit [00:17:41] ldap also reports 10002, so hoping nothing does [00:20:23] Puppet ran fine on deployment-tin [00:22:26] (03PS1) 10BryanDavis: toollabs: iterate bigbrother job dict values not keys [puppet] - 10https://gerrit.wikimedia.org/r/348885 (https://phabricator.wikimedia.org/T163265) [00:24:53] (03CR) 10BryanDavis: toollabs: iterate bigbrother job dict values not keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/348885 (https://phabricator.wikimedia.org/T163265) (owner: 10BryanDavis) [00:25:56] (03PS1) 10Dzahn: deployment: sync home dirs from mira to naos [puppet] - 10https://gerrit.wikimedia.org/r/348886 (https://phabricator.wikimedia.org/T162900) [00:26:49] mutante: I wonder if sync'ing homedirs between deployment masters is generally a good idea [00:27:36] Just always do it [00:28:41] last time i didnt it it didn't take long until somebody missed it [00:29:51] it seems to be one of these things where the team opinion is split 50/50. 
i am doing it mainly because the ticket has last open checkbox for that [00:31:13] (03CR) 10Andrew Bogott: [C: 031] toollabs: iterate bigbrother job dict values not keys [puppet] - 10https://gerrit.wikimedia.org/r/348885 (https://phabricator.wikimedia.org/T163265) (owner: 10BryanDavis) [00:31:22] Granted, I guess if you really want something to be everywhere you can check it into dotfiles, but could easily have done something on a deploy master that you don't need everywhere but would want if tin just suddenly died with no recovery [00:32:07] RECOVERY - Check systemd state on restbase1018 is OK: OK - running: The system is fully operational [00:32:17] RECOVERY - cassandra-b service on restbase1018 is OK: OK - cassandra-b is active [00:32:17] RECOVERY - cassandra-c service on restbase1018 is OK: OK - cassandra-c is active [00:33:26] some things might be too big for dotfiles [00:34:09] Indeed [00:34:12] it would also be a difference between creating just /home/foo/mira-home-backup/ or really syncing straight into /home/, but in the latter it's the good part that all dotfiles in puppet will be overwritten anyways [00:35:04] talking about backups.. i'm looking at bacula [00:35:07] PROBLEM - Check systemd state on restbase1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:35:17] PROBLEM - cassandra-b service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [00:35:17] PROBLEM - cassandra-c service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [00:35:31] icinga-wm: yea yea, we know but flapping makes the ACK go away [00:37:56] yea, we have mira, tin (and also naos confirmed) homes in bacula [00:39:21] If we sync'd them, you'd only need one bacula job [00:39:23] Instead of N [00:40:25] permanently synced? yea.. [00:40:37] i do think that bacula is kind of smart about de-duplication though [00:40:50] but good q [00:41:12] let's just not mount /home from NFS :) [00:41:37] that would be back to old-fenari-times, wouldnt it [00:46:03] 06Operations, 10ops-eqiad: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163280#3192172 (10Eevans) According to `mdadm`, only `/dev/md0` is degraded (`/`), but `/dev/md2` (aka `/srv`) is inaccessible as well; I think `/dev/sdc` is failed. What is the ETA for replacement? ``` ``` [00:49:58] (03CR) 10Krinkle: Initial configuration for dtywiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347217 (https://phabricator.wikimedia.org/T161529) (owner: 10DatGuy) [00:51:35] 06Operations, 10ops-codfw, 13Patch-For-Review: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3178786 (10Dzahn) - backups: confirmed with bconsole that naos now exists in Bacula with the same backup sets (/home and /srv/deployment are backed up on deployment ser... 
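For reference, a minimal Puppet sketch of the uid change proposed above in Gerrit 348884 ("l10nupdate: Ensure a default uid of 10002"). Only the 10002 uid/gid values come from the log; the surrounding resource layout of modules/scap/manifests/l10nupdate.pp is assumed, not copied from the actual manifest.

    # Sketch only -- the real resource layout in modules/scap/manifests/l10nupdate.pp
    # is not shown in the log.
    group { 'l10nupdate':
        ensure => present,
        gid    => 10002,   # the gid was already hardcoded and correct
    }

    user { 'l10nupdate':
        ensure  => present,
        uid     => 10002,  # the newly added default uid from Gerrit 348884
        gid     => 10002,
        require => Group['l10nupdate'],
    }

As noted above, the main risk of pinning the uid is a collision with LDAP user uids on beta, which is why the change was cherry-picked and checked there first.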
[00:55:46] (03PS2) 10Dzahn: deployment: sync home dirs from mira to naos [puppet] - 10https://gerrit.wikimedia.org/r/348886 (https://phabricator.wikimedia.org/T162900) [00:56:32] (03CR) 10Chad: [C: 031] "This worked in beta, shouldn't have any issues in prod" [puppet] - 10https://gerrit.wikimedia.org/r/348884 (owner: 10Chad) [01:02:17] RECOVERY - cassandra-b service on restbase1018 is OK: OK - cassandra-b is active [01:02:17] RECOVERY - cassandra-c service on restbase1018 is OK: OK - cassandra-c is active [01:02:58] (03CR) 10Dzahn: [C: 032] deployment: sync home dirs from mira to naos [puppet] - 10https://gerrit.wikimedia.org/r/348886 (https://phabricator.wikimedia.org/T162900) (owner: 10Dzahn) [01:05:17] PROBLEM - cassandra-b service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [01:05:17] PROBLEM - cassandra-c service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [01:05:22] (03CR) 10Dzahn: "no-op on tin and mira. added rsyncd and config on naos, which will now accept data pushed to it from mira." [puppet] - 10https://gerrit.wikimedia.org/r/348886 (https://phabricator.wikimedia.org/T162900) (owner: 10Dzahn) [01:06:13] (03CR) 10Dzahn: "(puppet does not run the actual rsync command)" [puppet] - 10https://gerrit.wikimedia.org/r/348886 (https://phabricator.wikimedia.org/T162900) (owner: 10Dzahn) [01:07:35] (03PS2) 10Dzahn: l10nupdate: Ensure a default uid of 10002 [puppet] - 10https://gerrit.wikimedia.org/r/348884 (owner: 10Chad) [01:14:09] RainbowSprinkles: found another inconsistency, terbium has l10nupdate user, wasat does not have l10nupdate user, but we are really trying to keep them the same and they have the same roles except "openldap::management" which seems sooo unrelated [01:14:57] i wanted to check that really just deployment servers have this user and not like ALL mw, so i found "one of the 2 maintenance servers but not both" [01:15:53] probably in the past terbium had another (deployment related) role [01:16:12] and then it was adjusted to wasat but nothing deletes the user [01:16:55] but .. it does have the correct uid AND gid, so there's that [01:18:42] (03CR) 10Dzahn: "thanks for checking on beta. i had vague memories about conflicts between puppet users and LDAP users and UIDs from the past there. this i" [puppet] - 10https://gerrit.wikimedia.org/r/348884 (owner: 10Chad) [01:19:28] 06Operations, 10Cassandra, 06Services (doing): Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292#3192581 (10Eevans) [01:21:44] !log T163292: Starting removal of Cassandra instance restbase1018-a.eqiad.wmnet [01:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:21:57] T163292: Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292 [01:25:45] 06Operations, 10Cassandra, 06Services (doing): Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292#3192581 (10Dzahn) > Since these instances have already been down for some time, and no ETA for repair/replacement yet exists, down for some time? restbase1... 
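Per the review comments above, the merged "deployment: sync home dirs from mira to naos" change only adds an rsyncd module on naos so that mira can push /home to it. A hedged sketch of what such a fragment might look like; the define and parameter names here are assumptions, not copied from the actual change.

    # Hypothetical shape of the temporary rsyncd fragment added on naos;
    # the define and its parameters are assumed, not taken from the change.
    rsync::server::module { 'home':
        path        => '/home',
        read_only   => 'no',                    # naos accepts data pushed from mira
        hosts_allow => ['mira.codfw.wmnet'],
    }

Puppet only sets up the daemon configuration; the actual rsync push is run by hand, and the change is reverted again once the one-time sync is done.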
[01:26:07] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 500 (expecting: 200) [01:26:07] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /page/mobile-sections/{title}{/revision} (Get MobileApps Fo [01:26:17] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200): /page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /page/mobile-sections/{titl [01:26:17] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 500 (expecting: 200) [01:26:19] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (doing): Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292#3192600 (10Dzahn) [01:26:27] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200) [01:26:27] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200) [01:26:27] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: /page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexpected status 500 (expecting: 200) [01:26:27] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /page/revision/{revision} (Get rev by ID) is CRITICAL: [01:26:27] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexpected status 500 (expecting: 200) [01:26:27] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 
200): /page/revision/{revision} (Get rev by ID) is CRITICAL: Test [01:26:37] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 500 (expecting: 200) [01:26:37] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /page/revision/{revision} (Get rev by ID) is CRITICAL: Test [01:26:37] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev [01:26:37] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /page/revision/{revision} (Get rev by ID) is CRITICAL: Test [01:26:37] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexpected status 500 (expecting: 200) [01:26:37] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /page/title/{title} (Get rev by title from storage) is CRITICAL: Te [01:26:38] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /page/title/{title} (Get rev by title from storage) is CRITICAL: Te [01:26:38] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 500 (expecting: 200): /page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Tes [01:26:39] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [01:26:39] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html 
by title from storage returned the unexpected status 500 (expecting: 200): /page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /page/revision/{revision} (Get rev by ID) is CRITICAL: Test [01:26:40] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /page/revision/{revision} (Get rev by ID) is CRITICAL: Test [01:27:17] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [01:27:27] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [01:27:27] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [01:27:27] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [01:27:27] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [01:27:27] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [01:27:27] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [01:27:27] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [01:27:37] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [01:27:37] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [01:27:37] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [01:27:37] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [01:27:37] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [01:27:37] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [01:27:37] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [01:27:38] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [01:27:38] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [01:27:39] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [01:28:07] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [01:28:07] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [01:28:17] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [01:28:47] ah! ok, nice [01:31:55] (03CR) 10Dzahn: [C: 032] l10nupdate: Ensure a default uid of 10002 [puppet] - 10https://gerrit.wikimedia.org/r/348884 (owner: 10Chad) [01:36:02] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (doing): Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292#3192581 (10GWicke) @Dzahn, @eevan's concern is about the Cassandra instances, not the stateless RESTBase service itself. While those instances... [01:36:54] (03CR) 10Dzahn: "confirmed no-op on mira,tin,naos,wasat,terbium,.. 
(and this user doesnt exist on appservers or elsewhere)" [puppet] - 10https://gerrit.wikimedia.org/r/348884 (owner: 10Chad) [01:38:23] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (doing): Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292#3192621 (10Dzahn) @Gwicke @Eevans thanks for the explanation (and i just saw you removing it and Icinga alerts followed by recoveries. looks g... [01:40:04] 06Operations, 10Deployment-Systems, 13Patch-For-Review: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#3192622 (10Dzahn) now: https://gerrit.wikimedia.org/r/#/c/348884/ [01:40:32] 06Operations, 10Deployment-Systems: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#3192623 (10Dzahn) [01:41:06] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (doing): Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292#3192624 (10Eevans) >>! In T163292#3192598, @Dzahn wrote: >> Since these instances have already been down for some time, and no ETA for repair... [01:44:55] ACKNOWLEDGEMENT - cassandra-b service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed eevans https://phabricator.wikimedia.org/T163292 [01:45:19] ACKNOWLEDGEMENT - cassandra-c service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed eevans https://phabricator.wikimedia.org/T163292 [01:45:47] PROBLEM - HP RAID on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [01:47:27] !log rsyncing /home from mira to naos (T162900) [01:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:47:35] T162900: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900 [01:49:04] (03PS1) 10Dzahn: Revert "deployment: sync home dirs from mira to naos" [puppet] - 10https://gerrit.wikimedia.org/r/348892 [01:50:18] (03CR) 10Dzahn: [C: 032] "reverting as planned, sync is done and once now must be enough. 
(there is also bacula)" [puppet] - 10https://gerrit.wikimedia.org/r/348892 (owner: 10Dzahn) [01:50:23] (03PS2) 10Dzahn: Revert "deployment: sync home dirs from mira to naos" [puppet] - 10https://gerrit.wikimedia.org/r/348892 [01:51:03] (03PS1) 10Krinkle: Interwiki map update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348893 (https://phabricator.wikimedia.org/T145337) [01:52:20] (03CR) 10Jforrester: [C: 031] Interwiki map update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348893 (https://phabricator.wikimedia.org/T145337) (owner: 10Krinkle) [01:55:37] RECOVERY - HP RAID on restbase1014 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:1:5, Controller, Battery/Capacitor [01:56:13] !log naos: manually deleting rsyncd config remnants (puppet wouldn't know to clean up after itself) [01:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:58:11] !log naos: rsyncd is of course legitimately running on a deployment server sepearate from this (unlike in other cases where we used it for syncing during migration), so this was just the one config fragment for /home and not removing the service or anything [01:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:59:43] 06Operations, 10ops-codfw, 13Patch-For-Review: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3192704 (10Dzahn) [02:00:28] 06Operations, 10ops-codfw, 13Patch-For-Review: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3178786 (10Dzahn) [02:02:46] 06Operations, 10ops-codfw, 13Patch-For-Review: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3192706 (10Dzahn) [02:07:02] 06Operations, 10ops-codfw, 13Patch-For-Review: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3192709 (10Dzahn) [02:08:30] 06Operations, 10ops-codfw: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3178786 (10Dzahn) [02:28:00] mutante: Is https://phabricator.wikimedia.org/T79786 solved? (uid for mwdeploy on app servers) [02:28:06] (03PS6) 10Andrew Bogott: Keystone: Kill off novaobserver and novaadmin tokens after 2+ hours. [puppet] - 10https://gerrit.wikimedia.org/r/348862 (https://phabricator.wikimedia.org/T163259) [02:28:10] I assume it is, but maybe it goes by unnoticed? [02:29:54] (03CR) 10Andrew Bogott: [C: 032] Keystone: Kill off novaobserver and novaadmin tokens after 2+ hours. 
[puppet] - 10https://gerrit.wikimedia.org/r/348862 (https://phabricator.wikimedia.org/T163259) (owner: 10Andrew Bogott) [02:35:44] 06Operations, 06Release-Engineering-Team, 05DC-Switchover-Prep-Q3-2016-17: Understand the preparedness of misc services for datacenter switchover - https://phabricator.wikimedia.org/T156937#3192764 (10Krinkle) [02:36:40] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: MediaWiki Datacenter Switchover automation - https://phabricator.wikimedia.org/T160178#3192765 (10Krinkle) [02:37:01] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare and improve the datacenter switchover procedure - https://phabricator.wikimedia.org/T154658#3192766 (10Krinkle) [02:58:58] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (doing): Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292#3192793 (10Eevans) The error rate is currently quite high, a lot of timeouts starting at the point of the RAID failure: https://logstash.wiki... [03:05:10] (03PS1) 10Dzahn: icinga: wrap run-no-puppet around sync_icinga_state [puppet] - 10https://gerrit.wikimedia.org/r/348898 (https://phabricator.wikimedia.org/T163286) [03:06:21] (03PS2) 10Dzahn: icinga: wrap run-no-puppet around sync_icinga_state [puppet] - 10https://gerrit.wikimedia.org/r/348898 (https://phabricator.wikimedia.org/T163286) [03:06:55] (03PS3) 10Dzahn: icinga: wrap run-no-puppet around sync_icinga_state [puppet] - 10https://gerrit.wikimedia.org/r/348898 (https://phabricator.wikimedia.org/T163286) [03:08:29] (03CR) 10Dzahn: [C: 032] icinga: wrap run-no-puppet around sync_icinga_state [puppet] - 10https://gerrit.wikimedia.org/r/348898 (https://phabricator.wikimedia.org/T163286) (owner: 10Dzahn) [03:10:25] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (doing): Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292#3192581 (10mobrovac) Because this is happening in eqiad, where asynchronous updates are processed, I will lower the CP processing concurrency... 
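The "icinga: wrap run-no-puppet around sync_icinga_state" change merged above presumably just prefixes the existing sync command with the run-no-puppet helper so that the state sync cannot race a puppet run. A sketch of that pattern as a cron entry; the schedule and script paths are assumptions taken only from the change title, not from the real icinga manifests.

    # Sketch only: schedule and paths are assumptions, not the real manifest.
    cron { 'sync_icinga_state':
        ensure  => present,
        user    => 'root',
        minute  => '*/10',
        command => '/usr/local/sbin/run-no-puppet /usr/local/sbin/sync_icinga_state >/dev/null 2>&1',
    }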
[03:30:33] (03CR) 10Jeroen De Dauw: [C: 031] Remove https://sourcecode.berlin/feed/ from RSS whitelist for mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348782 (https://phabricator.wikimedia.org/T163217) (owner: 10Urbanecm) [03:31:37] !log mobrovac@tin Started deploy [changeprop/deploy@a19ebf8]: Temp: Decrease the transclusion update from 400 to 200 for T163292 [03:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:31:47] T163292: Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292 [03:32:31] !log mobrovac@tin Finished deploy [changeprop/deploy@a19ebf8]: Temp: Decrease the transclusion update from 400 to 200 for T163292 (duration: 00m 53s) [03:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:38:38] PROBLEM - Host lvs2001 is DOWN: PING CRITICAL - Packet loss = 100% [03:38:47] RECOVERY - Host lvs2001 is UP: PING WARNING - Packet loss = 28%, RTA = 36.10 ms [03:40:56] !log mobrovac@tin Started restart [restbase/deploy@1bfada4]: Kick RB to pick up restbase1018 instances are gone [03:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:49:57] !log mobrovac@tin Started restart [restbase/deploy@1bfada4]: (no justification provided) [03:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:53:43] !log T163292: Starting removal of Cassandra instance restbase1018-b.eqiad.wmnet [03:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:53:51] T163292: Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292 [03:54:37] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:55:11] ok looking into that ^, but it's not surprising [03:55:38] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [03:58:27] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [03:59:07] PROBLEM - cxserver endpoints health on scb2005 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200) [03:59:27] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [04:00:07] RECOVERY - cxserver endpoints health on scb2005 is OK: All endpoints are healthy [04:16:52] ACKNOWLEDGEMENT - Host ripe-atlas-eqiad is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T163243 [04:18:41] ACKNOWLEDGEMENT - Check systemd state on restbase1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T163280 [04:29:59] 06Operations, 10ops-eqiad, 10Cassandra, 13Patch-For-Review, 06Services (doing): Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292#3192867 (10Eevans) Update: After throttling the `removenode` operation, reducing transclusion concurrency, and restartin... 
[04:30:25] 06Operations, 10Monitoring, 13Patch-For-Review: Tegmen: process spawn loop + failed icinga + failing puppet - https://phabricator.wikimedia.org/T163286#3192868 (10Dzahn) > Puppet was not running because of an Icinga configuration error puppet runs alright now, no errors > couldn't find tegmen on Icinga (we... [04:32:41] 06Operations, 10ops-eqiad, 10Cassandra, 13Patch-For-Review, 06Services (doing): Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292#3192869 (10Eevans) p:05Triage>03High [04:32:56] (03CR) 10Dzahn: dnsrec/icinga: add child/parent rel between monitor hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/347984 (owner: 10Dzahn) [04:33:34] (03PS4) 10Dzahn: dnsrec/icinga: add child/parent rel between monitor hosts [puppet] - 10https://gerrit.wikimedia.org/r/347984 [04:45:27] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:45:37] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:45:57] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:45:58] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:45:58] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:46:07] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:46:07] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:46:08] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:46:08] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:46:08] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:46:08] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:46:17] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:46:17] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:46:17] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:46:27] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:46:27] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:46:27] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:46:47] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [04:46:47] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [04:46:47] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [04:46:57] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [04:46:57] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [04:46:57] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [04:46:58] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [04:46:58] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [04:46:58] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [04:47:17] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [04:47:17] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [04:47:18] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [04:47:18] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [04:47:27] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [04:48:07] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [04:48:07] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [04:48:08] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [04:59:47] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:59:47] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:59:47] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:59:58] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:59:58] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:59:58] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:00:07] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:00:08] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:00:08] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:00:08] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:00:08] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:00:08] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:00:17] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:00:17] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
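The dbstore1001 alerts above are NRPE socket timeouts rather than real replication breakage: every check recovers within a minute, and as noted below this is most likely load from the nightly backups. The task filed below (increase the timeout for the mariadb replication check) amounts to a longer check timeout; a hedged sketch, assuming the nrpe::monitor_service define accepts a timeout parameter.

    # Sketch: assumes nrpe::monitor_service takes a timeout parameter; the
    # command string is a placeholder, not the real replication check.
    nrpe::monitor_service { 'mariadb_slave_io_s1':
        description  => 'MariaDB Slave IO: s1',
        nrpe_command => 'check_mariadb_slave_io_s1',  # placeholder
        timeout      => 60,                           # up from the 10s currently timing out
    }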
[05:00:18] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:00:37] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [05:00:37] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [05:00:38] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [05:00:57] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:00:57] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [05:00:57] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [05:00:57] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:00:57] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [05:00:57] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:00:57] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [05:00:58] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:00:58] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:01:07] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:01:07] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:01:07] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [05:17:07] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 109, down: 1, dormant: 0, excluded: 2, unused: 0BRge-2/0/14: down - frdb1002BR [05:37:37] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:53:57] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3406.60 Read Requests/Sec=2312.00 Write Requests/Sec=36.60 KBytes Read/Sec=36016.40 KBytes_Written/Sec=11888.80 [05:54:22] I will silence dbstore1001, it is most likely because of the backups running [06:02:57] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=186.90 Read Requests/Sec=314.70 Write Requests/Sec=2.90 KBytes Read/Sec=7046.80 KBytes_Written/Sec=349.20 [06:05:36] 06Operations, 10DBA: Increase timeout for mariadb replication check - https://phabricator.wikimedia.org/T163303#3192927 (10Marostegui) [06:05:37] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:07:46] (03CR) 10Marostegui: "I have added the grants to silver" [puppet] - 10https://gerrit.wikimedia.org/r/348478 (owner: 10RobH) [06:12:40] (03CR) 10Marostegui: [C: 031] ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/348565 (owner: 10Dzahn) [06:20:42] (03CR) 10Marostegui: [C: 031] "Thanks for cleaning this up!" 
[puppet] - 10https://gerrit.wikimedia.org/r/348779 (owner: 10Dzahn) [06:48:27] (03CR) 10Muehlenhoff: [C: 032] "Looks good, merging" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/348857 (owner: 10Chad) [06:52:15] <_joe_> !log artificially stopping slave replication on rdb2001 for a final test of the switchover redis stage [06:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:02] <_joe_> redis alarms on rdb2001 are expected. [06:56:18] PROBLEM - Check health of redis instance on 6378 on rdb2001 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 15 keys, up 26 days 14 hours [06:56:18] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 7543106 keys, up 26 days 14 hours [06:56:59] !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t06_redis(codfw, eqiad) Switch the Redis replication [06:57:03] !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t06_redis(codfw, eqiad) Successfully completed [06:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:18] RECOVERY - Check health of redis instance on 6378 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 15 keys, up 26 days 14 hours - replication_delay is 0 [06:57:59] <_joe_> \o/ [06:58:02] <_joe_> :) [06:58:27] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:59:27] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 7528455 keys, up 26 days 14 hours - replication_delay is 0 [07:04:47] PROBLEM - Host lvs2001 is DOWN: PING CRITICAL - Packet loss = 100% [07:05:37] RECOVERY - Host lvs2001 is UP: PING WARNING - Packet loss = 80%, RTA = 36.09 ms [07:16:46] <_joe_> uhm [07:16:54] <_joe_> this happened yesterday as well [07:26:52] !log Updated the sites and site_identifiers tables on all Wikidata clients for T149522. [07:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:01] T149522: Create Wikisource Eastern Punjabi - https://phabricator.wikimedia.org/T149522 [07:52:02] (03CR) 10WMDE-leszek: [C: 031] Remove https://sourcecode.berlin/feed/ from RSS whitelist for mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348782 (https://phabricator.wikimedia.org/T163217) (owner: 10Urbanecm) [08:06:49] (03PS2) 10Urbanecm: Remove all feeds added in T127176 from RSS whitelist for mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348782 (https://phabricator.wikimedia.org/T163217) [08:09:23] (03CR) 10Dereckson: "I concur, commits message should contain:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347217 (https://phabricator.wikimedia.org/T161529) (owner: 10DatGuy) [08:13:44] (03PS3) 10Dereckson: Initial configuration for dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347217 (https://phabricator.wikimedia.org/T161529) (owner: 10DatGuy) [08:14:45] (03CR) 10Dereckson: [C: 031] "Configuration is ready." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/347217 (https://phabricator.wikimedia.org/T161529) (owner: 10DatGuy) [08:16:32] (03CR) 10Dereckson: "Commits messages should contain a "why", a short sentence to refresh memory about why we remove that without having to read again all the " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348782 (https://phabricator.wikimedia.org/T163217) (owner: 10Urbanecm) [08:40:15] 06Operations, 10Monitoring, 07LDAP, 13Patch-For-Review: allow paging to work properly in ldap - https://phabricator.wikimedia.org/T162745#3193103 (10MoritzMuehlenhoff) It also doesn't work with the OpenLDAP command line tools: Passing the "pr" control is supposed to enable paged searches, but this still bu... [08:43:17] (03PS1) 10Elukey: Set Xms value for the Hadoop Yarn Resource Manager's JVM [puppet] - 10https://gerrit.wikimedia.org/r/348915 (https://phabricator.wikimedia.org/T159219) [08:43:33] (I am not merging anything just working on some tasks :) [08:54:37] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: / 1737 MB (3% inode=84%) [09:03:03] 06Operations, 10Traffic, 10netops: lvs2001: intermittent packet loss from Icinga checks - https://phabricator.wikimedia.org/T163312#3193128 (10Volans) [09:05:46] checking ocg [09:07:16] usual post-mortem dir filled u [09:07:18] *up [09:11:32] !log cleaning up ocg1003's /srv/deployment/ocg/postmortem dir (root partition filled up) [09:11:37] RECOVERY - Disk space on ocg1003 is OK: DISK OK [09:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:37] PROBLEM - Host lvs2001 is DOWN: PING CRITICAL - Packet loss = 100% [09:17:47] RECOVERY - Host lvs2001 is UP: PING OK - Packet loss = 16%, RTA = 36.04 ms [09:19:25] (03PS1) 10Muehlenhoff: Fix configuration of size limits to allow paged LDAP search requests [puppet] - 10https://gerrit.wikimedia.org/r/348920 (https://phabricator.wikimedia.org/T162745) [09:19:53] and again... :( [09:23:30] 06Operations, 13Patch-For-Review: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442#3193169 (10elukey) Theoretically when we'll have switched over these job queue hosts will only be replicas of codfw, so it should be super fine to just re-image them one at the time. Redis on these host... [09:25:47] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 22 probes of 284 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [09:25:54] 06Operations, 10Traffic, 10netops: lvs2001: intermittent packet loss from Icinga checks - https://phabricator.wikimedia.org/T163312#3193170 (10Volans) Ping from various codfw hosts confirms packet loss: - from `lvs2004` ``` --- lvs2001.codfw.wmnet ping statistics --- 165 packets transmitted, 146 received, 1... [09:29:15] (03CR) 10Gehel: [C: 04-1] "Looks good, but let's wait until the DC switch is completed to merge it." 
[puppet] - 10https://gerrit.wikimedia.org/r/345632 (https://phabricator.wikimedia.org/T161830) (owner: 10EBernhardson) [09:30:47] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 16 probes of 284 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [09:34:00] !log installing dbus security updates [09:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:26] (03PS3) 10WMDE-leszek: Remove all feeds added in T127176 from RSS whitelist for mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348782 (https://phabricator.wikimedia.org/T163217) (owner: 10Urbanecm) [09:40:05] (03CR) 10WMDE-leszek: [C: 031] "I've added a sentence of explanation. I hope it's fine." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348782 (https://phabricator.wikimedia.org/T163217) (owner: 10Urbanecm) [09:41:56] 06Operations, 10Traffic, 10netops: lvs2001: intermittent packet loss from Icinga checks - https://phabricator.wikimedia.org/T163312#3193128 (10ema) Note that, perhaps interestingly, the number of Icmp_Outmsgs on lvs2001 reached 1000 at a certain point and then flattened there. {F7631080} [10:20:35] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure: grain-ensure erroneous mismatch with (bool)True vs (str)true - https://phabricator.wikimedia.org/T146914#3193190 (10hashar) Looked again at this one, the root cause is salt grains.set uses yaml to save the grain value and that is processed via YAML. Henc... [10:24:31] (03PS1) 10Ema: lvs: bump net.ipv4.icmp_msgs_per_sec [puppet] - 10https://gerrit.wikimedia.org/r/348925 (https://phabricator.wikimedia.org/T163312) [10:25:11] (03CR) 10Alexandros Kosiaris: [C: 031] lvs: bump net.ipv4.icmp_msgs_per_sec [puppet] - 10https://gerrit.wikimedia.org/r/348925 (https://phabricator.wikimedia.org/T163312) (owner: 10Ema) [10:28:37] (03CR) 10Alexandros Kosiaris: [C: 031] "Can't say I 've seen many applications using paged results so I am not fully clear yet on the usage but this LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/348920 (https://phabricator.wikimedia.org/T162745) (owner: 10Muehlenhoff) [10:29:35] (03CR) 10Alexandros Kosiaris: [C: 032] "Let's merge this. I 'd hate to have icinga alerts during the switchover" [puppet] - 10https://gerrit.wikimedia.org/r/348925 (https://phabricator.wikimedia.org/T163312) (owner: 10Ema) [10:29:49] (03CR) 10Ema: [V: 032 C: 032] lvs: bump net.ipv4.icmp_msgs_per_sec [puppet] - 10https://gerrit.wikimedia.org/r/348925 (https://phabricator.wikimedia.org/T163312) (owner: 10Ema) [10:30:57] PROBLEM - Check Varnish expiry mailbox lag on cp3044 is CRITICAL: CRITICAL: expiry mailbox lag is 644687 [10:33:41] (03PS2) 10Filippo Giunchedi: Switch deployment CNAMEs to naos.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/348060 [10:39:34] 06Operations, 10Traffic, 10netops, 13Patch-For-Review: lvs2001: intermittent packet loss from Icinga checks - https://phabricator.wikimedia.org/T163312#3193210 (10Volans) 05Open>03Resolved p:05Triage>03High a:03Volans Increased the max ICMP out packets to 3000 to overcome the bottleneck. Packet l... 
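The merged "lvs: bump net.ipv4.icmp_msgs_per_sec" change above addresses the lvs2001 packet loss in T163312: outbound ICMP was hitting the kernel's rate limit (Icmp_Outmsgs flattening at 1000/s), and the limit was raised to 3000. A sketch of what such a bump looks like in Puppet; the define name and where it attaches to the LVS role are assumptions.

    # Sketch: define name and placement are assumptions; 3000 is the value from
    # the task resolution, 1000 is the kernel default that was being hit.
    sysctl::parameters { 'lvs-icmp-ratelimit':
        values => {
            'net.ipv4.icmp_msgs_per_sec' => 3000,
        },
    }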
[10:41:29] !log depool varnish-be on cp3044 because of mailbox lag issues [10:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:41] 06Operations: Four different PHP/HHVM versions on the cluster - https://phabricator.wikimedia.org/T163278#3193215 (10fgiunchedi) Looking at the situation on naos, it looks like an accidental upgrade via `hhvm-dbg` Initial puppet run, install `hhvm` ``` Start-Date: 2017-04-14 00:01:24 Commandline: /usr/bin/apt... [10:47:18] 06Operations, 10Monitoring, 10Traffic: Add node_exporter ipvs ipv6 support - https://phabricator.wikimedia.org/T160156#3193217 (10fgiunchedi) Upstream issue: https://github.com/prometheus/procfs/issues/40 [10:50:29] (03CR) 10Muehlenhoff: "Paging is quite popular in Active Directory. Our WMF use case is something that Andrew and Bryan are working on for Labs, probably Stryker" [puppet] - 10https://gerrit.wikimedia.org/r/348920 (https://phabricator.wikimedia.org/T162745) (owner: 10Muehlenhoff) [10:50:46] 06Operations: Four different PHP/HHVM versions on the cluster - https://phabricator.wikimedia.org/T163278#3192140 (10Joe) terbium will be upgraded to jessie as soon as we've switched over, for the record. [10:50:57] RECOVERY - Check Varnish expiry mailbox lag on cp3044 is OK: OK: expiry mailbox lag is 0 [10:51:09] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 06DC-Ops: analytics1030 stuck in console while booting - https://phabricator.wikimedia.org/T162046#3193224 (10elukey) @Cmjohnson do we need to order replacement parts for this host or is it simply into an inconsistent state? [10:56:48] <_joe_> !log running the warmup stage in codfw for final testing [10:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:59] 06Operations: Four different PHP/HHVM versions on the cluster - https://phabricator.wikimedia.org/T163278#3193229 (10elukey) The only thing that we need to upgrade (and probably @MoritzMuehlenhoff has already scheduled it) are the mwdebug servers, since the rest is Trusty and I don't believe that we'll do any at... [10:57:22] !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t04_cache_wipe(eqiad, codfw) wipe and warmup caches [10:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:55] 06Operations: Four different PHP/HHVM versions on the cluster - https://phabricator.wikimedia.org/T163278#3193231 (10MoritzMuehlenhoff) It's unproblematic to also upgrade mwdebug* to 3.18.2, the only difference is a backported patch which only shows up in production load after 4-5 hours. The deployment servers... [10:59:54] 06Operations: Four different PHP/HHVM versions on the cluster - https://phabricator.wikimedia.org/T163278#3193233 (10fgiunchedi) I've downgraded hhvm-related packages back to their non-experimental version. It looks like the root cause is `experimental` and `main` components of `jessie-wikimedia` having the sam... [11:01:34] 06Operations, 10Monitoring, 13Patch-For-Review: Tegmen: process spawn loop + failed icinga + failing puppet - https://phabricator.wikimedia.org/T163286#3193236 (10Volans) @akosiaris: I've found that the catalog for `tegmen` doesn't have `Nagios_Host` and `Nagios_Service` resources and I think this is due bec... 
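On the four-PHP/HHVM-versions issue above: hhvm-dbg pulled the experimental hhvm build onto naos, and the follow-up note on T158583 below points at APT priority between the main and experimental components of jessie-wikimedia. The applied fix was simply downgrading the packages, but one way to guard against a repeat is pinning; this is a sketch only, and the define name, pin expression and priority value are all assumptions.

    # Illustrative mitigation, not the applied fix (the packages were downgraded);
    # pin expression and priority value are assumptions.
    apt::pin { 'hhvm-not-from-experimental':
        package  => 'hhvm*',
        pin      => 'release c=experimental',
        priority => 100,   # below the default 500, so main stays preferred
    }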
[11:03:02] !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t04_cache_wipe(eqiad, codfw) Successfully completed [11:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:00] 06Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3193243 (10fgiunchedi) A related issue discovered in T163278 is a consideration of APT priority between components (and/or distros, if multiple) so that packages are picked up from the right place in all cases... [11:14:30] (03PS1) 10Filippo Giunchedi: Switch deployment server to naos.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/348927 [11:23:40] !log add naos to git-deploy term on common-infrastructure4 - T162900 [11:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:49] T162900: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900 [11:24:24] 06Operations, 10ops-codfw: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3193274 (10fgiunchedi) [11:25:02] 06Operations, 07Puppet, 10Deployment-Systems, 07Beta-Cluster-reproducible: grain-ensure erroneous mismatch with (bool)True vs (str)true - https://phabricator.wikimedia.org/T146914#3193275 (10hashar) [11:25:12] (03PS1) 10Hashar: salt: fix grain-ensure comparison [puppet] - 10https://gerrit.wikimedia.org/r/348928 (https://phabricator.wikimedia.org/T146914) [11:27:37] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/featured/{yyyy}/{mm}/{dd} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 400 (expecting: 200) [11:27:47] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:28:04] <_joe_> uhm [11:28:27] PROBLEM - cassandra-c SSL 10.192.48.48:7001 on restbase2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:28:37] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [11:29:07] PROBLEM - cassandra-b service on restbase2010 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [11:29:07] PROBLEM - Check systemd state on restbase2010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:29:17] PROBLEM - cassandra-b SSL 10.192.16.187:7001 on restbase2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:29:17] PROBLEM - cassandra-b CQL 10.192.16.187:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.187 and port 9042: Connection refused [11:29:37] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [11:30:01] (03CR) 10Muehlenhoff: [C: 031] Switch deployment server to naos.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/348927 (owner: 10Filippo Giunchedi) [11:30:17] PROBLEM - Check systemd state on restbase2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:30:17] PROBLEM - cassandra-c CQL 10.192.48.48:9042 on restbase2005 is CRITICAL: connect to address 10.192.48.48 and port 9042: Connection refused [11:30:17] PROBLEM - cassandra-c service on restbase2005 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [11:30:25] this might be the "usual OOM" [11:30:45] <_joe_> yes, but that is the dc serving users [11:30:52] <_joe_> so it's worrying [11:31:57] 2005 logged java.lang.OutOfMemoryError: Java heap space [11:32:12] <_joe_> yeah I was looking as well [11:32:18] _joe_ so restbase async is in eqiad now right ? [11:32:33] <_joe_> yes [11:32:53] <_joe_> elukey: https://config-master.wikimedia.org/discovery/discovery-basic.yaml [11:33:27] new this year: config-master is running codfw only :) [11:33:51] <_joe_> yes, thanks to our new puppetmaster architecture [11:34:58] definitely something new, I don't see the tombstone logs this time [11:35:07] RECOVERY - cassandra-b service on restbase2010 is OK: OK - cassandra-b is active [11:35:07] RECOVERY - Check systemd state on restbase2010 is OK: OK - running: The system is fully operational [11:35:11] (as expected since they are mostly restbase-async related) [11:36:10] !log repool varnish-be on cp3044 [11:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:17] RECOVERY - cassandra-b SSL 10.192.16.187:7001 on restbase2010 is OK: SSL OK - Certificate restbase2010-b valid until 2017-11-17 00:54:25 +0000 (expires in 211 days) [11:36:41] !log oblivian: Setting swift-rw in codfw UP [11:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:58] !log oblivian: Setting swift-rw in eqiad DOWN [11:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:35] running puppet on 2005 to bring cassandra up [11:38:47] PROBLEM - Check Varnish expiry mailbox lag on cp3037 is CRITICAL: CRITICAL: expiry mailbox lag is 658484 [11:39:17] RECOVERY - Check systemd state on restbase2005 is OK: OK - running: The system is fully operational [11:39:17] RECOVERY - cassandra-c service on restbase2005 is OK: OK - cassandra-c is active [11:39:37] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on naos is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging [11:40:27] RECOVERY - cassandra-c SSL 10.192.48.48:7001 on restbase2005 is OK: SSL OK - Certificate restbase2005-c valid until 2017-09-12 15:35:38 +0000 (expires in 146 days) [11:41:52] wah wah, naos is me [11:43:21] (03CR) 10Hashar: "I have cherry picked it on the beta puppet master. The deployment_server: true grain is now properly recognized which prevents puppet fro" [puppet] - 10https://gerrit.wikimedia.org/r/348928 (https://phabricator.wikimedia.org/T146914) (owner: 10Hashar) [11:43:37] <_joe_> godog: is naos ready? [11:45:48] 06Operations, 07Puppet, 10Deployment-Systems, 07Beta-Cluster-reproducible, 13Patch-For-Review: grain-ensure erroneous mismatch with (bool)True vs (str)true - https://phabricator.wikimedia.org/T146914#3193330 (10hashar) https://gerrit.wikimedia.org/r/#/c/348928/ changes grain-ensure so it normalizes the v... 
[11:46:12] (03PS1) 10Marostegui: mariadb: Start puppetizing tendril users [puppet] - 10https://gerrit.wikimedia.org/r/348930 [11:47:28] _joe_: I'm double checking /srv and it seems mostly in order yeah [11:48:20] <_joe_> godog: ref is last time we brought up a deployment server I remember some issues happened in /srv/mediawiki-staging [11:48:44] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/6173/" [puppet] - 10https://gerrit.wikimedia.org/r/348930 (owner: 10Marostegui) [11:49:30] (03PS2) 10Marostegui: mariadb: Start puppetizing tendril mysql users [puppet] - 10https://gerrit.wikimedia.org/r/348930 (https://phabricator.wikimedia.org/T148955) [11:49:37] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on naos is OK: Files ownership is ok. [11:50:33] _joe_: yeah I think it had to do with some "service" users not having fixed uid and rsync not doing the right thing since we're running the server in a chroot [11:51:27] <_joe_> godog: so let's check the service users have the same uids :) [11:51:34] <_joe_> I'll do it [11:52:00] I'm checking it as we speak, mwdeploy was the last one missing afaict [11:54:33] <_joe_> bacula, diamond and prometheus have different uids, but they don't really matter [11:56:08] indeed [11:58:48] RECOVERY - Check Varnish expiry mailbox lag on cp3037 is OK: OK: expiry mailbox lag is 6 [12:09:43] 06Operations, 06Performance-Team: Some Core availability Catchpoint tests might be more expensive than they need to be - https://phabricator.wikimedia.org/T162857#3193381 (10Peter) I did some checks, in *Core Services Availability* we do two payment checks with Chrome against the API and check if it returns OK... [12:17:08] So, if it’s a *fail*over test shouldn’t you just, like, flip the breaker and see what happens? [12:19:53] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2724967 (10Marostegui) Cool, I will talk to Jaime tomorrow in our weekly meeting and we will try to see how to fit our stuff before/after it. I w... [12:26:45] Revent: yeah a couple of things stand in the way of that: (1) Not all of our non-user-facing infrastructure is yet redundant, so we're only testing the parts that are directly facing users, not all of the other miscellaneous smaller things still living only in 1 DC effectively (all of which are slowly migrating towards similar levels of redundancy) and (2) Any failover that begins with an actu [12:26:51] al outage (a real one, or us flipping off a big power switch) will definitely impact users' ability to reach us for at least a short period of time, whereas these tests are designed to be relatively smooth and non-user-impacting (other than the brief readonly period for editors), so that we can hopefully exercise them increasingly-often as we continue to improve our redundant infra. [12:27:12] bblack: I was totally kidding. :) [12:27:21] yeah but it's a legitimate question :) [12:27:42] bblack: The proper method to test a failure mode is not to imitate Chernobyl. [12:28:25] I suspect someday we'll get closer, though. One of the long-term ideals in my head is we get to where we could (not routinely!) conduct a test where we shut off all the external network connectivity of one of the core DCs and handle it fine. [12:28:55] but it would be user-impacting for a handful of minutes while our stuff and the internet reacts to that [12:29:01] Hey, backhoe attenuation is real. 
:/ [12:29:32] we try to design around backhoe attenuation, but it doesn't always work [12:29:51] (we've had cases before where redudant links supposedly on different paths from different vendors were taken out by a single fiber cut!) [12:29:57] PROBLEM - Check Varnish expiry mailbox lag on cp3044 is CRITICAL: CRITICAL: expiry mailbox lag is 592497 [12:30:55] Years ago, on Slashdot… “Always carry a length of fiber-optic cable in your pocket. Should you be shipwrecked and find yourself stranded on a desert island, bury the cable in the sand. A few hours later, a guy driving a backhoe will be along to dig it up. Ask him to rescue you.” [12:38:52] !log T163292: Starting removal of Cassandra instance restbase1018-c.eqiad.wmnet [12:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:02] T163292: Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292 [12:44:17] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route returned the unexpected status 500 (expecting: 200): /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) is CRITICAL: Test retrieve all events on January 15 returned [12:44:17] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) is CRITICAL: Test retrieve all events on January 15 returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status [12:44:17] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) is CRITICAL: Test retrieve all events on January 15 returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected stat [12:44:27] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve the selected anniversaries for January 15) is CRITICAL: Test retrieve the selected anniversaries for January 15 returned the unexpected status 500 (expecting: 200) [12:44:27] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200): /page/title/{title} (Get r [12:44:27] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 500 (expecting: 200): /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) is CRITICAL: Test retrieve all events on January 15 returned the une [12:44:27] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/media/{title} 
(retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test ret [12:44:27] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200): /page/revision/{revision [12:44:27] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexpected status 500 (expecting: 200) [12:44:28] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test ret [12:44:37] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead returned the unexpected status 500 (expecting: 200): /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve the selected anniversaries for January 15) is CRIT [12:44:37] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITIC [12:44:37] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200): /page/revision/{revision} [12:44:37] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) is CRITICAL: Test retrieve all events on January 15 returned the unexpected status 500 (expecting: 200): /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve the selected anniversaries for January 15) is CRITICAL: Test retrieve the selected anniversaries for January 15 returned the une [12:44:37] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is 
CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200): /page/title/{title} (Get r [12:44:38] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /page/mobile-sections/{title}{/revision} (Get MobileAp [12:44:38] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test ret [12:45:17] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [12:45:17] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [12:45:17] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [12:45:27] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [12:45:28] PROBLEM - Check Varnish expiry mailbox lag on cp2014 is CRITICAL: CRITICAL: expiry mailbox lag is 645807 [12:45:28] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [12:45:28] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [12:45:28] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [12:45:28] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [12:45:28] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [12:45:37] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [12:45:37] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [12:45:37] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [12:45:37] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [12:45:37] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [12:45:37] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [12:45:37] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy [12:45:38] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [12:45:38] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [12:45:40] what is this? [12:45:42] urandom: is that you? [12:46:04] 06Operations, 10Monitoring, 13Patch-For-Review: Tegmen: process spawn loop + failed icinga + failing puppet - https://phabricator.wikimedia.org/T163286#3193430 (10akosiaris) >>! In T163286#3192868, @Dzahn wrote: >> Puppet was not running because of an Icinga configuration error > puppet runs alright now, no... 
[12:46:17] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:46:37] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [12:46:38] restbase and scb boxes ? [12:47:09] paravoid: looking [12:49:01] mobileapps misbehaving and restbase reporting it as well [12:49:17] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:50:18] paravoid, bblack, librenms is showing interface errors to lvs2002-eth2 : https://librenms.wikimedia.org/device/device=97/tab=port/port=9967/ [12:50:38] not sure if known issue, but mentioning it because of the switchover [12:51:02] XioNoX: yeah, but it is in the ~10mp/s range [12:51:15] elukey noticed it this morning as well [12:51:41] XioNoX: worth to open a phab task imo :) [12:51:58] I am thinking it should not block the swithover but definitely worth investigating a bit more [12:52:05] it's seems to be happening for quite a while [12:53:06] I don't see anything in logstash for mobileapps worth mentioning. Some dead worker restarts by that's about it [12:53:40] ah but the per host logs have something [12:54:04] <_joe_> akosiaris: it's typically some error from the backend they're calling; [12:54:16] _joe_: it's mobileapps.. not citoid [12:54:29] the backend for mobileapps is .. well us [12:54:44] <_joe_> akosiaris: mobileapps is basically a mesh of results from the API and other services [12:54:48] <_joe_> via restbase IIRC [12:55:39] not the API, parsoid [12:55:48] via restbase ofc as you point out [12:56:10] akosiaris@scb2004:/srv/log/mobileapps$ grep '"message":"500:' main.log |wc -l [12:56:10] 2799 [12:56:17] nice [12:56:29] oh crap.. those logs are old [12:56:35] well not rotated, not old [12:56:36] grrr [12:57:00] elukey: did you open a task? Framing errors will most likely going to need to re-seat or replace optics. [12:57:10] <_joe_> akosiaris: why not rotated? [12:57:28] XioNoX: nope not yet [12:57:47] <_joe_> it's 159Mb [12:58:07] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [12:58:11] _joe_: and containing entries since 2016 [12:58:26] <_joe_> and this ^^ is a cassandra problem, probably [12:58:35] akosiaris@scb2004:/srv/log/mobileapps$ grep '"message":"500:' main.log | head -1 [12:58:35] ..."time":"2016-10-20T08:47:19.063Z" [12:58:54] _joe_: https://github.com/wikimedia/puppet/blob/production/modules/systemd/templates/logrotate.erb [12:58:57] it's the size parameter [12:59:07] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [12:59:11] size and daily and incompatible [12:59:14] we should remove size [12:59:27] elukey: BTW.... 
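On the logrotate point above: the plain size directive rotates purely on size and does not combine with a daily schedule, which is the incompatibility being discussed; maxsize is the variant that works together with daily, and copytruncate sidesteps the service-reload problem that comes up just below. An illustrative stanza, not the real logrotate.erb template, with an assumed file name:

    # Example for a file like /etc/logrotate.d/mobileapps (name assumed):
    # rotate daily, or sooner once the file passes 100M, truncating in place
    # so the service needs no reload.
    #   /srv/log/mobileapps/main.log {
    #       daily
    #       maxsize 100M
    #       rotate 7
    #       compress
    #       delaycompress
    #       copytruncate
    #       missingok
    #   }
    # Dry-run a candidate file before deploying it:
    sudo logrotate -d /etc/logrotate.d/mobileapps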
[12:59:27] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [12:59:33] I 've already done in another patch but it turned out it was for ORES only [12:59:42] (just to glance at for a sec, no action needed) [12:59:47] https://commons.wikimedia.org/wiki/Special:TimedMediaHandler [12:59:58] *happiness* [13:00:13] _joe_: and then stumbled across the mess of systemctl reload not working currently for services cause they don't have an ExecReload= stanza [13:00:18] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200) [13:00:26] Revent: :) [13:00:27] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [13:00:32] <_joe_> akosiaris: correctly, as they don't reload [13:00:37] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [13:00:46] _joe_: which practically makes rotating the logs a mess [13:00:47] PROBLEM - HP RAID on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [13:01:10] _joe_: cause we don't want to restart them to get the logs rotated.. cause then... outage [13:01:38] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200) [13:02:01] <_joe_> akosiaris: ok [13:02:17] :-( [13:02:17] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [13:02:37] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [13:02:37] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200) [13:02:37] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [13:02:41] elukey: That included someone (mostly Dispenser) going through and purging *every* video [13:02:43] I think you are correct about this being cassandra related btw [13:02:53] it no longer looks like mobileapps but rather restbase [13:03:02] <_joe_> told you :) [13:03:07] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [13:03:12] the /page/summary endpoint is not provided by mobileapps IIRC [13:03:37] 
RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [13:03:37] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [13:03:41] <_joe_> akosiaris: nope, it also says "from storage" [13:03:49] yup [13:03:52] which is a good pointer [13:03:59] i'm still looking at that; i think there is more than one thing at play [13:04:27] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200) [13:04:33] 06Operations, 10ops-codfw: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3193479 (10fgiunchedi) [13:04:35] we're removing the node instances on 1018 because the raid array failed [13:04:37] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200) [13:04:37] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [13:04:37] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [13:04:41] <_joe_> urandom: let us know how we can help [13:04:47] (03CR) 10Hashar: [C: 031] "Gehel CI does not run rake spec yet. So this change can be merged at anytime ;-}" [puppet] - 10https://gerrit.wikimedia.org/r/345849 (owner: 10Hashar) [13:04:57] having the down nodes was causing problems, but the topology change is too [13:05:37] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [13:05:41] and (unrelated?) 
there were some node crashes in codfw that aren't back up yet, i think because they're trying to join the ring while a topology change is on-going [13:06:07] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [13:06:10] not sure about that, they do look like they'll come backup eventually (they're working on it) [13:06:27] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [13:06:37] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [13:07:07] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [13:07:09] (03PS1) 10Elukey: Refactor role::piwik in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/348938 (https://phabricator.wikimedia.org/T159136) [13:07:18] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [13:07:27] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [13:07:37] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200) [13:07:37] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [13:08:05] urandom: what's the status? [13:08:07] (03CR) 10jerkins-bot: [V: 04-1] Refactor role::piwik in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/348938 (https://phabricator.wikimedia.org/T159136) (owner: 10Elukey) [13:08:12] urandom: we have the switchover in... 
52 minutes [13:08:37] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [13:08:37] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [13:09:33] 06Operations, 06DC-Ops, 10Traffic, 10netops: Interface errors on asw-c-codfw:xe-7/0/46 - https://phabricator.wikimedia.org/T163323#3193493 (10ayounsi) [13:09:37] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [13:09:37] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [13:09:37] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200) [13:10:37] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [13:10:37] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [13:10:37] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [13:10:37] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [13:10:57] urandom: I can see only restbase2010-b still DN in codfw, is it only a eqiad DC related issue? (Affecting only restbase-async) [13:11:13] (due to 1018 marked down) [13:11:17] RECOVERY - cassandra-b CQL 10.192.16.187:9042 on restbase2010 is OK: TCP OK - 0.036 second response time on 10.192.16.187 port 9042 [13:11:17] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [13:11:30] ah there you go, 2010 up [13:11:37] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [13:12:00] elukey: 2005-c too, but i think it'll come back up now [13:12:26] urandom: it seems UN now in status [13:12:37] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200) [13:13:11] paravoid: so, i think this is limited to eqiad now [13:13:17] RECOVERY - cassandra-c CQL 10.192.48.48:9042 on restbase2005 is OK: TCP OK - 0.036 second response time on 10.192.48.48 port 9042 [13:13:34] 06Operations, 06DC-Ops, 10Traffic, 10netops: Interface errors on asw-c-codfw:xe-7/0/46 - https://phabricator.wikimedia.org/T163323#3193512 (10BBlack) To do a soft-ish failover, on lvs2002 we can disable the puppet agent and stop pybal temporarily, wait a few minutes for traffic to settle over to lvs2005, a... 
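The soft failover outlined in the task comment above comes down to stopping pybal on the primary LVS so its advertised routes drop and the backup takes the traffic, while keeping puppet from restarting it. Roughly, with the service name and timings assumed rather than taken from an actual runbook:

    # On lvs2002: stop pybal and hold puppet off so it stays stopped.
    sudo puppet agent --disable 'temporary failover to lvs2005 for interface errors'
    sudo systemctl stop pybal
    # Give traffic a few minutes to settle onto the backup LVS.
    sleep 300
    # ... re-seat/replace the optic, or whatever the intrusive work is ...
    sudo systemctl start pybal
    sudo puppet agent --enable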
[13:13:37] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200) [13:13:53] urandom: I can see errors like "Error: User restb has no SELECT permission on " for restbase1012 in logstash [13:13:57] yes [13:14:07] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [13:14:18] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200) [13:14:27] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200) [13:14:29] the removenode hung because we had nodes down in codfw, and the codfw wouldn't come up because we had a topology change [13:14:37] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [13:14:37] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [13:14:37] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [13:14:48] sigh [13:14:51] so i terminated the remove node, and now we're down one replica for system_auth [13:15:07] which is causing some errors in eqiad, but codfw should be OK [13:15:07] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [13:15:23] (which is where we're serving client reads, afaik) [13:15:33] yep yep [13:15:37] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [13:15:43] yeah but even the alert noise is *very* unfortunate [13:15:50] paravoid: yes [13:16:00] i'm running repairs on system_auth now [13:16:08] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [13:16:27] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [13:16:37] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200) [13:17:37] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [13:18:07] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [13:18:20] (03CR) 10Gehel: [C: 031] "@hashar: Noted, I'll add this to 
my list of pending merges..." [puppet] - 10https://gerrit.wikimedia.org/r/345849 (owner: 10Hashar) [13:18:37] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [13:19:07] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [13:19:15] paravoid: i'm not sure what else to do atm, elukey once fixed this by adding and removing the user, but that would almost certainly impact codfw [13:19:17] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [13:19:25] urandom: (just to understand) - 500s are due to instances using 1018-c to verify system_auth credentials ? [13:19:27] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200) [13:19:37] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [13:19:37] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200) [13:19:57] RECOVERY - Check Varnish expiry mailbox lag on cp3044 is OK: OK: expiry mailbox lag is 1669 [13:20:17] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:20:24] elukey: no, but that raises a good point [13:20:27] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [13:20:37] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [13:20:37] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [13:20:46] those instances have been down, i wonder, did that cache just expire? [13:20:56] what are these Text HTTP 5xxs? 
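One reading of the cache question just above: Cassandra caches permission and role lookups, so clients can keep passing auth checks for a while even when the system_auth replicas behind them are unreachable, and the "no SELECT permission" errors appear once entries expire and cannot be refreshed. The relevant knobs are standard cassandra.yaml settings; paths and the per-instance layout on these hosts may differ:

    # How long credentials/permissions stay cached on an instance.
    grep -E '^(permissions|roles|credentials)_validity_in_ms' /etc/cassandra/cassandra.yaml
    # How many system_auth replicas are currently up, from this node's view.
    nodetool status system_auth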
[13:21:13] <_joe_> paravoid: I guess rb-related, I'm checking [13:21:27] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200) [13:21:32] urandom: from restbase1012: There was an error when trying to connect to the host 10.64.48.100 (that is 1018-c) [13:21:37] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [13:21:37] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [13:21:37] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200) [13:21:37] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [13:22:04] urandom: so what's the current status? are you expecting this to fix itself or are you doing something to fix it yourself? [13:22:20] i'm running repairs on the system_auth tables [13:22:26] they... take a while [13:22:27] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [13:22:37] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [13:22:37] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [13:22:59] how much is a while? [13:23:07] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [13:23:15] i'm trying to figure that out, sec [13:23:37] <_joe_> yeah a few 5xx now, but mostly rb-related [13:23:49] so this is actually user-visible then? [13:23:49] 06Operations, 10Monitoring, 13Patch-For-Review: Tegmen: process spawn loop + failed icinga + failing puppet - https://phabricator.wikimedia.org/T163286#3193528 (10Volans) >>! In T163286#3193430, @akosiaris wrote: > What was that configuration error ? Sorry my bad, I was probably too tired to copy, paste and... [13:24:08] <_joe_> also a few xhgui-related, but those should be on misc [13:24:37] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [13:24:39] (03CR) 10Andrew Bogott: [C: 031] "Looks ok to me -- have you already done a test of this by hand on labtest? If not, I can do it." 
[puppet] - 10https://gerrit.wikimedia.org/r/348920 (https://phabricator.wikimedia.org/T162745) (owner: 10Muehlenhoff) [13:25:05] paravoid: this should be restbase-async related if I got it correctly [13:25:08] <_joe_> paravoid: it is, I think, but the number is small enough [13:25:27] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200) [13:25:36] both _joe_ and bblack found user-facing 5xxs for restbase [13:25:37] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200) [13:25:37] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200) [13:25:37] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200) [13:25:54] yep yep going to shut up :) [13:25:55] (03CR) 10Muehlenhoff: "I had tested this on labtestcontrol, but please doublecheck, I should have re-enabled puppet to restore the original slapd.conf" [puppet] - 10https://gerrit.wikimedia.org/r/348920 (https://phabricator.wikimedia.org/T162745) (owner: 10Muehlenhoff) [13:26:37] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [13:26:37] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [13:26:37] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [13:26:43] urandom: this looks very similar to the AQS mess, I am wondering if the final fix will be to restore restb perms manually [13:28:02] elukey: just did that [13:28:07] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [13:28:11] can you please !log? [13:28:27] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [13:28:47] and also please explain a little bit what's the current impact because we have conflicting information I think [13:28:49] !log cqlsh -f /etc/cassandra/adduser.cql, recreating user/perms (as-needed) [13:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:05] paravoid: that is idempotent [13:29:16] it shouldn't do anything unless something needs to be done [13:29:22] I think you guys said before that this shouldn't be affecting users which are codfw-only now, but 5xxs indicate otherwise I think? [13:29:57] PROBLEM - cassandra-a SSL 10.64.48.138:7001 on restbase1015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [13:30:05] also, still waiting to hear what's "a while" going to be? [13:30:06] what the hell [13:30:07] PROBLEM - Check systemd state on restbase1015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:30:12] is it minutes, hours, days? 
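In sketch form, the two actions referenced above: the per-keyspace repair that "takes a while", and the general shape of an idempotent user/permissions file like the adduser.cql that was replayed. The real file is not shown in the log; the role name restb comes from the earlier error, everything else is illustrative:

    # Repair only the auth keyspace rather than all data on the node.
    nodetool repair system_auth
    # An idempotent grants file can be replayed safely: IF NOT EXISTS is a
    # no-op when the role already exists, and re-issuing a GRANT changes
    # nothing. For example:
    #   CREATE ROLE IF NOT EXISTS restb WITH LOGIN = true AND PASSWORD = '...';
    #   GRANT SELECT ON KEYSPACE some_keyspace TO restb;
    # Replayed with:
    cqlsh -f /etc/cassandra/adduser.cql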
[13:30:17] PROBLEM - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [13:30:27] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200) [13:30:37] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [13:30:47] paravoid: since i'm still trying to figure this out, you'd better think worst case [13:30:47] PROBLEM - cassandra-a CQL 10.64.48.138:9042 on restbase1015 is CRITICAL: connect to address 10.64.48.138 and port 9042: Connection refused [13:31:00] what does that mean? [13:31:03] like that, 1015-c... i have no idea [13:31:07] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200) [13:31:09] none [13:31:14] (03CR) 10Andrew Bogott: [C: 031] "I tested this and it doesn't break anything obvious. I'll merge when I'm back from running an errand." [puppet] - 10https://gerrit.wikimedia.org/r/348920 (https://phabricator.wikimedia.org/T162745) (owner: 10Muehlenhoff) [13:31:37] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [13:31:44] what does "worst case" mean? [13:31:52] you asked minutes, hours, days [13:31:54] I guess no ETA [13:31:56] <_joe_> paravoid: ok the 5xx spikes are due to what follows: something requires a ton of pages on mobileapps for unsupported languages [13:32:01] urandom: qq - I can still see logs related to "failed to contact 1018-c" - anything that we can do to make Cassandra undestand that it should not contact it? [13:32:26] paravoid: so assume hours [13:32:31] _joe_ "something" is external? [13:32:37] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [13:33:07] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [13:33:08] <_joe_> jynus: not sure, I'm looking into it [13:33:21] urandom: [13:33:21] 16:28 <@paravoid> and also please explain a little bit what's the current impact because we have conflicting information I think [13:33:25] 16:29 <@paravoid> I think you guys said before that this shouldn't be affecting users which are codfw-only now, but 5xxs indicate otherwise I think? 
[13:33:27] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [13:33:37] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200) [13:33:41] I will try to help, although 5XX seem to not be ongoing anymore [13:34:07] RECOVERY - Check systemd state on restbase1015 is OK: OK - running: The system is fully operational [13:34:17] RECOVERY - cassandra-a service on restbase1015 is OK: OK - cassandra-a is active [13:34:27] there is an increasing amount of total requests, though [13:34:37] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [13:34:37] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [13:34:42] to be clear about the user impact, it is from user queries routing via restbase.codfw.wmnet getting 500s (as opposed to some kind of misroute of requests to eqiad or some such) [13:34:51] <_joe_> so out of the last 30 K 5xx entries, 23.7K are for /api/rest_v1/ [13:34:55] <_joe_> talking about varnish [13:35:19] most of the remainder are known error classes on upload.wm.o, phab, perf, etc [13:35:19] paravoid: i'm looking at https://grafana-admin.wikimedia.org/dashboard/db/restbase?from=now-12h&to=now&panelId=14&fullscreen&orgId=1 [13:35:22] there is an older spike, and a new one sice :27 [13:35:36] paravoid: and i'm not seeing a high rate of 5xxs, which are you referring to? 
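A quick way to tell whether check failures like the ones above are user-visible is to hit the same endpoint through the public REST API (title picked arbitrarily):

    # The icinga checks exercise /page/summary/{title}; the public equivalent:
    curl -s -o /dev/null -w '%{http_code}\n' \
        'https://en.wikipedia.org/api/rest_v1/page/summary/Cat'
    # 200 means this read path is fine right now; intermittent 5xx here lines
    # up with the varnish error graphs being compared.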
[13:35:37] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [13:35:37] https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?refresh=5m&panelId=22&fullscreen&orgId=1 [13:35:49] urandom: two lines above :) [13:36:14] or the bottom panels on: https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [13:36:14] _joe_ and bblack's reports is what I'm referring to [13:36:28] bblack@oxygen:/srv/log/webrequest$ tail -10000 5xx.json|grep -v upload|grep -v performance|grep -v phab.wmfuser|wc -l [13:36:31] 7120 [13:36:33] bblack@oxygen:/srv/log/webrequest$ tail -10000 5xx.json|grep -v upload|grep -v performance|grep -v phab.wmfuser|grep '/api/rest'|wc -l [13:36:36] 6688 [13:37:17] ^ this means that basically, if we exclude other known error sources (upload.wikimedia.org , performance.wikimedia.org , phab.wmfusercontent.org - all throwing some minor errors and totally unrelated to cache_text ), RB accounts for ~94% of the 5xx on the cache_text graphs [13:40:37] (03PS3) 10Giuseppe Lavagetto: Switch master DC from eqiad to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346251 [13:40:45] (03CR) 10Volans: [C: 04-1] "Today is switchover day, please do not merge puppet changes" [puppet] - 10https://gerrit.wikimedia.org/r/348920 (https://phabricator.wikimedia.org/T162745) (owner: 10Muehlenhoff) [13:40:47] RECOVERY - HP RAID on restbase1014 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:1:5, Controller, Battery/Capacitor [13:41:59] (03PS1) 10Ayounsi: LibreNMS macro for T133852 and T80273 [puppet] - 10https://gerrit.wikimedia.org/r/348941 [13:42:07] RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 109, down: 0, dormant: 0, excluded: 3, unused: 0 [13:42:31] urandom: status? [13:42:49] 06Operations, 10Monitoring, 13Patch-For-Review: Tegmen: process spawn loop + failed icinga + failing puppet - https://phabricator.wikimedia.org/T163286#3193556 (10akosiaris) >>! In T163286#3193528, @Volans wrote: >>>! In T163286#3193430, @akosiaris wrote: >> What was that configuration error ? > > Sorry my... [13:42:57] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp2014.codfw.wmnet,service=varnish-be [13:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:39] paravoid: there are/were many things at play here, that i do not (yet) have an answer for, but impact in codfw isn't something i expected [13:44:03] since everything is green there [13:44:06] does that mean that you confirm there is impact in codfw but you don't know why yet? [13:44:06] urandom: restbase1018 has all the cassandra instances failed [13:44:18] elukey: yes [13:44:26] that is a raid failure [13:44:34] paravoid: well, you're telling me that it is [13:44:42] sure, but I thought only one instance was down.. Okok got it [13:44:52] now it makes a bit more sense, three total down [13:44:54] elukey: no, all 3 [13:45:06] yep yep [13:46:05] paravoid: i'm not seeing that in logstash for restbase [13:46:11] or on the restbase dashboards [13:47:07] PROBLEM - Host tools.wmflabs.org is DOWN: CRITICAL - Time to live exceeded (tools.wmflabs.org) [13:47:20] wtf [13:47:38] tools seem timing out [13:47:42] anything else?!?! [13:47:46] confirmed down. 
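Back on the RB 5xx share: the two counts pasted from oxygen above combine into the quoted ~94% figure like this, using the same filters and the same 5xx.json (run from /srv/log/webrequest):

    total=$(tail -10000 5xx.json | grep -v upload | grep -v performance | grep -v phab.wmfuser | wc -l)
    rest=$(tail -10000 5xx.json | grep -v upload | grep -v performance | grep -v phab.wmfuser | grep -c '/api/rest')
    awk -v r="$rest" -v t="$total" 'BEGIN { printf "%.1f%%\n", 100 * r / t }'   # 6688/7120 is about 93.9%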
[13:47:48] the world does not want the switchover to happen :P [13:47:57] although tools is probably not a blocker [13:47:59] looking [13:48:28] ah I see log traffic in #-labs, chasemp was doing something? [13:50:10] re: the RB 5xx, the timeframe of them is roughly 11:26 -> 13:10 if you're correlating in other logs [13:50:20] restbase1015-a seems coming up (still showing down in icinga) [13:50:27] yep sorry for the bad timing, not intentional, andrewbogott is doing thigns in nova atm and hopefully we'll be back shortly [13:50:58] paravoid: note that the message is TTL exceeded. routing problems [13:51:14] I can see labnet1001 sending back to cr2-eqiad and creating a loop [13:51:14] bblack: so it subsided 40 minutes ago? [13:51:23] I guess it's related to the nova changes [13:51:55] urandom: yes, the user-facing 5xx did [13:52:00] OK [13:52:09] back around the start of that window we see this from this channel: [13:52:12] 11:27 < icinga-wm> PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/featured/{yyyy}/{mm}/{dd} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 400 (expecting: 200) [13:52:15] <_joe_> *EVERYONE* please DO NOT merge puppet changes from now on until we say otherwise [13:52:17] 11:27 < icinga-wm> PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:52:21] 11:28 < _joe_> uhm [13:52:22] 11:28 < icinga-wm> PROBLEM - cassandra-c SSL 10.192.48.48:7001 on restbase2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [13:52:37] wtf.... [13:52:37] so first alerts were mobileapps in eqiad, then we see RB in codfw having an alert [13:53:14] <_joe_> mobileapps depends on restbase, which calls mobileapps too [13:53:29] <_joe_> a nice loop dependency, not for the faint at heart [13:53:38] urandom: what's the status? [13:53:54] urandom: the failover window starts in 7 minutes, so I have to make a go/no-go call [13:53:57] RECOVERY - Host tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 2.08 ms [13:54:33] routing loop for tools gone... [13:54:38] :-) [13:54:39] paravoid: it looks like everything has settled down [13:54:45] OK, thanks [13:54:53] that 5xxs in question, were probably because of the downed nodes there [13:54:55] ack to proceed then? [13:54:57] that correlates [13:54:59] i think so [13:55:06] ok, that's my feeling as well, thanks [13:55:19] <_joe_> paravoid: can we declare this a GO? [13:55:27] RECOVERY - Check Varnish expiry mailbox lag on cp2014 is OK: OK: expiry mailbox lag is 0 [13:56:00] _joe_: ^ above is a temp-depooled upload cache, I'm going to repool it now that it's recovered (this is routine stuff, but FYI!) 
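The conftool actions logged above (depooling cp2014's varnish backend while its mailbox lag recovered, then repooling it) map onto confctl invocations along these lines; the selector and values are straight from the log entries, the exact CLI form is assumed:

    sudo confctl select 'name=cp2014.codfw.wmnet,service=varnish-be' set/pooled=no
    # ... wait for the expiry mailbox lag to recover ...
    sudo confctl select 'name=cp2014.codfw.wmnet,service=varnish-be' get
    sudo confctl select 'name=cp2014.codfw.wmnet,service=varnish-be' set/pooled=yes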
[13:56:11] <_joe_> bblack: yes I noticed [13:56:13] ok [13:56:50] 06Operations, 10ops-eqiad, 10netops: switchover icinga.wikimedia.org from einsteinium to tegmen - https://phabricator.wikimedia.org/T163324#3193573 (10akosiaris) [13:57:05] urandom: 1015-a is coming up now [13:57:07] RECOVERY - cassandra-a SSL 10.64.48.138:7001 on restbase1015 is OK: SSL OK - Certificate restbase1015-a valid until 2017-09-12 15:34:34 +0000 (expires in 146 days) [13:57:15] elukey: yeah [13:57:23] hmm actually the cp2014 stats haven't fully recovered yet to repool it (just the one that icinga monitors) [13:57:24] I am stopping all prewarmap queries I have been doing these past days [13:57:47] elukey: that went down for too many file descriptors during the repair [13:57:47] RECOVERY - cassandra-a CQL 10.64.48.138:9042 on restbase1015 is OK: TCP OK - 0.000 second response time on 10.64.48.138 port 9042 [13:57:56] super [13:57:59] elukey: that i can deal with [13:58:46] urandom: we'll need to dig a bit more into system_auth issues (probably?), but for the moment we should be good :) [13:58:49] icinga is full of restbase/mobileapps alerts, 1/3 & 2/3 [13:59:11] rb2005/2008 + 1009-1005 + 1018 [13:59:22] gone now [13:59:35] they failed once or twice and we report it on IRC at 3 times [13:59:56] so they're gone now, but something happened there definitely [13:59:56] 06Operations, 10ops-eqiad, 10netops: switchover icinga.wikimedia.org from einsteinium to tegmen - https://phabricator.wikimedia.org/T163324#3193593 (10akosiaris) a:05Cmjohnson>03akosiaris [14:00:11] ok, repooling cp2014 (should be harmless, cache_upload -related) [14:00:18] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp2014.codfw.wmnet,service=varnish-be [14:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:33] ok [14:03:39] (03CR) 10Volans: [C: 031] cache::text: switch all mediawiki to codfw [puppet] - 10https://gerrit.wikimedia.org/r/346320 (owner: 10Giuseppe Lavagetto) [14:03:43] switchover is starting is in a couple of minutes [14:03:53] \o/ [14:03:57] <_joe_> I'm merging the mw-config patch now [14:04:14] (thanks for everything urandom and sorry for the added pressure) [14:04:16] (03CR) 10Volans: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346251 (owner: 10Giuseppe Lavagetto) [14:04:24] (03CR) 10Giuseppe Lavagetto: [C: 032] Switch master DC from eqiad to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346251 (owner: 10Giuseppe Lavagetto) [14:04:27] no worries [14:04:30] to repeat if it was unclear: [14:04:47] *from this point forward, no changes to anything from anyone unless it's part of the procedure for the switchover* [14:04:55] <_joe_> now waiting for jenkins for mw-config [14:05:16] ok, let's start the switchover procedure [14:05:21] paravoid: ^^^ [14:05:22] +1 [14:05:24] ack [14:05:26] please go ahead [14:05:29] <_joe_> godog: I think you can stop swiftrepl then [14:05:36] kk, doing [14:05:46] (03Merged) 10jenkins-bot: Switch master DC from eqiad to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346251 (owner: 10Giuseppe Lavagetto) [14:05:53] !log switchdc (volans@sarin) START TASK - switchdc.stages.t00_disable_puppet(eqiad, codfw) Disabling puppet on selected hosts [14:05:56] (03CR) 10jenkins-bot: Switch master DC from eqiad to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346251 (owner: 10Giuseppe Lavagetto) [14:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:01] !log 
switchdc (volans@sarin) END TASK - switchdc.stages.t00_disable_puppet(eqiad, codfw) Successfully completed [14:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:16] !log switchdc (volans@sarin) START TASK - switchdc.stages.t00_reduce_ttl(eqiad, codfw) Reduce the TTL of all the MediaWiki discovery records [14:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:27] !log switchdc (volans@sarin) END TASK - switchdc.stages.t00_reduce_ttl(eqiad, codfw) Successfully completed [14:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:35] !log stop swiftrepl on ms-fe1005 for codfw switchover [14:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:23] !log switchdc (volans@sarin) START TASK - switchdc.stages.t01_stop_maintenance(eqiad, codfw) Stop MediaWiki maintenance in the old master DC [14:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:13] 06Operations, 10ops-eqiad, 10netops: switchover oresrdb.svc.eqiad.wmnet from oresrdb1001 to oresrdb1002 - https://phabricator.wikimedia.org/T163326#3193640 (10akosiaris) [14:08:17] PROBLEM - Check Varnish expiry mailbox lag on cp2017 is CRITICAL: CRITICAL: expiry mailbox lag is 790797 [14:08:37] PROBLEM - HHVM jobrunner on mw1161 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.002 second response time [14:08:47] PROBLEM - Check Varnish expiry mailbox lag on cp2026 is CRITICAL: CRITICAL: expiry mailbox lag is 612417 [14:08:47] PROBLEM - HHVM jobrunner on mw1165 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [14:08:47] PROBLEM - HHVM jobrunner on mw1299 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [14:08:47] PROBLEM - HHVM jobrunner on mw1304 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.003 second response time [14:08:54] <_joe_> this is expected ^^ [14:08:57] PROBLEM - HHVM jobrunner on mw1166 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.002 second response time [14:08:57] PROBLEM - HHVM jobrunner on mw1301 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [14:08:57] PROBLEM - HHVM jobrunner on mw1164 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [14:08:58] ok, good [14:09:03] when starting the next task it will starts the read-only period [14:09:05] and the mailbox lags can be ignored for now, unrelated [14:09:13] ack [14:09:14] I see the decrese in db load as expected [14:09:17] !log switchdc (volans@sarin) END TASK - switchdc.stages.t01_stop_maintenance(eqiad, codfw) Successfully completed [14:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:37] PROBLEM - Check systemd state on mw1303 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:09:37] PROBLEM - Check systemd state on mw1304 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:09:37] PROBLEM - Check systemd state on mw1300 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:09:37] PROBLEM - Check systemd state on mw1167 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
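The "Check systemd state ... degraded" alerts above, called out as expected during this stage, only mean that at least one unit on the host is in the failed state; the usual way to see which one is:

    # List the unit(s) responsible for the 'degraded' system state.
    systemctl --failed
    # Look at the last log lines for a specific one (unit name illustrative).
    sudo journalctl -u jobrunner -n 50 --no-pager
    # After fixing, or accounting for intentionally stopped units:
    sudo systemctl reset-failed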
[14:09:37] RECOVERY - HHVM jobrunner on mw1161 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.009 second response time [14:09:38] PROBLEM - Check systemd state on mw1305 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:09:43] one process remained on terbium, manually cleaning [14:09:47] kinda expected [14:09:47] RECOVERY - HHVM jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.002 second response time [14:09:47] RECOVERY - HHVM jobrunner on mw1165 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.008 second response time [14:09:47] PROBLEM - Check systemd state on mw1164 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:09:47] RECOVERY - HHVM jobrunner on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.002 second response time [14:09:57] RECOVERY - HHVM jobrunner on mw1166 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.003 second response time [14:09:57] RECOVERY - HHVM jobrunner on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.002 second response time [14:09:57] PROBLEM - Check systemd state on mw1163 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:09:57] PROBLEM - Check systemd state on mw1299 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:09:57] RECOVERY - HHVM jobrunner on mw1164 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.003 second response time [14:10:07] PROBLEM - Check systemd state on mw1306 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:10:07] PROBLEM - Check systemd state on mw1302 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:10:07] PROBLEM - Check systemd state on mw1301 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:10:17] PROBLEM - Check systemd state on mw1162 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:10:17] PROBLEM - Check systemd state on mw1166 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:10:17] PROBLEM - Check systemd state on mw1165 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:10:18] PROBLEM - Check systemd state on mw1161 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:11:21] are the recoveries normal, too? [14:11:32] _joe_: checking all good on maintenance hosts [14:11:47] PROBLEM - Check Varnish expiry mailbox lag on cp2005 is CRITICAL: CRITICAL: expiry mailbox lag is 738446 [14:12:24] <_joe_> mediawiki-confifg patch is merged on tin [14:12:44] ok going read-only now... [14:12:49] +1! [14:12:50] ack! [14:12:53] !log switchdc (volans@sarin) START TASK - switchdc.stages.t02_start_mediawiki_readonly(eqiad, codfw) Set MediaWiki in read-only mode (db_from config already merged and git pulled) [14:12:54] !log switchdc (volans@sarin) MediaWiki read-only period starts at: 2017-04-19 14:12:54.007017 [14:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:21] <_joe_> scap is running now [14:13:39] yes, I can still edit on enwiki [14:13:50] gehel: :) [14:14:10] papaul: yep? 
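For readers following along: the scap sync that lands next pushes a read-only flag into the live MediaWiki configuration, which is what makes enwiki refuse edits a minute later. A minimal sketch of what that flag looks like, assuming the generic MediaWiki setting; the per-section key used in wmf-config/db-eqiad.php is reproduced from memory and may not match the real file exactly:

    // Simplest form: a wiki-wide read-only message (standard MediaWiki setting).
    $wgReadOnly = 'MediaWiki is temporarily in read-only mode for a datacenter switchover.';

    // wmf-config toggles it per database section instead, roughly:
    $wgLBFactoryConf['readOnlyBySection'] = [
        'DEFAULT' => 'Datacenter switchover in progress.',
        's1'      => 'Datacenter switchover in progress.',   // enwiki
    ];

Either way, this flag is what produces the "Something went wrong on edit" message reported just below once the sync completes.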
[14:14:24] the first graph here should eventually drop around zero too https://grafana.wikimedia.org/dashboard/db/edit-count?refresh=1m&orgId=1&from=now-1h&to=now [14:14:25] !log root@tin Synchronized wmf-config/db-eqiad.php: Set MediaWiki in read-only mode in datacenter eqiad (duration: 01m 29s) [14:14:25] !log switchdc (volans@sarin) END TASK - switchdc.stages.t02_start_mediawiki_readonly(eqiad, codfw) Successfully completed [14:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:39] s1 master has dropped most of its connections indeed [14:14:46] Something went wrong on edit [14:14:49] confirmed read-only from an end-user perspective on enwiki [14:14:52] the message is bad, but it work [14:14:55] *works [14:14:58] as expected [14:15:01] <_joe_> ok [14:15:01] !log switchdc (volans@sarin) START TASK - switchdc.stages.t03_coredb_masters_readonly(eqiad, codfw) set core DB masters in read-only mode [14:15:05] !log switchdc (volans@sarin) END TASK - switchdc.stages.t03_coredb_masters_readonly(eqiad, codfw) Successfully completed [14:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:07] jynus: what do you mean by bad? [14:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:12] will show later [14:15:15] ack [14:15:16] let's continue [14:15:26] <_joe_> ok [14:15:30] !log switchdc (volans@sarin) START TASK - switchdc.stages.t04_cache_wipe(eqiad, codfw) wipe and warmup caches [14:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:39] this task takes ~5 minutes [14:15:39] it is probably the fancy new source editor [14:15:48] (03PS4) 10Giuseppe Lavagetto: cache::text: switch all mediawiki to codfw [puppet] - 10https://gerrit.wikimedia.org/r/346320 [14:16:10] <_joe_> I am going to merge the puppet patches but not puppet-merge them until volans says it's a go [14:16:55] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] cache::text: switch all mediawiki to codfw [puppet] - 10https://gerrit.wikimedia.org/r/346320 (owner: 10Giuseppe Lavagetto) [14:17:01] things look nominal on db side [14:17:15] (03PS3) 10Giuseppe Lavagetto: discovery::app_routes: switch mediawiki to codfw [puppet] - 10https://gerrit.wikimedia.org/r/346321 [14:17:25] some errors on "User::loadFromDatabase" [14:17:35] but only saying "the db is in read only" [14:17:39] <_joe_> warmup script is running [14:17:46] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] discovery::app_routes: switch mediawiki to codfw [puppet] - 10https://gerrit.wikimedia.org/r/346321 (owner: 10Giuseppe Lavagetto) [14:17:51] yeah, and not many errors actually [14:18:06] people cannot register new accounts while on read only [14:18:08] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1931 bytes in 0.262 second response time [14:18:22] <_joe_> that's expected too ^^ [14:18:25] probably a workflow detail that could be done better on mediawiki [14:18:26] global warmup completed, doing the second one [14:18:32] <_joe_> first phase of warmup is ok [14:18:45] ok for me to go rw [14:18:46] edit.success has drained as well [14:18:47] RECOVERY - Check Varnish expiry mailbox lag on cp2026 is OK: OK: expiry mailbox lag is 18 [14:18:50] <_joe_> merging the puppet patches [14:18:53] ack [14:19:04] MW is still on eqiad jynus :D [14:19:11] yes, sorry [14:19:17] I meant
first the actual switch [14:19:33] warmup still running right? [14:19:37] <_joe_> yes [14:19:41] <_joe_> it will log its end [14:19:56] <_joe_> or you can see sarin:/var/log/switchdc-extended.log [14:20:02] <_joe_> for the whole log [14:20:08] I am already [14:20:15] <_joe_> or sarin:/var/log/switchdc.log for the short version [14:21:14] no error reports on phabricator [14:21:17] !log switchdc (volans@sarin) END TASK - switchdc.stages.t04_cache_wipe(eqiad, codfw) Successfully completed [14:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:24] awesome [14:21:41] !log switchdc (volans@sarin) START TASK - switchdc.stages.t05_switch_datacenter(eqiad, codfw) Switch MediaWiki configuration to the new datacenter [14:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:01] scap sync [14:22:03] <_joe_> scap is running [14:22:04] !log root@tin Synchronized wmf-config/CommonSettings.php: Switch MediaWiki active datacenter to codfw (duration: 00m 19s) [14:22:08] !log switchdc (volans@sarin) END TASK - switchdc.stages.t05_switch_datacenter(eqiad, codfw) Successfully completed [14:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:14] seeing the warmup effects on codfw load [14:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:26] looks good [14:22:38] !log switchdc (volans@sarin) START TASK - switchdc.stages.t05_switch_traffic(eqiad, codfw) Switch traffic flow to the appservers in the new datacenter [14:22:41] <_joe_> dns switched correctly [14:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:48] <_joe_> traffi is switching now [14:22:52] ack! [14:23:02] still no significant errors on logstash [14:24:00] I see an influx of 503s for api [14:24:06] tailing the oxygen logs that is [14:24:10] yes, I see that [14:24:27] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
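The "Switch MediaWiki active datacenter to codfw" sync above is conceptually a one-line change: a master-datacenter constant in wmf-config that the rest of the configuration derives from. A hedged sketch of the idea; the variable and service-map names below are from memory and should be treated as assumptions rather than the exact wmf-config contents:

    // wmf-config/CommonSettings.php (illustrative): flip the primary DC.
    $wmfMasterDatacenter = 'codfw';   // previously 'eqiad'

    // Settings that must follow the primary DC (job queue masters, lock
    // managers, main stash, etc.) are then derived from it, e.g.:
    $wmfMasterServices = $wmfAllServices[$wmfMasterDatacenter];

The t05_switch_traffic step that follows moves the user-facing side separately, by re-pointing the text caches and the discovery DNS records at the codfw appservers.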
[14:24:33] I see a bunch of DBConnectionErrors for at least 10.192.48.18 & 10.192.32.105 [14:24:37] PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [14:24:38] PROBLEM - Check health of redis instance on 6379 on mc1009 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 610558 keys, up 26 days 6 hours [14:24:45] load is growing on codfw [14:24:46] db2045 & db2066 [14:24:47] PROBLEM - Check health of redis instance on 6381 on rdb1003 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 7443987 keys, up 26 days 21 hours [14:24:49] <_joe_> redis is expected [14:24:54] hello Redis :) [14:24:57] PROBLEM - Check health of redis instance on 6379 on mc1016 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 859238 keys, up 26 days 6 hours [14:24:57] PROBLEM - Check health of redis instance on 6380 on rdb1007 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 2834420 keys, up 26 days 22 hours [14:24:57] elastic@codfw is starting to serve queries [14:25:07] PROBLEM - Check health of redis instance on 6378 on rdb1003 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 4705607 keys, up 26 days 21 hours [14:25:08] PROBLEM - Check health of redis instance on 6381 on rdb1007 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 2834473 keys, up 26 days 22 hours [14:25:09] !log switchdc (volans@sarin) END TASK - switchdc.stages.t05_switch_traffic(eqiad, codfw) Successfully completed [14:25:15] <_joe_> redis will be ok with the next task [14:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:16] big spike on codfw s1 slaves [14:25:17] PROBLEM - Check health of redis instance on 6379 on mc1001 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 686751 keys, up 26 days 6 hours [14:25:17] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:25:17] PROBLEM - Check health of redis instance on 6378 on rdb1007 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 4 keys, up 26 days 22 hours [14:25:18] recentchanges is a bit stuck, with 1 minute queries [14:25:27] PROBLEM - Check health of redis instance on 6379 on rdb1007 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 2834622 keys, up 26 days 22 hours [14:25:27] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [14:25:27] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:25:27] PROBLEM - restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:25:27] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:25:27] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:25:27] jynus, marostegui: ^ db2045 & db2066 "Can't connect to MySQL server" [14:25:31] checking [14:25:33] !log switchdc (volans@sarin) START TASK - switchdc.stages.t06_redis(eqiad, codfw) Switch the Redis replication [14:25:37] PROBLEM - Check health of redis instance on 6380 on rdb1003 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 7534539 keys, up 26 days 21 hours [14:25:37] PROBLEM - Check health of redis instance on 6379 on rdb1003 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 7535114 keys, up 26 days 21 hours [14:25:37] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:25:37] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:25:37] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:25:37] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:25:37] PROBLEM - Check health of redis instance on 6379 on mc1012 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 645461 keys, up 26 days 6 hours [14:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:38] PROBLEM - restbase endpoints health on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:25:38] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:25:39] !log switchdc (volans@sarin) END TASK - switchdc.stages.t06_redis(eqiad, codfw) Successfully completed [14:25:39] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:25:39] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:25:40] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:57] PROBLEM - restbase endpoints health on restbase-dev1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:25:57] RECOVERY - Check health of redis instance on 6379 on mc1016 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 846398 keys, up 26 days 6 hours - replication_delay is 0 [14:26:05] <_joe_> redis is done [14:26:07] PROBLEM - Check health of redis instance on 6380 on rdb1001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6380 [14:26:15] <_joe_> I'd pause before going furter [14:26:17] PROBLEM - Check health of redis instance on 6381 on rdb1001 is CRITICAL: CRITICAL: replication_delay is 1492611970 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 2833933 keys, up 26 days 21 hours - replication_delay is 1492611970 [14:26:17] RECOVERY - Check health of redis instance on 6379 on mc1001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 679123 keys, up 26 days 6 hours - replication_delay is 0 [14:26:17] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:26:17] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:26:17] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:26:18] RECOVERY - Check health of redis instance on 6378 on rdb1007 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 4 keys, up 26 days 22 hours - replication_delay is 6 [14:26:20] next step is to put DB back RW, jynus marostegui advice [14:26:26] volans: wait a sec [14:26:27] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [14:26:27] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy [14:26:27] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [14:26:27] RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy [14:26:28] yes please hold [14:26:34] until we get ack from jynus/marostegui [14:26:38] db2045 is kinda dead 1125 load [14:26:46] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [14:26:46] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [14:26:46] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [14:26:47] RECOVERY - Check health of redis instance on 6379 on mc1012 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 638909 keys, up 26 days 6 hours - replication_delay is 0 [14:26:47] RECOVERY - Check health of redis instance on 6379 on mc1009 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 604413 keys, up 26 days 6 hours - replication_delay is 0 [14:26:47] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [14:26:47] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:26:47] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
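Some context on the t06_redis step that just finished: it reverses the Redis replication direction, so the codfw instances stop replicating and become masters while their eqiad counterparts start replicating from them. A rough sketch of the underlying operations using the phpredis client, for illustration only; the hostnames are placeholders, production instances also need AUTH, and switchdc has its own implementation of this:

    <?php
    // Promote a codfw instance to master, then point its eqiad peer at it.
    $codfw = new Redis();
    $codfw->connect('rdb2001.codfw.wmnet', 6379);   // placeholder host
    $codfw->slaveof();   // no arguments: stop replicating, become a master

    $eqiad = new Redis();
    $eqiad->connect('rdb1001.eqiad.wmnet', 6379);
    $eqiad->slaveof('rdb2001.codfw.wmnet', 6379);   // replicate from the new master

The replication_delay criticals in the alerts around here clear on their own once the eqiad replicas have resynced against their new codfw masters, which is why the advice is to wait rather than react to them.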
[14:26:48] db2045 seems temporary opverload [14:26:50] seems ok now [14:27:06] RECOVERY - Check health of redis instance on 6378 on rdb1003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 4705607 keys, up 26 days 21 hours - replication_delay is 5 [14:27:06] RECOVERY - Check health of redis instance on 6380 on rdb1001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 2812198 keys, up 26 days 21 hours - replication_delay is 0 [14:27:06] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [14:27:12] load decreasing, fast, from 1000 to 750 now [14:27:13] 503s on api have subsized [14:27:16] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [14:27:16] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [14:27:16] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [14:27:16] RECOVERY - Check health of redis instance on 6381 on rdb1001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 2812985 keys, up 26 days 21 hours - replication_delay is 0 [14:27:16] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [14:27:16] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [14:27:26] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [14:27:26] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [14:27:26] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [14:27:26] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [14:27:26] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [14:27:26] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [14:27:26] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [14:27:27] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [14:27:27] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [14:27:28] RECOVERY - restbase endpoints health on restbase-dev1001 is OK: All endpoints are healthy [14:27:29] exceptions for both have died down a little bit [14:27:36] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [14:27:36] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [14:27:36] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [14:27:36] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [14:27:36] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [14:27:36] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [14:27:36] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [14:27:37] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [14:27:37] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [14:27:38] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [14:27:38] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy [14:27:39] RECOVERY - restbase endpoints health on 
restbase1010 is OK: All endpoints are healthy [14:27:43] db2066 ok, too, paravoid [14:27:44] mobileapps/restbase is known to be prone to API failures [14:27:46] <_joe_> should we proceed? [14:27:50] <_joe_> yes [14:27:51] we have a bit of overload [14:27:52] jynus: ack to proceed? [14:27:54] but I think it is ok [14:27:56] RECOVERY - restbase endpoints health on restbase-dev1003 is OK: All endpoints are healthy [14:27:56] PROBLEM - Check health of redis instance on 6381 on rdb1005 is CRITICAL: CRITICAL: replication_delay is 1492612069 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 2832154 keys, up 26 days 21 hours - replication_delay is 1492612069 [14:28:12] jynus: how sure are you? :) [14:28:14] <_joe_> elukey: can you check redis^^ [14:28:16] 99% [14:28:23] no replicaiton problems [14:28:24] <_joe_> so let's make codfw rw? [14:28:26] ok [14:28:26] PROBLEM - MD RAID on rdb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:28:30] _joe_ ack [14:28:31] ok proceeding [14:28:33] volans: go [14:28:36] PROBLEM - Check health of redis instance on 6380 on rdb1005 is CRITICAL: CRITICAL: replication_delay is 1492612112 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 2831684 keys, up 26 days 21 hours - replication_delay is 1492612112 [14:28:41] !log switchdc (volans@sarin) START TASK - switchdc.stages.t07_coredb_masters_readwrite(eqiad, codfw) set core DB masters in read-write mode [14:28:45] !log switchdc (volans@sarin) END TASK - switchdc.stages.t07_coredb_masters_readwrite(eqiad, codfw) Successfully completed [14:28:46] what's up with redis? [14:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:49] paravoid, it is getting better and better as we speak [14:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:00] paravoid, jynus ok to put MW RW? [14:29:04] paravoid: probably getting up to speed with the replication [14:29:04] <_joe_> paravoid: replication lag codfw => eqiad [14:29:07] DB are already RW [14:29:23] db2070 has also recovered from the initial spike [14:29:27] shall we hold for redis? 
[14:29:29] <_joe_> let's do mediawiki [14:29:34] <_joe_> paravoid: absolutely not [14:29:41] ok proceeding [14:29:42] heh [14:29:43] go [14:29:45] !log switchdc (volans@sarin) START TASK - switchdc.stages.t08_stop_mediawiki_readonly(eqiad, codfw) Set MediaWiki in read-write mode (db_to config already merged and git pulled) [14:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:58] (checking redis in the meantime) [14:30:03] the important part is that there is no consistency issues [14:30:04] this is also a scap [14:30:05] !log root@tin Synchronized wmf-config/db-codfw.php: Set MediaWiki in read-write mode in datacenter codfw (duration: 00m 18s) [14:30:05] !log switchdc (volans@sarin) MediaWiki read-only period ends at: 2017-04-19 14:30:05.678665 [14:30:05] !log switchdc (volans@sarin) END TASK - switchdc.stages.t08_stop_mediawiki_readonly(eqiad, codfw) Successfully completed [14:30:06] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [14:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:11] back RW [14:30:16] RECOVERY - MD RAID on rdb2005 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [14:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:16] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [14:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:37] I can edit fine [14:30:39] is it safe to edit again? [14:30:46] RECOVERY - Check health of redis instance on 6381 on rdb1003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 4614223 keys, up 26 days 21 hours - replication_delay is 8 [14:30:47] <_joe_> TBloemink: it should be [14:30:47] all servers ok [14:30:54] replication looking good - no lag [14:30:55] but overload on s1-2065 [14:30:56] RECOVERY - Check health of redis instance on 6381 on rdb1005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 2812043 keys, up 26 days 21 hours - replication_delay is 0 [14:30:57] so far the only significante public 5xx spike was from 14:22 -> 14:28 (already over) [14:30:59] *s4 [14:31:07] bblack: api, right? [14:31:10] I can edit! 
[14:31:15] most likely api, yes [14:31:17] redis replication is catching up [14:31:23] ok [14:31:29] are jobs running now- I would wait a bit [14:31:36] RECOVERY - Check health of redis instance on 6380 on rdb1003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4709569 keys, up 26 days 21 hours - replication_delay is 7 [14:31:36] RECOVERY - Check health of redis instance on 6379 on rdb1003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4708681 keys, up 26 days 21 hours - replication_delay is 1 [14:31:36] PROBLEM - cassandra-c CQL 10.64.48.140:9042 on restbase1015 is CRITICAL: connect to address 10.64.48.140 and port 9042: Connection refused [14:31:37] !log switchdc (volans@sarin) START TASK - switchdc.stages.t09_restore_ttl(eqiad, codfw) Restore the TTL of all the MediaWiki discovery records [14:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:46] PROBLEM - cassandra-c service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [14:31:48] !log switchdc (volans@sarin) END TASK - switchdc.stages.t09_restore_ttl(eqiad, codfw) Successfully completed [14:31:49] I can see writes starting to happen on grafana aggregated [14:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:56] PROBLEM - cassandra-c SSL 10.64.48.140:7001 on restbase1015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [14:31:57] ack [14:31:58] !log switchdc (volans@sarin) START TASK - switchdc.stages.t09_start_maintenance(eqiad, codfw) Start MediaWiki maintenance in the new master DC [14:31:59] * urandom sighs, looking ^^^ [14:32:00] load is going down [14:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:06] <_joe_> let's start the jobrunners in codfw [14:32:06] the restbase stuff are unrelated I'm sure [14:32:06] PROBLEM - Check systemd state on restbase1015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:32:08] so proceed [14:32:26] RECOVERY - Check health of redis instance on 6379 on rdb1007 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 2814390 keys, up 26 days 22 hours - replication_delay is 0 [14:32:41] <_joe_> paravoid: at this point the main task is what we're doing now [14:32:54] <_joe_> so restarting jobrunners and crons in codfw [14:32:58] jynus: es servers are looking fine so far :) [14:33:01] <_joe_> it will take a few minutes more [14:33:04] yup [14:33:06] marostegui, yes [14:33:06] RECOVERY - Check health of redis instance on 6380 on rdb1007 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 2814055 keys, up 26 days 22 hours - replication_delay is 0 [14:33:10] !log switchdc (volans@sarin) END TASK - switchdc.stages.t09_start_maintenance(eqiad, codfw) Successfully completed [14:33:16] RECOVERY - Check health of redis instance on 6381 on rdb1007 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 7894 keys, up 26 days 22 hours - replication_delay is 0 [14:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:18] !log switchdc (volans@sarin) START TASK - switchdc.stages.t09_tendril(eqiad, codfw) Update Tendril configuration for the new masters [14:33:20] but we may need to rebalance some main servers [14:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:33] !log switchdc (volans@sarin) END TASK - switchdc.stages.t09_tendril(eqiad, codfw) Successfully completed [14:33:36] RECOVERY - Check health of redis instance on 6380 on rdb1005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 2811148 keys, up 26 days 21 hours - replication_delay is 0 [14:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:46] RECOVERY - cassandra-c service on restbase1015 is OK: OK - cassandra-c is active [14:33:47] RECOVERY - Check health of redis instance on 6379 on rdb1005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 6344 keys, up 26 days 21 hours - replication_delay is 0 [14:33:49] consistency looks good, no lag [14:34:04] yep fatalmonitor mostly shows errors/timeouts for rdb2005 [14:34:06] RECOVERY - Check systemd state on restbase1015 is OK: OK - running: The system is fully operational [14:34:11] yeah, small spikes but nothing worrying or unusual  \o/ [14:34:17] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [14:34:19] es load and some server is hich [14:34:25] *high [14:34:26] also something I'm puzzled about, 34 Failed connecting to redis server at 10.192.0.34: Connection timed out in /srv/mediawiki/php-1.29.0-wmf.20/incl [14:34:32] *server's [14:34:32] udes/libs/redis/RedisConnectionPool.php on line 235 [14:34:41] that is scb2005's address [14:34:46] eqiad latency for elasticsearch is not an issue [14:34:51] The writes traffic has almost reached leves prior the switchover already [14:34:51] marostegui, so this time we did better on es* but worse on main traffic [14:34:56] <_joe_> godog: that is an error in some config [14:34:59] PURGE traffic has surged back in, I assume from jobrunner [14:35:03] <_joe_> godog: can you check mediawiki-config? 
[14:35:05] <_joe_> bblack: yes [14:35:25] _joe_: yep I'll check that first [14:35:30] I confirm I can edit and recentchanges looks ok, at least on enwiki [14:35:51] <_joe_> godog: yes, there is an error [14:35:54] <_joe_> godog: fixing it [14:36:00] sigh, again? [14:36:13] The Redis replication is still ongoing and it will take a bit for some instances (known issue :( [14:36:18] <_joe_> paravoid: apparently some things have changed and nobody checked/updated [14:36:36] some api queries are slow [14:36:50] indeed it is the redis lockmanager, wrong IPs [14:36:55] holding on the parsoid restart for a bit (as already agreed with joe), is the last step [14:36:57] RECOVERY - cassandra-c SSL 10.64.48.140:7001 on restbase1015 is OK: SSL OK - Certificate restbase1015-c valid until 2017-09-12 15:34:41 +0000 (expires in 146 days) [14:37:16] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100% [14:37:19] . [14:37:36] RECOVERY - cassandra-c CQL 10.64.48.140:9042 on restbase1015 is OK: TCP OK - 0.000 second response time on 10.64.48.140 port 9042 [14:37:42] mmm checking mw2256 [14:37:45] <_joe_> ok so I know what happened [14:37:53] <_joe_> fixing [14:37:55] <_joe_> shit [14:38:03] what? [14:38:04] Ok, FYI, Commons is getting edits, but we cannot delete files. [14:38:07] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:38:16] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1875 bytes in 0.255 second response time [14:38:18] Revent: what is the error you're getting? [14:38:25] A database query error has occurred. This may indicate a bug in the software.[WPd1@QrAID0AAGIq6fkAAABF] 2017-04-19 14:36:51: Fatal exception of type "DBTransactionError" [14:38:32] new rule: noone says "shit" without immediately explaining why [14:38:33] _joe_: LMK how I can help [14:38:48] Revent, let me see that error [14:38:56] jynus: thanks [14:39:06] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:39:10] I bet it is related to redis lockmanager pointing to the wrong ips in codfw [14:39:11] Revent: joe is working on the fix, is related to redis [14:39:12] elukey: I think mw2256 has died.. console doesn't spew out anything meaningful [14:39:15] kk [14:39:16] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:39:29] Revent: thanks! [14:39:34] akosiaris: ah ok I was about to ask who was holding the console :D [14:39:37] (03PS1) 10Giuseppe Lavagetto: Fix lock redis addresses after refresh of servers in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348944 [14:39:40] error levels are still high on User:loadFromDatabase [14:39:41] <_joe_> Revent: ^^ [14:39:49] elukey: I 'll powercycle [14:40:01] <_joe_> elukey: can you check that patch? [14:40:06] akosiaris: that host had a faulty bank of RAM that we replaced [14:40:08] <_joe_> it's mc2019-21 [14:40:12] 06Operations, 10ops-eqiad, 10netops: Spread eqiad analytics Kafka nodes to multiple racks ans rows - https://phabricator.wikimedia.org/T163002#3193747 (10ayounsi) >if possible to migrate kafka1022 I believe you mean kafka1020 > any issue from the network capabilities perspective to move a kafka node in row... 
[14:40:14] _joe_: sure [14:40:17] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:40:46] silver/labswiki tries to connect to db1053 fwiw [14:41:16] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [14:41:17] rdb2005 is gone from fatalmonitor afaics [14:41:18] _joe_ +1 [14:41:33] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix lock redis addresses after refresh of servers in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348944 (owner: 10Giuseppe Lavagetto) [14:41:36] PROBLEM - Host elastic2020 is DOWN: PING CRITICAL - Packet loss = 100% [14:41:38] !log powercycle mw2256 [14:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:45] elastic2020 lol [14:41:47] :) [14:41:47] gehel: ^^^ [14:41:48] (03CR) 10jenkins-bot: Fix lock redis addresses after refresh of servers in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348944 (owner: 10Giuseppe Lavagetto) [14:41:51] the saga continues [14:41:51] Uploading on Commons probably not works too? [14:42:30] I see rows being written to eqiad right now? [14:42:35] wargo: Last upload was at 14:14, 2017 April 19 [14:42:42] maybe it is just replication [14:42:46] jynus: does that count replication? [14:42:48] yeah, that [14:42:54] I can, however, block people. :P [14:42:56] volans: yep, elastic2020 is not fixed... cc papaul [14:43:03] 06Operations, 10ops-ulsfo, 10fundraising-tech-ops, 13Patch-For-Review: rack/setup frbackup2001 - https://phabricator.wikimedia.org/T162469#3193765 (10Jgreen) >>! In T162469#3172320, @Papaul wrote: > @Jgreen Complete let me know if you have any questions. @Papaul the management interface isn't accessible,... [14:43:05] elastic2020 is powered off according to ILO [14:43:17] gehel: told you, replace the CPUs :p [14:43:29] gehel: so just a lucky out of downtime? [14:43:34] joe can we merge the mw_primary change on puppet? [14:43:36] ah, T149006 [14:43:36] T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006 [14:43:43] jynus: was alrady merged [14:43:45] mw2256 issues garbled text at the console.. looks like baud rate misconfiguration [14:43:45] <_joe_> jynus: it's merged AFAIK [14:43:47] ah, nice [14:43:49] thanks [14:43:51] <_joe_> Revent: can you try to delete a file now? [14:43:57] And Commons uploads are working, sec [14:43:59] as Riccardo suggested (I haven't realized) the issue with Redis Lock managers is my fault, I replaced mc2* in codfw without grepping their IPs in mw-config [14:44:08] <_joe_> akosiaris: mw2256 - if it comes up, please sync it [14:44:09] sorry people :( [14:44:15] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3193773 (10Gehel) elastic2020 crashed again after DC switch. Back to investigations... [14:44:15] at least it happily continues issuing garbled text... that's something [14:44:16] Commons uploads look fine. 
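To spell out the root cause just admitted above: the codfw redis hosts backing the MediaWiki lock manager were replaced during a hardware refresh (mc2019-mc2021 are named earlier as the relevant trio), but the host list in wmf-config/ProductionServices.php kept the old IP addresses, so anything that needed a lock — file deletions on Commons, for instance — failed with DBTransactionError. A sketch of the kind of stanza the "Fix lock redis addresses" patch corrects; the array key and addresses are illustrative, not the real file:

    // Illustrative only — not the actual ProductionServices.php contents.
    // The lock manager must list the *current* codfw redis hosts; after the
    // refresh the old, now-unreachable IPs were still configured here.
    $wmfAllServices['codfw']['redis_lock'] = [
        'rl1' => '10.192.x.x',   // mc2019.codfw.wmnet (placeholder address)
        'rl2' => '10.192.x.x',   // mc2020.codfw.wmnet (placeholder address)
        'rl3' => '10.192.x.x',   // mc2021.codfw.wmnet (placeholder address)
    ];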
[14:44:17] _joe_: ok [14:44:18] elukey: I was just an ambassador ;) [14:44:18] _joe_: Success [14:44:26] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [14:44:26] PROBLEM - Check Varnish expiry mailbox lag on cp2014 is CRITICAL: CRITICAL: expiry mailbox lag is 729775 [14:44:32] <_joe_> elukey: I already assigned blame where it was due [14:44:34] mw2256 is back up [14:44:38] <_joe_> Revent: \o/ [14:44:41] alright [14:44:43] thanks _joe_ [14:44:48] and godog for noticing that :) [14:44:50] !log oblivian@tin Synchronized wmf-config/ProductionServices.php: Fix redis locks (duration: 02m 24s) [14:44:52] good catch! [14:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:02] ok, let's proceed with the last step? [14:45:07] np :)) [14:45:16] <_joe_> paravoid: what last step? :P [14:45:22] top fatalmonitor error now is 52 unable to connect to unix:///var/run/nutcracker/redis_codfw.sock [2]: No such file or directory [14:45:22] parsoid restart [14:45:22] parsoid? [14:45:24] <_joe_> parsoid restart? [14:45:25] <_joe_> ok [14:45:36] <_joe_> godog: uhm [14:45:38] <_joe_> that's bad [14:45:41] <_joe_> elukey: ^^ [14:45:51] it is, but let's do parsoid anyway [14:45:59] weird [14:45:59] as is unrelated [14:46:00] I'll take a look as well [14:46:12] !log switchdc (volans@sarin) START TASK - switchdc.stages.t09_restart_parsoid(eqiad, codfw) Rolling restart parsoid in eqiad and codfw [14:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:23] paravoid: proceeding [14:46:31] thanks volans [14:46:40] jynus: I am a bit concerned about API slaves still (see db2062 or db2069) [14:46:51] it's a rolling one with 15s sleep between every restart so it takes a while [14:46:58] !log banning elastic2020 from codfw cluster - T149006 [14:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:06] ^ just in case... [14:47:08] jynus: https://grafana.wikimedia.org/dashboard/file/server-board.json?orgId=1&var-server=db2062&var-network=eth0 [14:47:15] <_joe_> godog: it's properly configured eveywhere AFAICT [14:47:19] marostegui there is not much we can do except kill long running queries faster [14:47:29] apis query non-hot rows [14:47:34] _joe_: yeah I don't see the error anymore in fatalmonitor, probably transient [14:47:36] godog: is all mw2256 related? (nutcracker [14:47:40] and they create long running queries even on eqiad [14:47:43] <_joe_> yes, I think it is [14:47:47] disk is full [14:47:54] on db2062 [14:48:01] ugh [14:48:05] what? [14:48:07] <_joe_> ow [14:48:08] oh no [14:48:12] no [14:48:16] not full [14:48:21] /dev/mapper/tank-data 3.3T 1.3T 2.1T 38% /srv [14:48:25] 100% "utilized" means iops I guess [14:48:28] yes [14:48:30] we have a 1TB altert [14:48:32] :-) [14:48:35] almost [14:48:42] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2256.codfw.wmnet,service=apache2 [14:48:43] yeah 100% in iops [14:48:47] how is it getting a percentage of io utilization? 
[14:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:50] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2256.codfw.wmnet,service=nginx [14:48:53] sda 0.00 2.00 2980.60 830.40 47689.60 79764.10 66.89 44.04 11.57 14.77 0.07 0.26 100.00 [14:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:03] this is "iostat -k -x 5" [14:49:06] I can start killing bad queries [14:49:21] ok, makes sense [14:49:22] maybe it is missing a good query killer [14:49:38] jynus: db2062 at 59.05 loadavg [14:49:45] bblack: iirc time spent servicing io requests over the interval [14:49:56] volans: yeah, it has been hanging around that all the time [14:50:12] "SELECT /* AFComputedVariable::{closure} */ rev_user_text FROM `revision`" [14:50:22] maybe it has a different schema? [14:50:30] let me see [14:50:43] it does [14:50:54] it is partitioned but it lacks the rev_id links [14:51:01] is it only 62? [14:51:06] why is it partitioned? it is not an rc slave [14:51:25] it is api, true [14:51:32] volans: parsoid restart surely is slow [14:51:33] db2069 the other api isn't partitioned [14:51:53] db2062 revision table looks good to me [14:51:58] with all the correct indexes and PK [14:52:12] batch_size 1 is probably a bit too paranoid [14:52:13] <_joe_> paravoid: it is on purpose [14:52:18] mw2256 synced I 'll repool it [14:52:23] paravoid: are ~40 hosts, 15s sleep one by one, so expect a bit more than 10m [14:52:24] <_joe_> paravoid: that's the way they do it for deploys [14:52:38] db2062 looks the same as db1072 in eqiad regarding revision table [14:52:44] <_joe_> paravoid: this is absolutely not critical, could be done tomorrow FWIW [14:53:12] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw2256.codfw.wmnet,service=nginx [14:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:19] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw2256.codfw.wmnet,service=apache2 [14:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:39] why do we restart parsoid again? [14:53:43] marostegui, do you see the problem, then, I cannot [14:53:49] <_joe_> paravoid: http vs https [14:53:53] Hey folks, can someone add a link to https://wikitech.wikimedia.org/wiki/Switch_Datacenter to the topic? [14:53:59] took me a bit of searching to find it :) [14:54:03] <_joe_> the jobqueue has just re-enqueued 5 million jobs [14:54:14] _joe_ memcached looks good - https://grafana.wikimedia.org/dashboard/db/prometheus-memcached-dc-stats [14:54:25] _joe_, that is the main issue, I think [14:54:33] <_joe_> but the good news is it's being processed [14:54:35] for parsoid, I can increse the batch size and restart it [14:54:41] load goes from 0, not to normal, but to higher than normal [14:54:52] <_joe_> jynus: jobqueue has something to do with API dbs? [14:54:53] mobrovac: is already half way through the restart [14:54:54] no need [14:55:00] kk [14:55:15] content translation errors are high [14:55:15] what's next? TTL? [14:55:22] paravoid: no, all done [14:55:26] user::loadfrom database errors are high [14:55:26] ah [14:55:37] Stale template error files present for '/var/lib/gdnsd/discovery-api-rw.state' etc. 
errors [14:55:45] parsoid was left as the last because long and not critical [14:55:45] <_joe_> paravoid: thats expected [14:55:47] on baham/eeden/radon [14:55:51] <_joe_> we did a "dirty" switch [14:56:09] <_joe_> it's one of the things I used to verify it was working, I'm removing those stale files now [14:56:22] ok [14:56:27] what about puppet on mw1xxx? [14:56:32] still disabled afaik [14:56:45] checking [14:56:48] <_joe_> uhm, it shouldn't be, maybe I forgot to add it to switchdc [14:56:59] <_joe_> volans: we found 1 bug :P [14:57:03] yeah, jobrunners [14:57:06] the inverse of step 1 [14:57:08] and possibly videoscalers? [14:57:11] heh [14:57:21] also systemd degraded, for a similar reason [14:57:37] it's the jobcrons.. the moment puppet runs those should be fixed [14:57:38] well, I actually asked for it to be hold down for some time [14:57:47] <_joe_> akosiaris: yes [14:57:51] because so much work ongoing [14:58:03] hold what? [14:58:11] _joe_: what are you talking about with removing stale files? [14:58:27] <_joe_> bblack: confd, whenever it finds an invalid config [14:58:28] jobs and scalers and other async jobs [14:58:37] <_joe_> saves a file in /var/run/confd-template [14:58:45] <_joe_> so that you know when it last failed [14:58:51] ok [14:58:52] there is lot of contention on x1 for content translation [14:58:56] got it [14:59:01] lots of blocks there [14:59:35] confirmed ContentTranslation\TranslationStorageManager::{closure} in fatalmonitor slow queries [14:59:53] other than that, there is general slow stuff [14:59:58] ok, let's triage [15:00:04] what are the issues everyone is aware of right now? [15:00:17] api slaves being overloaded [15:01:03] marostegui, it is getting better [15:01:34] is it? db2062 still looks quite unhappy: https://grafana.wikimedia.org/dashboard/file/server-board.json?orgId=1&var-server=db2062&var-network=eth0&refresh=1m [15:02:05] paravoid: the cx slowness above and missing step 1 rollback afaik [15:02:14] https://etherpad.wikimedia.org/p/codfw-switchover-AprMay2017 please [15:02:15] we can pool another server as api, as we have now on eqiad [15:02:26] <_joe_> paravoid: redis and mc seem to be ok, the jobqueue is big but recovering, cronjobs are running [15:03:05] I am checking one of the bad queries and at least it has the same query plan on either db1072 and db2062 so that is "good" [15:03:15] replication seems good on rdb100[1358] as far as I can see [15:03:42] please all report the issues that you see right now at https://etherpad.wikimedia.org/p/codfw-switchover-AprMay2017 [15:03:51] <_joe_> so the jobqueue might need some investigating, the number is indeed huge [15:04:01] jynus: pooling another server for api isn't a bad idea, we do have 3x160G on eqiad and only 2 on codfw right now [15:04:22] the mw object cache avg hit ratio is ~0.8, good considering that it is only relying on cache warmup [15:04:42] !log switchdc (volans@sarin) END TASK - switchdc.stages.t09_restart_parsoid(eqiad, codfw) Successfully completed [15:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:55] marostegui, we can also increase the thread_pool_size [15:05:07] but it would make things worse, I think [15:05:13] all tasks of switchdc completed [15:05:21] yes, I wouldn't touch that one [15:05:31] marostegui: there is the manual DNS patch for master aliases [15:05:40] 06Operations, 06Labs: kube-proxy pulls in docker and starts service even when it isnt needed - https://phabricator.wikimedia.org/T163336#3193909 (10chasemp) [15:05:45] 
jynus: we can also decrease the main traffic weight for both api servers [15:05:58] volans: yes, if you wanna go ahead and merge it, feel free [15:06:21] I can do it later if you want, no worries :) [15:06:26] 06Operations, 06Labs: kube-proxy pulls in docker and starts service even when it isnt needed - https://phabricator.wikimedia.org/T163336#3193924 (10chasemp) p:05Triage>03Normal [15:06:57] marostegui: let's do it in a bit, no hurry [15:07:05] volans: oki! [15:07:09] godog: swiftrepl [15:07:17] volans: doing [15:07:23] (03PS1) 10Jcrespo: Pool db2055 as a new API enwiki node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348947 [15:07:30] marostegui, ^ [15:07:44] (03CR) 10Marostegui: [C: 031] Pool db2055 as a new API enwiki node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348947 (owner: 10Jcrespo) [15:07:59] !log start swiftrepl on ms-fe1005 for codfw switchover [15:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:09] akosiaris: could you test phase10 that email works? [15:08:12] (03PS2) 10Jcrespo: Pool db2055 as a new API enwiki node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348947 [15:08:21] I am going to merge mediawiki-config [15:08:25] volans: ? [15:08:31] phase10 ? [15:08:31] jynus: +1 [15:08:37] akosiaris: https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Phase_10_-_verification_and_troubleshooting [15:08:44] ah yes sorry [15:08:56] ok so [15:09:08] jynus: is working on API slaves being overloaded [15:09:15] what's to be done for the x1 contention, is that better? [15:09:27] that cannot be done infrastructure wise [15:09:31] it is the code doing it [15:09:35] RECOVERY - Check systemd state on mw1161 is OK: OK - running: The system is fully operational [15:09:39] why it is beind done now, I don't know [15:09:50] maybe lots of translations got enqueued? 
[15:09:55] RECOVERY - Check systemd state on mw1163 is OK: OK - running: The system is fully operational [15:09:55] RECOVERY - Check systemd state on mw1299 is OK: OK - running: The system is fully operational [15:09:55] RECOVERY - Check systemd state on mw1304 is OK: OK - running: The system is fully operational [15:10:06] RECOVERY - Check systemd state on mw1302 is OK: OK - running: The system is fully operational [15:10:06] RECOVERY - Check systemd state on mw1306 is OK: OK - running: The system is fully operational [15:10:06] RECOVERY - Check systemd state on mw1301 is OK: OK - running: The system is fully operational [15:10:06] in any case, I think that is seconday because it only affects itself [15:10:16] RECOVERY - Check systemd state on mw1162 is OK: OK - running: The system is fully operational [15:10:16] RECOVERY - Check systemd state on mw1166 is OK: OK - running: The system is fully operational [15:10:16] RECOVERY - Check systemd state on mw1165 is OK: OK - running: The system is fully operational [15:10:16] RECOVERY - Check systemd state on mw1164 is OK: OK - running: The system is fully operational [15:10:18] unlike api affecting everthing [15:10:24] even non api stuff [15:10:30] (03CR) 10Jcrespo: [C: 032] Pool db2055 as a new API enwiki node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348947 (owner: 10Jcrespo) [15:10:35] volans: checked and confirmed [15:10:35] RECOVERY - Check systemd state on mw1167 is OK: OK - running: The system is fully operational [15:10:35] RECOVERY - Check systemd state on mw1303 is OK: OK - running: The system is fully operational [15:10:35] RECOVERY - Check systemd state on mw1300 is OK: OK - running: The system is fully operational [15:10:35] RECOVERY - Check systemd state on mw1305 is OK: OK - running: The system is fully operational [15:10:43] akosiaris: thanks [15:10:44] akosiaris: mww2256 console works fine for me [15:11:06] mutante: you see text ? [15:11:08] <_joe_> !log ran cumin 'R:class = role::mediawiki::jobrunner and *.eqiad.wmnet' 'systemctl reset-failed' manually [15:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:14] I would only see garbage during bootup [15:11:16] akosiaris: yes, i see the normal login [15:11:19] weird [15:11:21] nothing weird from https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?orgId=1&var-datasource=codfw [15:11:36] bblack, ema: could you take care of warnings on cp1008 for the stale gdnsd files? [15:11:57] volans, _joe_ which is the recommended deploymeny server, mira? [15:12:03] <_joe_> jynus: tin [15:12:07] thanks [15:12:14] I see the banner on mira [15:12:16] :-) [15:12:16] <_joe_> jynus: it will be switched over later [15:12:16] and then will be naos [15:12:27] thaniks [15:12:34] poor mira [15:13:02] so who can help debug the contenttranslation issue? [15:13:07] anyone from the language team around? [15:13:20] Nikerabbit: ^ [15:13:22] Nikerabbit: ? 
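For reference, the "Pool db2055 as a new API enwiki node" change being merged next is a load-balancer weight edit: the host is added to the s1 'api' query group in wmf-config/db-codfw.php so that API reads spread across three replicas instead of two. A sketch of the shape of that stanza; the key name is from memory and the weights are invented:

    // Rough shape of the s1 (enwiki) API group in wmf-config/db-codfw.php.
    $wgLBFactoryConf['groupLoadsBySection']['s1']['api'] = [
        'db2062' => 200,   // existing API replica, currently overloaded
        'db2069' => 200,   // existing API replica, currently overloaded
        'db2055' => 200,   // newly pooled to absorb part of the API load
    ];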
[15:13:22] (03Merged) 10jenkins-bot: Pool db2055 as a new API enwiki node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348947 (owner: 10Jcrespo) [15:13:23] santhosh: ^^^ [15:13:34] (03CR) 10jenkins-bot: Pool db2055 as a new API enwiki node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348947 (owner: 10Jcrespo) [15:13:52] paravoid: I am checking to see if there is anything mysql level that can be done [15:13:56] paravoid, basically, i see lots of SELECT FOR UPDATE, meaning it is the app blocking itself [15:13:59] thanks marostegui [15:14:02] that didn't happen before [15:14:12] yes, it is full of table lock :_( [15:15:21] _joe_: what do you want to do about the jobqueue's queue? [15:16:22] !log jynus@tin Synchronized wmf-config/db-codfw.php: Pool db2055 as an additional API server (duration: 01m 02s) [15:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:58] <_joe_> paravoid: for now, nothing; I am looking into it and we're processing well, the wait times are not through the roof [15:17:13] <_joe_> so I would be inclined not to do anything for now besides monitoring it [15:17:20] +1 [15:17:35] I'm here [15:17:51] Nikerabbit: hi! [15:17:53] <_joe_> I suspect a lot of those jobs will be reaped, but I'll basically monitor it for now [15:18:04] how can I help? [15:18:06] <_joe_> paravoid: apart from that, I guess aaron can take a look [15:18:27] Nikerabbit: TL;DR, we just performed the codfw switchover, cx database queries are taking table locks and there's contention resulting into slowness [15:18:28] load is starting to go down on db2069 [15:18:36] same on db2062 [15:18:52] \o/ thanks guys [15:19:33] paravoid, apparently this confirms that the larger servers we bought are really needed :-) [15:19:43] heh [15:20:00] sadly we didn't have as much room as we thought, it increased a lot in a year [15:20:24] probably because of architecture changes (restbase, mobileapps etc.) [15:20:38] Nikerabbit: < jynus> paravoid, basically, i see lots of SELECT FOR UPDATE, meaning it is the app blocking itself [15:20:46] paravoid: okay... I assume we should look into those queries to avoid that happening in the future? [15:21:04] it's unclear why this started happening now, after the switch to codfw [15:21:14] so investigating that would be interesting [15:21:16] Nikerabbit, there is lots of traffic happening on translation tables [15:21:26] so much I am not sure it is working [15:21:31] er, s/interesting/helpful/ [15:21:44] it is not a huge issue because I do not think it affects other functionality [15:21:55] paravoid, jynus: we previously had issues that lot of pending requests could arrive to the server at once, I wonder if during the read-only similar kind of thing happened [15:21:58] but I would not exect such high traffic there [15:22:08] if it is only that, no big deal [15:22:13] but it would be nice to confirm it [15:22:38] do you have access on how to get the right logs: logstash and tendril? [15:23:03] I have access to logstash, not familiar how to use tendril [15:23:20] isn't SELECT FOR UPDATE frowned upon in general? 
[15:23:41] paravoid, not really, assuming the query needs it and it only takes a few milliseconds [15:23:42] (03CR) 10Alexandros Kosiaris: [C: 031] dnsrec/icinga: add child/parent rel between monitor hosts [puppet] - 10https://gerrit.wikimedia.org/r/347984 (owner: 10Dzahn) [15:23:46] db2069 and db2062 have decreased their load but not completely back to a normal state [15:23:51] locking is inevitable [15:24:06] now the point is why it is not taking milliseconds :-) [15:24:49] hmm [15:24:51] Nikerabbit, https://tendril.wikimedia.org/activity?wikiadmin=0&research=0&labsusers=0 [15:24:58] ignore the ongoing dump user [15:25:09] and check the long running queries on cx* tables [15:25:11] Aaron worked on these queries, there are some long comments about locking I don't claim to fully understand [15:25:38] some are running for 53 minutes already, maybe I should just kill them manually? [15:26:08] is there any issue if the write is killed? [15:26:09] jynus: I don't see any in that link [15:26:19] in terms of inconsistency or something that might be created? [15:26:38] Nikerabbit: look for db2033 [15:26:38] marostegui: the UI will retry if saving fails (assuming the page is still open in the client) [15:26:48] Nikerabbit: ok [15:26:56] it probably isn't, nobody is there for 53 minutes waiting [15:28:37] the one causing it is a SELECT /* ContentTranslation\TranslationStorageManager::{closure}..FOR UPDATE [15:28:38] there are a lot of repeats in cxc_translation_id = XXX, which means the UI has already retried [15:28:48] blocking all writes there [15:29:10] several of them, actually [15:29:34] if we can kill those, we can unblock them, but we must be sure not to retry them or the problem will return [15:30:30] there are 2361 open connections, I would suggest to kill them all [15:31:21] any request that has been running for a few minutes has already timed out to the browser, I believe, so killing those FOR UPDATE queries would not do much harm [15:31:33] I will kill them then [15:31:55] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100% [15:31:55] other than that, unrelated, I see some watchlist operations being slow [15:32:27] but not sure we can do much about that, normally it happens for users with lots of articles on the watchlist [15:32:28] !log mw2256 went down and showed " PANIC: double fault, error_code: 0x0" [15:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:39] even on eqiad [15:32:53] 06Operations, 10ops-codfw: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194065 (10RobH) [15:33:20] https://github.com/wikimedia/mediawiki-extensions-ContentTranslation/blob/master/includes/TranslationStorageManager.php#L81-L119 [15:33:37] halfak: https://phabricator.wikimedia.org/T163337 [15:33:55] marostegui, it is getting worse- queries are killed but they do not disconnect [15:33:55] I have killed them [15:34:00] I see yes [15:34:10] mutante: I think elukey said its memory was replaced before, so bad hardware -- please open (or follow-up to) a task [15:34:27] paravoid: ok [15:34:29] I'm sure there must be a better way to do saves in CX... after all there is no problem with regular page editing which has much higher frequency [15:34:34] PYBAL CRITICAL - api-https_443 - Could not depool server mw2140.codfw.wmnet because of too many down!: appservers-https_443 - Could not depool server mw2254.codfw.wmnet because of too many down! [15:34:38] wait what?
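A concrete illustration of the clean-up being discussed above — finding the long-running ContentTranslation SELECT ... FOR UPDATE statements on the x1 host named in the conversation (db2033) and killing them. This is a sketch, not what was actually run; the credentials are placeholders, the 300-second threshold is arbitrary, and port 3307 is the admin port mentioned a little later in the log:

    <?php
    // List and kill FOR UPDATE queries from ContentTranslation that have
    // been running for more than five minutes.
    $db = new mysqli('db2033.codfw.wmnet', 'admin', 'secret', '', 3307);

    $res = $db->query(
        "SELECT id, time, LEFT(info, 80) AS q
           FROM information_schema.processlist
          WHERE command = 'Query'
            AND time > 300
            AND info LIKE '%TranslationStorageManager%FOR UPDATE%'"
    );

    while ($row = $res->fetch_assoc()) {
        printf("killing %d (%ds): %s\n", $row['id'], $row['time'], $row['q']);
        $db->query('KILL ' . (int) $row['id']);   // kills the whole connection;
                                                  // KILL QUERY <id> would abort only the statement
    }

As noted just below, killed queries can still take a while to disconnect when the server is saturated, which is why the fallback being discussed is a restart or a failover to the replica.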
[15:35:12] at this point I would disable conntent translation everywhere [15:35:20] before it affects all other functionality [15:35:23] mutante: yep let me grab the task [15:35:26] agreed, the server is not recovering [15:35:45] Nikerabbit, please disable the extension all toghether [15:35:55] or tell me how to do it [15:36:02] 06Operations, 10Traffic, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#3194084 (10ema) Looks like the overwhelming majority of objects ending up in bin0 (thus smaller than 16k) is made of items smaller than 1676 bytes. I've ran the... [15:36:02] server went to too many connections already [15:36:11] yes, it is creating an outage now [15:36:30] PROBLEM - MariaDB Slave IO: x1 on db2033 is CRITICAL: CRITICAL slave_io_state could not connect [15:36:35] not only for translation, but for flow and the other servers [15:36:38] I will silence that [15:36:39] wmgUseContentTranslation -> wikipedia => false [15:36:40] PROBLEM - MariaDB Slave SQL: x1 on db2033 is CRITICAL: CRITICAL slave_sql_state could not connect [15:36:41] mutante: https://phabricator.wikimedia.org/T155180#3114820 [15:36:55] we thought it was ok :( [15:36:58] in InitialiseSettings.php [15:37:14] 06Operations, 10ops-codfw: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194090 (10RobH) Since these should be moved while the systems are under load, there is an inherent risk involved. Please do not move a system's power plugs without coordinat... [15:37:23] paravoid: was that pybal critical recent? I'm not seeing it in icinga [15:37:23] * AaronSchulz sees "Error connecting to 10.192.32.4: Too many connections" spam [15:37:35] AaronSchulz: we are on it, see a few lines above [15:37:39] 06Operations, 10ops-codfw, 13Patch-For-Review, 15User-Elukey: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#3194091 (10Dzahn) 05Resolved>03Open [15:37:47] so who is disabling wmgUseContentTranslation? 
[15:38:04] I'm not, but changing that line does it [15:38:08] (03PS1) 10Jcrespo: Disable cx_translation- it is causing an outage on x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348951 [15:38:11] Nikerabbit, ^ [15:38:34] (03CR) 10Faidon Liambotis: [C: 031] Disable cx_translation- it is causing an outage on x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348951 (owner: 10Jcrespo) [15:38:41] (03CR) 10Nikerabbit: [C: 032] Disable cx_translation- it is causing an outage on x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348951 (owner: 10Jcrespo) [15:38:46] (03CR) 10Jcrespo: [V: 032 C: 032] Disable cx_translation- it is causing an outage on x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348951 (owner: 10Jcrespo) [15:38:59] (03CR) 10jenkins-bot: Disable cx_translation- it is causing an outage on x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348951 (owner: 10Jcrespo) [15:39:05] 06Operations, 10ops-codfw, 13Patch-For-Review, 15User-Elukey: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#2936454 (10Dzahn) mw2256 died again [15:39:31] If the connections aren't freed up we might need to kill -9 the server :( [15:39:36] ugh [15:40:04] marostegui, we can failover to the slave [15:40:14] it should not create new locking [15:40:15] !log dzahn@puppetmaster2001 conftool action : set/pooled=no; selector: name=mw2256.codfw.wmnet [15:40:20] godog: it disappeared but definitely appeared for a moment [15:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:39] jynus: but remember the new slave's hardware isn't great [15:40:44] yeah [15:40:44] let's wait what happens once it is deployed [15:40:57] one mw host is down, right? [15:41:03] yes, mw2256 [15:41:04] sync-apaches: 99% (ok: 300; fail: 0; left: 1) [15:41:04] 06Operations, 10ops-codfw, 13Patch-For-Review, 15User-Elukey: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#3194119 (10Papaul) @Dzahn anything error log? [15:41:43] db2033 is not getting any better [15:41:52] marostegui, jynus: if I can help let me know [15:42:17] jynus: I would do a kill -9 and let the server recover, it shouldn't take too long [15:42:36] !log jynus@tin Synchronized wmf-config/InitialiseSettings.php: Disable cx_translation- it is causing an outage on x1 (duration: 02m 44s) [15:42:39] wait, the deploy has not finished yet [15:42:40] now [15:42:41] heh, so I was looking at corrupted memory while rebooting mw2256 ? nice [15:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:17] marostegui, port 3307 is reserved for admin login [15:43:20] not saturatedf [15:43:25] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] [15:43:48] but queries do not get killed yet [15:43:54] :( [15:44:21] let's do a graceful restart [15:44:35] we'll see if it is able to stop [15:44:36] can you deploy the slave [15:44:40] with more traffic? [15:44:42] yes [15:44:45] quickly [15:44:50] doing it now [15:45:01] would this issue affect Echo as well? 
[15:45:05] yes [15:45:48] (03PS1) 10Marostegui: db-codfw.php: Give x1 slave more weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348952 [15:45:53] jynus: ^ [15:45:55] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms [15:46:02] no, no [15:46:06] just give it 0 to the master [15:46:10] ok [15:46:28] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic2020.codfw.wmnet [15:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:36] mutante: I suppose mw2256 coming back up is you ? [15:46:38] (03PS2) 10Marostegui: db-codfw.php: Give x1 slave more weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348952 [15:46:52] jynus: ^ [15:46:58] akosiaris: yes, it is. i depooled and rebooted it to look at logs [15:46:58] (03CR) 10Jcrespo: [V: 032 C: 032] db-codfw.php: Give x1 slave more weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348952 (owner: 10Marostegui) [15:47:05] ok [15:47:10] (03CR) 10jenkins-bot: db-codfw.php: Give x1 slave more weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348952 (owner: 10Marostegui) [15:47:43] deploying [15:47:46] ah [15:47:51] I was aboput to hit enter :) [15:47:56] keep an eye on the slave [15:48:00] while I reboot the master [15:48:18] !log jynus@tin Synchronized wmf-config/db-codfw.php: Failing over x1-master (duration: 00m 41s) [15:48:18] thanks guys, much <3 [15:48:20] you going to attempt graceful one? [15:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:25] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [15:48:33] Nikerabbit: in the meantime, did you find why this happened in the first place? [15:48:58] !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=codfw,cluster=appserver,name=mw2256.* [15:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:05] !log shutting down db2033 (x1-master) [15:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:15] PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [100.0] [15:49:39] taking a look at that ^ [15:49:39] it is ongoing [15:49:50] we may get a lot of errors, but not more than we used to have [15:50:15] RECOVERY - carbon-frontend-relay metric drops on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [15:50:49] marostegui, how is the slave taking the master going away? [15:50:51] the slave is having no issues, so that is good [15:51:13] paravoid: my best theory is that the queries are somewhat inefficient by nature, and sudden burst due piling up during the switch just made it explode. But I'm just guessing [15:51:17] I can see logs Server 10.192.48.14 (#1) is not replicating? [15:51:20] mysql process is still up [15:51:22] I guess expected [15:51:26] yep [15:51:27] yeah, i doubt it is going to stop :( [15:51:36] give it a few minutes [15:51:36] !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=codfw,cluster=elasticsearch,name=elastic2020.* [15:51:38] Nikerabbit: ok, so next steps? 
[15:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:45] PROBLEM - MariaDB Slave IO: x1 on db1031 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2033.codfw.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on db2033.codfw.wmnet (111 Connection refused) [15:51:46] also All replica DBs lagged. Switch to read-only mode, expected too [15:51:47] dbstores take 20 minutes [15:51:55] Nikerabbit: probably needs a task, and will certainly need an incident report, could you spearhead that? [15:52:07] i have silenced tempdb2001 replication alerts [15:52:48] paravoid: mitigate the issue (your side), I can lead root cause investigation that will produce incident report [15:53:20] 06Operations, 10ops-codfw, 13Patch-For-Review, 15User-Elukey: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#3194141 (10Dzahn) @Papaul just the kernel panic `PANIC: double fault, error_code: 0x0` and this during boot: ``` 5939 Apr 19 14:44:03 mw2256 kernel: [ 28.623132] ACPI... [15:53:22] Nikerabbit: thanks! [15:54:06] shall we leave the https://gerrit.wikimedia.org/r/348951 like that for now Nikerabbit? [15:54:09] mysql stil l up [15:54:25] Nikerabbit: We had, https://wikitech.wikimedia.org/wiki/Incident_documentation/20160713-ContentTranslation - anything related to that this time? [15:54:29] so I am going to kill it, let it recover [15:54:35] marostegui: I think safest is to leave it disabled for now [15:54:36] yeah, agreed [15:54:43] Nikerabbit: cool, thanks [15:54:56] jynus: +1 to kill it [15:55:09] given TZ and everything, we will see if we can perhaps start enabling it incrementally (or fully if the root cause is figured out and fixed) [15:55:23] !log test [15:55:24] the slave is looking good, so those are good news [15:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:42] I mean, we will see *tomorrow* [15:56:33] deployment server switchover to *naos* (not mira) is happening in 5min [15:57:08] jynus: that was fast :) [15:57:22] mysql going up, I think? [15:57:25] godog: maybe jynus would need to deploy mediawiki-config soon, I would probably hold a bit [15:57:27] it is up [15:57:30] RECOVERY - MariaDB Slave IO: x1 on db2033 is OK: OK slave_io_state Slave_IO_Running: Yes [15:57:40] RECOVERY - MariaDB Slave SQL: x1 on db2033 is OK: OK slave_sql_state Slave_SQL_Running: Yes [15:57:40] or get confirmation is ok ;) [15:57:45] RECOVERY - MariaDB Slave IO: x1 on db1031 is OK: OK slave_io_state Slave_IO_Running: Yes [15:57:52] volans: good point, I'll coordinate with him [15:58:00] the question is if it will happen again now [15:58:08] and if it will happen again when enabled [15:58:10] jynus: it shouldn't, as it is disabled, right? [15:58:21] jynus: Nikerabbit said we can leave it off for now [15:58:26] I think we left testwiki and all on now [15:58:46] is heartbeat ok? [15:58:57] slave, etc [15:59:03] yep [15:59:06] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Scap: Remove git_server from scap.cfg [software/prometheus_jmx_exporter] - 10https://gerrit.wikimedia.org/r/347924 (https://phabricator.wikimedia.org/T162814) (owner: 10Thcipriani) [15:59:11] shall we revert the weight back to the master? [15:59:19] can the slave hold the load? [15:59:23] it was able yes [15:59:26] but with that feature stopped [15:59:27] how many connections running? 
[15:59:30] around 40 [15:59:37] oh running [15:59:53] connected around 40, running less than 5 [16:00:07] I started a task for addressing the CX outage https://phabricator.wikimedia.org/T163344 [16:00:14] heartbeat seems ok [16:00:23] not sure how to handle load [16:00:37] if it happens again (pileups, it is better if the slave has most of the load) [16:01:12] Let's give the master just a bit of weight and wait for T163344 to be addressed before enabling it back? [16:01:13] T163344: Do a root-cause analysis on CX outage during dc switch and get it back online - https://phabricator.wikimedia.org/T163344 [16:01:17] does that sound reasonable? [16:01:31] can someone check flow and echo working ok, and what is happening to translation? [16:01:45] (03PS1) 10Andrew Bogott: dynamicproxy: When rotating logs, HUP the nginx process. [puppet] - 10https://gerrit.wikimedia.org/r/348954 [16:02:57] elukey: has rdb2005 still issues? [16:03:01] <_joe_> jynus: it works on officewiki [16:03:09] I can still see errors to connect to it [16:03:13] on different ports [16:03:22] <_joe_> volans: where? [16:03:29] logstash [16:03:29] !log disabling deprecation warning logs on elasticsearch codfw - T163345 [16:03:31] _joe_, officewiki is the one that doesn't use x1 :-) [16:03:34] <_joe_> volans: that's normal [16:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:37] with meta, mediawiki and others [16:03:38] T163345: Fix deprecation warning seen in elastic logs - https://phabricator.wikimedia.org/T163345 [16:03:40] <_joe_> jynus: heh, sorry [16:03:40] :-) [16:03:45] volans: it was working the last time that I checked, why? [16:03:50] no, my question was bad [16:03:52] _joe_: only for rdb2005? [16:04:05] (03PS1) 10Chad: Remove lib/ from install, should've been part of prior commit [debs/gerrit] - 10https://gerrit.wikimedia.org/r/348955 [16:04:06] <_joe_> volans: it might be more overloaded than others, yes [16:04:30] it would be nice to have echo and flow devels around, to check it was only a glitch for them [16:04:46] jynus or marostegui: the ideal system to move to rebalance power in a1 is db2089 in on yz moving it on xy [16:04:47] then I would not enable translation until the issue can be analyzed [16:04:53] _joe_: ok, for reference: https://logstash.wikimedia.org/goto/0a1984bc17d2765be571c548e9b7d007 [16:05:04] jynus: will you need the deployment server soon? IOW ok to start the deployment server switchover? [16:05:08] its not installed, nm [16:05:10] godog, not for now [16:05:12] ignore my last comment =P [16:05:22] robh: will get back to you later :) [16:05:23] robh, if you are talking new servers, not now [16:05:47] jynus: sorry I had two questions in there with different answers, you won't need it soon? [16:05:54] talking about rebalancing power but it can wait a bit. [16:06:00] jynus: yes, let's leave translation disabled as Nikerabbit said too [16:06:01] godog, deploy [16:06:09] we are on standby now [16:06:16] ack [16:06:29] (03PS3) 10Filippo Giunchedi: Switch deployment CNAMEs to naos.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/348060 [16:06:40] it would be nice to announce the translation tool disabling [16:06:46] that is user-impacting [16:06:50] godog: not discovery hostnames instead? [16:07:24] paravoid, can issues be translated through communication ? [16:07:29] *transmitted [16:07:32] ? 
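For the record, a small sketch of the checks behind the answers above ("connected around 40, running less than 5" and "is heartbeat ok?"), assuming the standard server status counters and the usual pt-heartbeat table layout:

```
-- Connection counters behind "connected around 40, running less than 5".
SHOW GLOBAL STATUS LIKE 'Threads_connected';
SHOW GLOBAL STATUS LIKE 'Threads_running';
SHOW GLOBAL STATUS LIKE 'Max_used_connections';

-- Replication freshness via pt-heartbeat, assuming the heartbeat rows
-- land in heartbeat.heartbeat as usual; compare ts against the current
-- UTC time to estimate lag.
SELECT ts, server_id
FROM heartbeat.heartbeat
ORDER BY ts DESC
LIMIT 1;
```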
[16:07:38] paravoid: not this time but it is a good point, adding to the etherpad [16:07:42] the disabling of translation extension [16:07:49] jynus: I'll handle that [16:07:50] it would be nice to communicate it [16:08:10] I'll sent a post-switchover email soon [16:08:16] "translation extension is temporarily disabled until performance issues are solved" [16:08:20] ^that is the summary [16:08:21] but I was waiting for things to settle down a little bit first [16:08:39] (03CR) 10Filippo Giunchedi: [C: 032] Switch deployment CNAMEs to naos.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/348060 (owner: 10Filippo Giunchedi) [16:08:40] let me go back to enwiki api extra load [16:08:45] So, going back to the API overloaded slaves, they are better, but still not as the load level of the eqiad ones [16:08:54] jynus good timing :p [16:09:04] I still see db2062 at 100% I/O [16:09:15] same for db2069, they have peaks [16:09:17] !log test [16:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:32] That user just spammed -labs [16:09:36] yeah I saw [16:09:58] we put up a notice on talk:cx on mediawiki.org, but yeah some notice that reaches more people makes sense [16:10:03] ok [16:10:30] Nikerabbit, the earlier you can look at those queries and assess how to avoid the issue, the earlier we can reenable [16:10:39] 06Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3194210 (10Dzahn) [16:10:49] Nikerabbit, even if that means some compromise like enablign it only on selected wikis [16:11:01] or saying "it will not happen now" [16:11:15] I will move to other issues [16:11:16] (03PS2) 10Filippo Giunchedi: Switch deployment server to naos.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/348927 [16:11:39] 06Operations, 10ops-codfw, 13Patch-For-Review, 15User-Elukey: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#2936454 (10Dzahn) 05Open>03Resolved closing this again to handle mw2256 in a subtask (please continue on T163346) [16:12:01] marostegui, lots of copying to tmp table [16:12:09] yes [16:12:09] is it a difference in query plan? [16:12:22] i am chceking and in general all the codfw api servers look more loaded than the eqiad ones [16:12:29] yes [16:12:30] hey quick question -- do you have start/stop timestamps for the x1 outage? [16:12:31] only the enwiki ones are 100% disk io [16:12:35] (03PS1) 10Urbanecm: Change the timezone of West Bengal Wikimedians user group wiki to Asia/Kolkata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348956 [16:12:36] (03CR) 10Filippo Giunchedi: [C: 032] Switch deployment server to naos.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/348927 (owner: 10Filippo Giunchedi) [16:12:53] I can try to deduce them from SAL etc. but you probably be more accurate than me [16:12:56] (03CR) 10Alexandros Kosiaris: [C: 031] "Looks fine to me, feel free to merge" [puppet] - 10https://gerrit.wikimedia.org/r/348941 (owner: 10Ayounsi) [16:13:16] paravoid, 36 to 57 [16:13:31] 15:36-15:57 UTC? 
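A hedged sketch of the plan comparison being discussed: run the same EXPLAIN on an eqiad replica (e.g. db1072) and its codfw counterpart (e.g. db2062) and diff the output. The statement below is only a placeholder; the real slow API query is not quoted in the log.

```
-- Placeholder standing in for the slow API query; the columns exist in
-- the MediaWiki revision table, but the real query is more involved.
EXPLAIN
SELECT rev_id, rev_timestamp
FROM revision
WHERE rev_page = 12345
ORDER BY rev_timestamp DESC
LIMIT 50;
```

The interesting columns to compare between the two hosts are key (the index chosen), rows (estimated rows examined) and Extra; "Using temporary" there corresponds to the "copying to tmp table" state showing up in the processlist.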
[16:13:33] including the max_connections [16:13:44] k [16:13:53] https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=7&fullscreen&orgId=1&var-dc=codfw%20prometheus%2Fops&var-group=core&var-shard=x1&var-role=All&from=1492607626923&to=1492618426923 [16:13:57] !log run puppet on naos.codfw.wmnet - new deployment server [16:14:00] thanks [16:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:04] jynus: the query plan is the same on db2062 than db1072 for a slow query I caught earlier [16:14:18] so we may have more queries? [16:14:29] and just they are still warming up? [16:14:33] and/or [16:14:47] I thought about them still warming up, but it is taking quite long if it is the case [16:14:55] RECOVERY - Host elastic2020 is UP: PING OK - Packet loss = 0%, RTA = 36.25 ms [16:15:04] well, it takes a few hours normally [16:15:33] and I wouldn't be surprised if there is not normal load,but extra load due tyo the temporary interruption [16:15:48] (03PS2) 10Ayounsi: LibreNMS macro for T133852 and T80273 [puppet] - 10https://gerrit.wikimedia.org/r/348941 [16:16:27] the good thing is that it is not creating any outage, so if it is warm up still, it should be gone at some point [16:16:31] hey guys [16:16:38] let's check performance [16:16:39] I just got a report that Flow still doesn't work [16:16:57] (03CR) 10Ayounsi: [C: 032] LibreNMS macro for T133852 and T80273 [puppet] - 10https://gerrit.wikimedia.org/r/348941 (owner: 10Ayounsi) [16:16:57] do they have any error? [16:17:37] 19:15 < Trizek> Got an "Internal Server Error". [16:17:37] 19:15 < paravoid> when? now? [16:17:37] 19:16 < Trizek> I've reloaded the page and get "Service Unavailable" [16:17:42] Trizek is in this channel as well [16:17:42] it is read only [16:17:46] I think it is that [16:17:54] ah, yes, could be [16:17:57] marostegui, I am setting 233 as read write [16:18:00] is that ok? [16:18:11] 233? [16:18:12] db2033? [16:18:19] root@db2033[(none)]> SET GLOBAL read_only=0; [16:18:20] Yeah it looks like x1 is still r/o [16:18:26] heh hi RoanKattouw :) [16:18:29] [Exception DBReadOnlyError] (/srv/mediawiki/php-1.29.0-wmf.20/includes/libs/rdbms/database/Database.php:846) Database is read-only: The database master is running in read-only mode. [16:18:29] just in time :) [16:18:36] whee thanks jynus [16:18:48] marostegui, can you confirm db2033 is the master? 
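A short sketch of how the two questions above (is db2033 really the x1 master, and why is it rejecting writes) map to SQL, ending with the actual statement quoted from db2033:

```
-- Replication status: is this host itself replicating from somewhere,
-- and which replicas are attached to it?
SHOW SLAVE STATUS;
SHOW SLAVE HOSTS;

-- The flag behind the DBReadOnlyError spam: after the forced reboot the
-- server came back with this set even though it is configured as the
-- x1 master (see the puppet snippet just below).
SELECT @@global.read_only;

-- Re-enable writes (this is what was run on db2033 above).
SET GLOBAL read_only = 0;
```

The "alert when a read-write master is read-only" idea that comes up a bit further down would only need that same SELECT compared against the host's intended role.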
[16:19:07] it is [16:19:19] jynus: yeah the plan is that our team tackles this tomorrow morning european time, investigating the cause, fixing what is found, and proceeding with gradual re-enablement if it looks safe [16:19:24] OK it's working now [16:19:25] !log setting db2033 as read write [16:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:44] 06Operations, 10ops-codfw: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194243 (10RobH) [16:19:48] Nikerabbit, be extremely conservative about reenablign it [16:19:53] (03PS2) 10Volans: templates/wmnet: Switch dns master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/348440 (https://phabricator.wikimedia.org/T155099) (owner: 10Marostegui) [16:19:57] please look at tendril to detect long running queries [16:20:02] volans: thanks [16:20:03] marostegui: I've rebased it with the missing ones ^^^ [16:20:09] (03CR) 10jerkins-bot: [V: 04-1] templates/wmnet: Switch dns master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/348440 (https://phabricator.wikimedia.org/T155099) (owner: 10Marostegui) [16:20:21] marostegui, I run puppet before starting mysql [16:20:32] is our puppet code mistaken? [16:20:37] jynus: ack. any additional info, if you have any and when you have time, please link to the task [16:20:46] !log disabling deprecation warning logs on elasticsearch eqiad - T163345 [16:20:49] Nikerabbit, task #? [16:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:54] T163345: Fix deprecation warning seen in elastic logs - https://phabricator.wikimedia.org/T163345 [16:20:58] jynus: https://phabricator.wikimedia.org/T163344 [16:21:04] akosiaris: do you still want to depool puppetmaster1002 ? [16:21:26] paravoid, the time I sent you is for the database to be unresponsive, first for max connections, then for restart [16:21:29] jynus: don't think so [16:21:30] node 'db2033.codfw.wmnet' { [16:21:30] class { '::role::mariadb::core': [16:21:30] shard => 'x1', [16:21:30] master => true, [16:21:44] then why it started in ro mode? [16:21:48] marostegui, jynus: x1-slave should point to tempdb2001, the eqiad slave or the codfw master? [16:22:06] good question :) [16:22:06] volans, let's not send that patch yet [16:22:17] I'm not planning to merge it now jynus [16:22:19] 09:13:32 <@paravoid> 15:36-15:57 UTC? [16:22:35] then if you have some spare time, help us with load assesment :-) [16:22:50] sure, I've asked before too if I could be of any help [16:22:54] tell me [16:22:55] But it sounds like actual outage of x1-dependent services was 15:36-18:18 because of the read-only issue? [16:23:06] Sorry, 15:36-16:18 [16:23:10] I think so, yes [16:23:16] OK :/ [16:23:26] So that means we collected no notifications at all for like 45 minutes [16:24:13] it depends, who sends the notifications? [16:24:23] if it is the queue, they will be retried [16:24:27] I mean, MediaWiki notifications [16:24:38] 06Operations, 10ops-codfw: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194278 (10Papaul) ps1-a3-codfw moving mw2215 ps1-c6-codfw moving db2083 ps1-d6-codfw moving db2063 [16:24:45] "Someone edited your talk page", "someone mentioned you", etc [16:24:55] what else was affected? CX, Flow, Echo, ...? 
[16:25:08] Just those three AFAIK [16:25:11] (03PS3) 10Volans: templates/wmnet: Switch dns master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/348440 (https://phabricator.wikimedia.org/T155099) (owner: 10Marostegui) [16:25:19] CX and Flow being down is annoying but at least it's obvious [16:25:22] no plan to merge it, just for later and need x1 review [16:25:32] RoanKattouw, you understand that not rebooting would have caused worse issues? [16:25:44] Echo being down is bad because editors will ping others and not know that their pings will never arrive [16:25:51] because CX has taken over all connections [16:26:09] jynus: Yeah I understand we had a broken server and had to deal with things [16:26:16] 06Operations, 10ops-codfw: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194283 (10RobH) [16:26:22] I also asked for someone to check those other services [16:26:23] mutante: what do you mean "still" ? [16:26:27] I only just walked in here, so obv I'm missing a bunch of context [16:26:35] we should depool it [16:26:43] and aside from joe, no one did [16:26:46] akosiaris: i read email about row D and "puppetmaster1002 should be depooled. Really really really easy to do, will do so after the switchover." [16:27:02] hey easy now [16:27:08] no, not today [16:27:15] but maybe tomorrow [16:27:21] got it [16:27:24] :-) [16:27:41] But when I arrived here I got the impression (which might have been wrong, please correct) that people had just forgotten to take x1 out of r/o mode for like 20 minutes, which as the person responsible for most x1-dependent things doesn't exactly make me happy [16:27:42] RoanKattouw: is there any kind of follow-up we can do about this? [16:28:04] paravoid: I would like x1 being RO to be treated more seriously in general [16:28:09] RoanKattouw: We are dealing with a few things at the same time [16:28:19] that's not what happened and it absolutely was [16:28:29] OK, good to know [16:28:30] 06Operations, 10ops-codfw: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194300 (10RobH) >>! In T163339#3194278, @Papaul wrote: > ps1-a3-codfw > moving mw2215 This is on bank XZ and move it to bank XY? > > ps1-c6-codfw > moving db2083 This is o... [16:28:36] in fact dealing with x1 took over priority of dealing with overloaded API slaves [16:29:16] aiui, the original master died because of CX and a new one had to be promoted [16:29:34] or was it that it force-rebooted and it came up as slave?
[16:29:46] anyway, now is not the time to do the postmortem as there are other ongoing issues [16:29:47] we failed over the reads to the slave [16:29:52] but we'll definitely do that [16:29:53] 06Operations, 10ops-codfw: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194320 (10Papaul) mw2215 is on zx db2083 is on zx db2063 is on xy [16:29:55] paravoid: it was forced-rebooted and it came back read only [16:30:08] Sounds good [16:30:19] Things seem up now, so no need to do anything more about x1 right now from where I stand [16:30:42] we probably need to add an alert or something that would warn us when r/w servers are r/o [16:30:52] Yes, that would be good in general [16:30:55] but again, postmortem material :) [16:31:04] I should also write some docs explaining what things rely on x1 [16:31:12] (and what bad things happen when x1 is r/o or down) [16:31:23] yeah, I think this was some insitutional knowledge that was lost/forgotten a little bit [16:31:28] I certainly knew at some point but forgot [16:31:39] RoanKattouw: But as you said as the person responsible for most x1-dependent things, it will be good to have you for the switch over back to eqiad in a couple of weeks online, just in case we need help from x1 stuff again (hopefully not) [16:31:53] indeed, that's a good point :) [16:31:55] PROBLEM - mediawiki-installation DSH group on mw2256 is CRITICAL: Host mw2256 is not in mediawiki-installation dsh group [16:32:00] I can do that, when is it scheduled for? [16:32:00] (03PS2) 10Nemo bis: Change the timezone of West Bengal Wikimedians user group wiki to Asia/Kolkata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348956 (https://phabricator.wikimedia.org/T163322) (owner: 10Urbanecm) [16:32:05] (03PS1) 10Chad: Jenkins: install jdk, not just jre [puppet] - 10https://gerrit.wikimedia.org/r/348961 [16:32:05] May 3rd [16:32:07] 14:00 UTC [16:32:08] (03CR) 10Nemo bis: [C: 031] Change the timezone of West Bengal Wikimedians user group wiki to Asia/Kolkata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348956 (https://phabricator.wikimedia.org/T163322) (owner: 10Urbanecm) [16:32:23] it's on the deployment calendar, email to engineering@, wikitech-l@, wmfall@ etc. [16:32:25] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3194345 (10Gehel) looking at `/var/log/kern.log` and `/var/log/syslog` nothing is logged at the time of the crash. [16:32:36] Right, sorry, I could have looked that up myself easily, sorry for my lazines [16:32:40] I'll put it in my calendar [16:33:03] no that's ok, it's just that it may change for whatever reason [16:33:06] it's unlikely but it may happen [16:33:16] so I'll keep those mediums up-to-date [16:33:23] !log deploy.fixurl on G@deployment_target:* after deployment server switchover [16:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:40] 06Operations, 10ops-eqiad: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3194348 (10Cmjohnson) The ripe atlas plug on the backside of the server fell out, Even when I plug it back it the connection seems very loose. Not sure if all the vibration from the rack and/or heat has caused the... [16:35:58] 06Operations, 10ops-codfw: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194354 (10RobH) >>! 
In T163339#3194320, @Papaul wrote: > mw2215 is on zx This one should work. Don't move until we confirm with opsen for that service group. > db2083 is o... [16:36:00] RoanKattouw: it's unfortunate that CX caused outages affects Echo and Flow. We clearly need to work more to avoid that happening again. [16:36:34] Nikerabbit: can you start the incident page so that RoanKattouw/jynus/marostegui can start documenting some of the findings/actionables? [16:36:35] Nikerabbit: Would you like me to help investigate the CX issue? I'm suspicious of the auto-retry feature in particulra [16:37:32] RoanKattouw: help is welcome [16:37:38] 06Operations, 10ops-eqiad: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3194357 (10faidon) It's still down as of now, so are you sure it's plugged now? (and yes, please work with @RobH to buy the cable :) [16:37:44] paravoid: sure, looking how to do it [16:38:00] Nikerabbit: there's a form @ https://wikitech.wikimedia.org/wiki/Incident_documentation [16:38:29] yep [16:39:53] 06Operations, 10ops-eqiad: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3194365 (10RobH) We should have a lot of spare power cables for these, as they are simple c13/c14 power cables, correct? If the cable won't seat firmly, and different cables have the same issue, I've solved it at p... [16:40:44] BTW it sounds like the cause of T163337 has been tracked down [16:40:45] T163337: Watchlist entries duplicated several times - https://phabricator.wikimedia.org/T163337 [16:41:38] Uniqueness constraints that were present in eqiad were not present in codfw [16:41:58] it doesn't sound like they were present in eqiad either? [16:42:11] okay, very empty incident page started: https://wikitech.wikimedia.org/wiki/Incident_documentation/20170419-ContentTranslation [16:42:15] Can I run maintenance script in mira? [16:42:20] or is there another node? [16:42:32] Amir1: naos [16:42:53] Amir1: also ask godog [16:42:57] RoanKattouw: yes, I think both the re-try and the queries should be checked. This is quite similar to the previous incident report. [16:42:58] was completing the migration [16:43:05] Amir1: mira is dead from a hardware failure (purely coincidentally with the codfw switch because things have to be made difficult :P) [16:43:19] RoanKattouw: It's not just that. I'm checking database entries [16:43:24] OK [16:43:33] I spoke too soon then [16:43:33] Amir1: yep it should work already for deployments, modulo irc notifications which I'm fixing [16:44:09] Thanks [16:44:45] jynus, marostegui: should I file a task for the API slave overload or is this redundant because it will be fixed by the new batch of servers anyway? 
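A hedged sketch of how the duplication can be quantified. The ores_classification column names are assumed from the ORES extension schema rather than taken from the log, so treat them as assumptions.

```
-- Count how many (revision, model, class) combinations occur more than
-- once; each such combination is one "set of duplicate rows".
SELECT COUNT(*) AS duplicate_sets
FROM (
    SELECT oresc_rev, oresc_model, oresc_class
    FROM ores_classification
    GROUP BY oresc_rev, oresc_model, oresc_class
    HAVING COUNT(*) > 1
) AS dup;

-- If the uniqueness constraint really is missing, the guard would look
-- roughly like this (left commented out on purpose):
-- ALTER TABLE ores_classification
--   ADD UNIQUE INDEX oresc_rev_model_class (oresc_rev, oresc_model, oresc_class);
```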
[16:45:06] 06Operations, 10ops-codfw: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194384 (10Papaul) mw2215 is on zx (a3) db2043 is on xy (c6) db2061 is on xz (d6) [16:45:19] (03PS1) 10Filippo Giunchedi: tcpircbot: allow naos [puppet] - 10https://gerrit.wikimedia.org/r/348964 [16:45:26] paravoid: you can file it there if it is easier for tracking what we see or do during the investigation [16:45:42] ok :) [16:45:49] thanks :) [16:46:51] 06Operations, 10DBA, 05codfw-rollout: codfw API slaves overloaded during the 2017-04-19 codfw switch - https://phabricator.wikimedia.org/T163351#3194385 (10faidon) [16:47:17] (03CR) 10Filippo Giunchedi: [C: 032] tcpircbot: allow naos [puppet] - 10https://gerrit.wikimedia.org/r/348964 (owner: 10Filippo Giunchedi) [16:49:07] !log ladsgroup@naos:~$ mwscript extensions/ORES/maintenance/CleanDuplicateScores.php --wiki=enwiki (T163337) [16:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:16] T163337: Watchlist entries duplicated several times - https://phabricator.wikimedia.org/T163337 [16:52:54] 06Operations, 10ops-codfw: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194421 (10RobH) >>! In T163339#3194384, @Papaul wrote: > mw2215 is on zx (a3) I've just checked with @joe, you can move the power plugs for mw2215 now. Just move them slowl... [16:53:55] RECOVERY - Check Varnish expiry mailbox lag on cp2005 is OK: OK: expiry mailbox lag is 0 [16:55:06] RoanKattouw: mw.Uri not defined errors in prod on enwiki, seeing it almost every page view. I'm just getting online now and wil catch up with events but if all is good, I'll request a deploy window ASAP for that. [16:56:13] 06Operations, 10ops-codfw, 10DBA: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194442 (10RobH) >>! In T163339#3194384, @Papaul wrote: > db2043 is on xy (c6) > db2061 is on xz (d6) Both of those db hosts are slaves in the s3 and s7 shards, not... [16:56:14] Krinkle: It's my fault, see my patch in WikimediaEvents yesterday [16:56:19] thcipriani: I'm testing scap3 deployments on naos, LGTM with prometheus/jmx_exporter but please take a look too cc RainbowSprinkles twentyafterfour [16:56:31] RoanKattouw: Yeah, I know. James cherry-picked it to wmf.20 which I'll push out today if possible. [16:56:34] modulo irc notifications via tcpircbot, still looking [16:57:23] godog: I'll look at the prometheus/jmx_exporter logs on naos right quick [16:58:02] !log ladsgroup@naos:~$ mwscript extensions/ORES/maintenance/CleanDuplicateScores.php --wiki=enwiki froze [16:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:14] !log power balancing on mw2215 [16:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:17] godog: looks like aqs1008 is looking at and fetched from the right deployment server, has the latest tag in its deployment cache: looks like a success to me. 
[17:02:26] <_joe_> !log running manally enwiki refreshLinks jobs to catch up a bit [17:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:51] !log bounce tcpircbot on einsteinium to pick up changes [17:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:02] !log filippo@naos Started deploy [prometheus/jmx_exporter@7327459]: test deploy from naos [17:03:05] !log filippo@naos Finished deploy [prometheus/jmx_exporter@7327459]: test deploy from naos (duration: 00m 03s) [17:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:23] thcipriani: success indeed! [17:08:21] !log thcipriani@naos.codfw.wmnet test [17:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:37] thcipriani: I'd like to validate mw deploys are also working, the easiest is probably a dummy commit on e.g. wmf-config ? [17:08:53] 06Operations, 10ops-codfw, 10DBA: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194453 (10RobH) Ok, mw2215 has been moved, but a3 is still unhappy: X 14.5 Y 8.3, Z 9.6 So now X is quite high, while Z is back to a more normal rate. [17:09:18] godog: yeah, making a change to the README in wmf-config ought to be sufficient [17:09:32] then: scap sync-file README 'test mw deploys' [17:09:44] (03PS1) 10Dzahn: Icinga: add simple plugin to check CPU frequency [puppet] - 10https://gerrit.wikimedia.org/r/348966 (https://phabricator.wikimedia.org/T163220) [17:09:59] (03CR) 10Muehlenhoff: [C: 032] Remove lib/ from install, should've been part of prior commit [debs/gerrit] - 10https://gerrit.wikimedia.org/r/348955 (owner: 10Chad) [17:11:05] thcipriani: kk, I'll fix a trailing whitespace :P [17:11:52] godog: cool, I can deploy it and make sure all is well if you'd like. [17:12:36] (03PS1) 10Filippo Giunchedi: docroot/noc/index.html: trailing whitespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348967 [17:12:49] thcipriani: sure, ^ [17:13:25] (03CR) 10Thcipriani: [C: 032] docroot/noc/index.html: trailing whitespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348967 (owner: 10Filippo Giunchedi) [17:14:40] (03Merged) 10jenkins-bot: docroot/noc/index.html: trailing whitespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348967 (owner: 10Filippo Giunchedi) [17:15:07] 06Operations, 10ops-codfw, 10DBA: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194475 (10RobH) >>! In T163339#3194453, @RobH wrote: > Ok, mw2215 has been moved, but a3 is still unhappy: > > X 14.5 Y 8.3, Z 9.6 > > So now X is quite high, whil... [17:15:13] (03PS3) 10Dzahn: mariadb: grant user 'phstats' additional select on differential db [puppet] - 10https://gerrit.wikimedia.org/r/348565 [17:15:45] (03CR) 10jenkins-bot: docroot/noc/index.html: trailing whitespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348967 (owner: 10Filippo Giunchedi) [17:17:16] marostegui: so the MySQL grants, how does it work? you need to still run that file after a merge, right? 
I amended to https://gerrit.wikimedia.org/r/#/c/348565/3/modules/role/templates/mariadb/grants/production-m3.sql.erb [17:17:50] (to add to dbproxy1003/dbproxy1008) [17:17:55] godog: pulling over to mwdebug to ensure sanity :) [17:17:58] mutante: busy with stuff from the dc switch, will take a look later [17:18:07] mutante: but yes, basically you need to add those manually on the mysql prompt [17:18:33] marostegui: alright, thanks. it has lots of time [17:19:13] mutante: thanks :) [17:19:13] 06Operations, 10ops-codfw, 10DBA: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194503 (10Papaul) I have no opening on yz [17:22:59] 06Operations, 05codfw-rollout: Investigate issues with uploading/deleting files during 2017-04-19 switch - https://phabricator.wikimedia.org/T163354#3194516 (10faidon) [17:23:04] godog: ok, scap pull to a debug host looks good, deploying everywhere [17:23:28] thcipriani: neat, thanks! [17:23:32] blergh, that's misnamed [17:24:18] 06Operations, 05codfw-rollout: Find a way to verify mediawiki-config IPs ahead of datacenter switchovers - https://phabricator.wikimedia.org/T163354#3194533 (10faidon) [17:24:19] paravoid: Re the ORES/watchlist thing, it looks like ORES jobs from the week before the switchover might have gotten run again, causing duplicate DB rows to be inserted. Along those lines, I note an earlier comment in this channel: "[14:54:04] <_joe_> the jobqueue has just re-enqueued 5 million jobs" [17:25:04] Yes, so it won't continue to happen for ORES but we need to clean the duplicate ones [17:25:16] 06Operations, 10ops-codfw, 10DBA: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194534 (10Papaul) moving msw-a3-codfw from yz to xy [17:25:18] I take care of that [17:25:19] !log mobrovac@naos Started restart [restbase/deploy@1bfada4]: Restart to stop trying to connect to dead restbase1018 Cassandra instances - T163292 [17:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:28] T163292: Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292 [17:25:55] mobrovac: LMK if you see any issues with naos [17:26:13] kk godog, will do, just doing a RB restart [17:26:15] Amir1: "We" (whoever "we" is, probably not me) also need to figure out why the jobs were re-run. [17:26:18] 06Operations, 10DBA, 05codfw-rollout: codfw API slaves overloaded during the 2017-04-19 codfw switch - https://phabricator.wikimedia.org/T163351#3194538 (10Marostegui) Update we have found that the query plan isn't the same for all the queries: https://phabricator.wikimedia.org/P5293 We believe this is cause... 
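Since the grants question comes up above: a purely illustrative example of what adding a SELECT grant by hand looks like. The real statement lives in the production-m3 grants template; the database name and host pattern below are placeholders, not copied from it.

```
-- Illustrative only: give the existing phstats user read access to the
-- Differential database on the m3 (Phabricator) section.
GRANT SELECT ON `phabricator_differential`.* TO 'phstats'@'10.%';
FLUSH PRIVILEGES;
```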
[17:26:46] !log thcipriani@naos Synchronized docroot/noc/index.html: test scap on naos.codfw.wmnet[[gerrit:348967|docroot/noc/index.html: trailing whitespace]] (duration: 02m 02s) [17:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:02] !log mwscript extensions/ORES/maintenance/CleanDuplicateScores.php on all wikis with ORES review tool enabled (T163337) [17:27:03] <_joe_> Amir1: see my comment on the task [17:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:08] T163337: Watchlist entries duplicated several times - https://phabricator.wikimedia.org/T163337 [17:27:33] ^ godog sync'd, seemed to go fine, a little slow to sync one of the masters, but I imagine it gets faster after this [17:27:35] <_joe_> or is it just tasks inserted before the switchover? [17:28:25] RECOVERY - MegaRAID on ms-be1002 is OK: OK: optimal, 13 logical, 13 physical [17:28:34] would make sense if job-queue is getting duped for old events [17:29:14] thcipriani: ack, thanks for your help! [17:29:15] (03CR) 10ArielGlenn: "Tested a standalone python script with the needed functions in it to verify both failure and success. See comment however." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/348928 (https://phabricator.wikimedia.org/T146914) (owner: 10Hashar) [17:29:17] _joe_: I don't see duplicate DB rows for edits from after the switchover, but I don't know how to check if there are still duplicate jobs queued but not yet run. [17:29:49] <_joe_> anomie: ok thanks [17:29:51] godog: no problem at all, thanks for testing deploy stuff :) [17:30:19] <_joe_> that coincides with my feeling that jobs that wait more than 15 minutes get ran twice [17:30:35] <_joe_> I'll look into it better tomorrow [17:30:49] _joe_ anomie: In enwiki edits after the switchover, there is no duplication [17:31:00] an example: 776215769 [17:31:01] 06Operations, 10ops-eqiad, 10Cassandra, 13Patch-For-Review, 06Services (doing): Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292#3192581 (10BioPseudo) a:05Eevans>03DavidGreens [17:31:16] <_joe_> Amir1: ok [17:31:35] <_joe_> I think next time we'll be more aggressive in going RO [17:33:24] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (doing): Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292#3194617 (10mobrovac) a:05DavidGreens>03Eevans [17:34:45] response time has gone back to normal [17:35:06] https://grafana.wikimedia.org/dashboard/db/performance-metrics?refresh=5m&orgId=1&from=now-6h&to=now [17:35:39] <_joe_> Amir1: how many duplicates are we talking about? [17:36:01] I cleaned all databases except enwiki [17:37:47] All were zero except big wikis [17:37:57] _joe_: On enwiki currently, I see 347112 sets of duplicate rows. [17:38:18] wikidata = 8K, nl = 486, pl = 150 [17:38:29] <_joe_> ok [17:38:49] _joe_: there is maintenance script that cleans them. 
I'm running it [17:38:51] <_joe_> so definitely NOT something that could happen because of unacknowledged jobs [17:39:08] <_joe_> Amir1: yeah I'm trying to get a hint to the root cause [17:39:18] (03PS1) 10Marostegui: db-codfw.php: Depool db2069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348970 (https://phabricator.wikimedia.org/T163351) [17:39:56] (03PS1) 10BryanDavis: Remove bd808 from logstash-roots [puppet] - 10https://gerrit.wikimedia.org/r/348971 [17:41:45] _joe_: It freezes on enwiki (surprise) I can either wait or try to delete them using eval.php [17:41:47] 06Operations, 10ops-codfw, 10DBA: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194681 (10RobH) Ok, things shifted drastically X/Y/Z are at 9/9/14. I've suggested to @papaul we pick one mw system off xz, and one off yz, and move them both onto... [17:43:25] jynus: https://grafana.wikimedia.org/dashboard/db/save-timing?refresh=5m&orgId=1 [17:43:29] (03PS2) 10Dzahn: Icinga: add simple plugin to check CPU frequency [puppet] - 10https://gerrit.wikimedia.org/r/348966 (https://phabricator.wikimedia.org/T163220) [17:43:38] backend save time is more accurate [17:43:51] https://grafana.wikimedia.org/dashboard/db/save-timing?refresh=5m&orgId=1&from=now-6h&to=now [17:44:08] and has not recovered at p75 [17:44:21] still 2x [17:44:42] That 347112 has not changed in the 7 minutes since I last checked, which is a good sign that it's not still duplicating. [17:44:51] (03CR) 10Dzahn: [C: 032] Icinga: add simple plugin to check CPU frequency [puppet] - 10https://gerrit.wikimedia.org/r/348966 (https://phabricator.wikimedia.org/T163220) (owner: 10Dzahn) [17:44:55] Krinkle, do you know why? [17:45:23] lowest: 77ms > 85ms, p50: 200 > 300ms, p75: 400 > 700ms, p95: 1.6s > 2.2s [17:46:19] I'm not sure why. I'll look into it in an hour, brb [17:47:22] 06Operations, 10DBA, 13Patch-For-Review, 05codfw-rollout: codfw API slaves overloaded during the 2017-04-19 codfw switch - https://phabricator.wikimedia.org/T163351#3194699 (10Marostegui) The first hack hasn't worked as expected. We are thinking about just depooling the slave and run the normal analyze tab... [17:48:43] Krinkle, https://grafana.wikimedia.org/dashboard/db/edit-count?refresh=5m&orgId=1&from=now-7d&to=now [17:52:45] Urbanecm: do you really need access to all projects for your OAuth consumer? [17:53:20] tgr: No, I've forgotten to add the project to the consumer. Is it possible to do it now? [17:56:05] tgr: It is for Commons only. [17:56:22] (03CR) 10Muehlenhoff: [C: 031] Remove bd808 from logstash-roots [puppet] - 10https://gerrit.wikimedia.org/r/348971 (owner: 10BryanDavis) [17:56:28] (03PS2) 10Muehlenhoff: Remove bd808 from logstash-roots [puppet] - 10https://gerrit.wikimedia.org/r/348971 (owner: 10BryanDavis) [17:58:10] (03CR) 10Muehlenhoff: [C: 032] Remove bd808 from logstash-roots [puppet] - 10https://gerrit.wikimedia.org/r/348971 (owner: 10BryanDavis) [17:59:06] Urbanecm: not without creating a new consumer, I'm afraid [17:59:24] tgr: Okay. Can you decline it? I will create new one in a moment. [17:59:55] sure [18:02:16] tgr: I've created new request. [18:06:24] 06Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1002 - https://phabricator.wikimedia.org/T163209#3194766 (10Cmjohnson) I replaced the disk and cleared the cache but it's not coming back @fgiunchedi please take a look. 
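On the T163351 plan mentioned above: a minimal sketch of the statistics refresh being considered for a depooled API replica. The table name is a guess, since the log does not say which table's index statistics were stale.

```
-- Refresh the index statistics the optimizer bases its plan on.
ANALYZE TABLE revision;

-- Inspect the refreshed cardinality estimates afterwards, then re-check
-- the plan with EXPLAIN before repooling the host.
SHOW INDEX FROM revision;
```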
[18:07:45] (03CR) 10ArielGlenn: [C: 032] fix bug that produced badly named page range files [dumps] - 10https://gerrit.wikimedia.org/r/347182 (owner: 10ArielGlenn) [18:09:11] (03CR) 10ArielGlenn: [C: 032] extra verbosity for page ranges we will probably toss later [dumps] - 10https://gerrit.wikimedia.org/r/348268 (owner: 10ArielGlenn) [18:09:16] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [18:09:52] 06Operations, 10ops-eqiad: eqiad: Rack and set (1) fundraising syslog system (replacing: indium) - https://phabricator.wikimedia.org/T163361#3194772 (10Cmjohnson) [18:10:13] (03CR) 10ArielGlenn: [C: 032] last page range for page content job would sometimes have too many revs [dumps] - 10https://gerrit.wikimedia.org/r/347627 (owner: 10ArielGlenn) [18:10:29] 06Operations, 10ops-codfw: audit all codfw pdu tower draws - https://phabricator.wikimedia.org/T163362#3194789 (10RobH) [18:13:53] (03CR) 10ArielGlenn: [C: 032] scripts to generate a series of checkpoint files for a dump run manually [dumps] - 10https://gerrit.wikimedia.org/r/342846 (https://phabricator.wikimedia.org/T160507) (owner: 10ArielGlenn) [18:14:42] (03CR) 10ArielGlenn: [C: 032] permit the page range job shell script to run without locks if desired [dumps] - 10https://gerrit.wikimedia.org/r/348302 (owner: 10ArielGlenn) [18:14:47] 06Operations, 10ops-eqiad: rack and cable frlog1001 - https://phabricator.wikimedia.org/T163127#3194817 (10Jgreen) [18:14:49] 06Operations, 10ops-eqiad: eqiad: Rack and set (1) fundraising syslog system (replacing: indium) - https://phabricator.wikimedia.org/T163361#3194822 (10Jgreen) [18:14:55] PROBLEM - Host alnilam is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2617.08 ms [18:15:23] what is that? [18:15:27] fundraising [18:15:29] ffrack [18:15:30] It's really complex to delete those rows manually but all recent ones should be cleaned by now [18:15:33] yeah [18:15:38] it was missing the party :D [18:15:45] RECOVERY - Host alnilam is UP: PING WARNING - Packet loss = 0%, RTA = 1999.61 ms [18:15:47] Jeff_Green: we probably should revisit the paging stragety for frack servers :) [18:15:54] ah that's the page then [18:16:14] Jeff_Green: pinging all of ops for each individual frack server is probably not very useful nowadays, most of ops don't even have access [18:16:19] true [18:16:36] (is this an actual issue btw? if so, I'll shut up and follow-up elsewhere) [18:16:38] we don't have a very granular paging setup [18:16:59] is alnilam down because it is replaced by frlog1001? [18:17:05] i'm not sure why it paged yet [18:17:20] i do have an rsync going at that DC, but it's throttled and it's 30min in [18:17:33] so I'm guessing it's more pfw being lame, not sure yet [18:17:57] RTA = 2617.08 ms [18:18:11] huh. [18:18:11] that's probably it [18:18:15] PROBLEM - Host alnilam is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 3087.96 ms [18:18:18] !log ariel@naos Started deploy [dumps/dumps@101f8a4]: page range fixes and standalone scripts [18:18:21] PROBLEM - Host payments2003 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 3099.45 ms [18:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:26] PROBLEM - Host payments2001 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 3088.01 ms [18:18:37] !log ariel@naos Finished deploy [dumps/dumps@101f8a4]: page range fixes and standalone scripts (duration: 00m 18s) [18:18:38] multiple hosts ? 
hmmm [18:18:40] sigh [18:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:45] ok i stopped the rsync [18:18:45] switch? [18:18:46] jynus: I'm going to shrink ores_classification table really big (probably to its one 15th) Can you shrink it once I'm done? [18:19:24] yes, but not this week [18:19:33] !log restbase stopping RB and disabling puppet on restbase1018 due to T163292 [18:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:41] T163292: Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292 [18:20:21] RECOVERY - Host alnilam is UP: PING OK - Packet loss = 0%, RTA = 36.53 ms [18:20:24] the curious thing to me is...why is it never hosts involved in the rsync that page? [18:20:26] RECOVERY - Host payments2003 is UP: PING OK - Packet loss = 0%, RTA = 36.44 ms [18:20:29] it's always peripheral hosts [18:20:31] RECOVERY - Host payments2001 is UP: PING OK - Packet loss = 0%, RTA = 36.39 ms [18:20:32] I wanted to get rid of the old duplications at the same time (two birds, one stone) [18:20:47] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.97, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f8742204950: Failed to establish a new connection: [Errno 111] Connection refused,)) [18:20:56] yeas, I mean that you can run it [18:20:57] PROBLEM - Restbase root url on restbase1018 is CRITICAL: connect to address 10.64.48.97 and port 7231: Connection refused [18:21:08] aaaaannd there goes RB again? [18:21:11] but I may not be able to shrink it after a few weeks [18:21:31] are ops pagers actually going off for the frack stuff, or is it just that it's spammy in this channel [18:21:31] *but [18:21:43] Jeff_Green: ops pagers actually going off [18:21:44] Jeff_Green: paged [18:21:44] pagers [18:21:47] ok [18:22:31] (03PS2) 10ArielGlenn: Update instructions for fetching mwbzutils source [dumps] - 10https://gerrit.wikimedia.org/r/347907 (owner: 10Awight) [18:22:48] * bblack insert joke about RB code rounding off integers to 53-bit doubles somehow causing all the issues today, but my brain is too drained to be creative enough [18:23:05] :D [18:23:06] I'm not sure what makes sense as a strategy...having two people, both in the US, for coverage isn't stellar [18:23:26] (03CR) 10ArielGlenn: [C: 032] Update instructions for fetching mwbzutils source [dumps] - 10https://gerrit.wikimedia.org/r/347907 (owner: 10Awight) [18:23:51] yeah [18:24:04] maybe an escalation sort of thing would be ideal, where it escalates to ops if the alert isn't acked in X minutes [18:24:13] or something along those lines [18:24:23] (03PS2) 10ArielGlenn: Document quote gotcha; include new binary path [dumps] - 10https://gerrit.wikimedia.org/r/347908 (owner: 10Awight) [18:24:36] hmm. 
i wonder what icinga is capable of in terms of escalation [18:24:48] it's probably more useful to ping fr-tech software engineers than random opsens at this point [18:24:56] (03PS1) 10Dzahn: base: add icinga check for CPU frequency on Dell R320 [puppet] - 10https://gerrit.wikimedia.org/r/348976 (https://phabricator.wikimedia.org/T163220) [18:25:09] pages are actually going off, Jeff_Green [18:25:20] I'm sitting in here anyways so I don't care much but other folsk might [18:25:20] ok [18:25:20] Krinkle, I think the saving timing metrics are wrong [18:25:35] (03CR) 10ArielGlenn: [C: 032] Document quote gotcha; include new binary path [dumps] - 10https://gerrit.wikimedia.org/r/347908 (owner: 10Awight) [18:25:39] 06Operations, 10ops-codfw, 10DBA, 10netops: db20[7-9][0-9] switch ports configuration - https://phabricator.wikimedia.org/T162944#3194902 (10RobH) row a done [18:25:40] paging people that don't have access makes no sense to me, we can just ping/page other people with access :D [18:25:42] See: https://grafana.wikimedia.org/dashboard/db/navigation-timing?var-metric=saveTiming&refresh=5m&orgId=1 [18:25:57] icinga has "serviceescalation" defines, but we don't have abstractions in puppet yet [18:25:58] they are high since yesterday- traffic failover [18:26:07] https://docs.icinga.com/latest/en/escalations.html#notificationsescalated [18:26:17] which means it could be an issue with geographical measuring [18:26:55] volans: agreed, it's just that it's not that simple, since there is a small amount of overlap with Ops re. access and administration [18:26:59] !log ariel@naos Started deploy [dumps/dumps@ad621e6]: doc fixes thanks to awight [18:27:04] !log ariel@naos Finished deploy [dumps/dumps@ad621e6]: doc fixes thanks to awight (duration: 00m 04s) [18:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:36] Jeff_Green: ok [18:29:11] (03PS2) 10Dzahn: base: add icinga check for CPU frequency on Dell R320 [puppet] - 10https://gerrit.wikimedia.org/r/348976 (https://phabricator.wikimedia.org/T163220) [18:29:47] so far we do have "email notifications for non-ops contact groups" but not "sms for non-ops contact groups" [18:30:46] 06Operations, 10fundraising-tech-ops: Revisit paging strategy for frack servers - https://phabricator.wikimedia.org/T163368#3194938 (10faidon) [18:31:12] i guess, similarly, it doesn't do anyone any good to include me in Tech Ops alerts [18:31:38] most of the pages i get at odd hours are about core services, not fundraising related [18:31:48] jynus: Navigation Timing saveTiming = Frontend save timing (pressing "Save change" until time to to first byte from Post-Redirect-Get response) [18:31:57] jynus: Also shown further down the save timing dashboard [18:32:25] jynus: Backend save timing is within mediawiki PHP [18:32:39] collected and measured within PHP [18:33:07] we can change notification options (email, SMS or both) for each individual contact [18:33:33] 06Operations, 10ops-codfw, 10DBA: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194977 (10Marostegui) >>! In T163339#3194442, @RobH wrote: >>>! In T163339#3194384, @Papaul wrote: >> db2043 is on xy (c6) >> db2061 is on xz (d6) > > Both of those... 
[18:33:38] you could have 2 contacts, one that just sends mail and is in ops [18:33:49] and one that creates SMS and is just used with FR services [18:34:12] and they can also have separate timezone settings for when they notify [18:35:30] the issue they ahve is there are only two in fr-tech-ops who can respond to actual power/network issues [18:35:37] so they dont have full coverage no matter what [18:36:09] heh, there is a task for discussion =] [18:36:30] (03PS1) 10ArielGlenn: remove old dead dumps code from ariel branch [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/348981 [18:36:32] ok, that's another issue then. just pointing out it's also not solved technically yet [18:36:53] so far there is just "critical then ops" logic [18:37:15] ok, task it will be [18:37:58] PROBLEM - Check Varnish expiry mailbox lag on cp2011 is CRITICAL: CRITICAL: expiry mailbox lag is 646529 [18:38:41] (03CR) 10jerkins-bot: [V: 04-1] remove old dead dumps code from ariel branch [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/348981 (owner: 10ArielGlenn) [18:43:03] (03PS2) 10ArielGlenn: remove old dead dumps code from ariel branch [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/348981 [18:43:27] one thing we could think about would be to pick different indicators for the health of the things the respective parties can address, for example there's no sense alerting netops about ping times for individual machines when the thing they would likely address is "did the router fall over" [18:45:25] when I see an fr page, if no one is around I look for you, Jeff_Green (especially since I have no access)... so I'm not sure how much help that is :-D but granularity would be good [18:45:48] happy to consider anything that gets the right people looking at it sooner [18:46:19] sorry for the latency, I'm trying to get some cleanup done before brain checks out for good tonight [18:46:40] (03CR) 10ArielGlenn: [C: 032] remove old dead dumps code from ariel branch [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/348981 (owner: 10ArielGlenn) [18:47:47] 06Operations, 10hardware-requests: hardware request for netmon1001 replacement - https://phabricator.wikimedia.org/T156040#3195100 (10faidon) [18:49:35] (03PS2) 10ArielGlenn: updated for support up through MW 1.29 [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/347625 [18:50:08] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:55:40] apergos: yup, that's the right thing to do IMO, and you can also look for Casey now [18:56:12] +1 [18:56:47] I'm puzzled where the user and contact_group config is done for icinga [19:00:25] some of that is in the private repo, I'd have to look [19:00:55] yeah I'm there, but I don't see where 'admins' or 'sms' translate to individual users [19:01:37] so that all our contact info isn't splashed all over teh internets [19:02:47] the contactgroups are in the public repo [19:02:49] ah found it, it's in the main puppet repo yeah [19:03:51] contacts.cfg [19:03:57] ah that's what you wanted [19:04:02] yeah those two in combo get it done [19:04:03] it's not gonna be a very trivial change to just add people to a group [19:05:07] we'll need a bit more. 
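[To make the "2 contacts" idea above concrete, a minimal sketch of what such a pair could look like as Icinga 1.x contact definitions. All names, addresses, the "us-waking-hours" timeperiod and the email notification command names are made up for illustration; the real contact data lives partly in the private repo, as noted above.]

```
define contact{
        contact_name                    jgreen                  ; email-only, used in the ops groups
        email                           jgreen@example.org
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r
        host_notification_options       d,u,r
        service_notification_commands   notify-by-email
        host_notification_commands      host-notify-by-email
        }

define contact{
        contact_name                    jgreen-fr               ; SMS, only referenced by FR contactgroups
        email                           15551234567@sms.example.org   ; hypothetical SMS gateway address
        service_notification_period     us-waking-hours         ; hypothetical timeperiod, per the timezone point above
        host_notification_period        us-waking-hours
        service_notification_options    c,r                     ; criticals and recoveries only
        host_notification_options       d,r
        service_notification_commands   notify-by-sms-gateway
        host_notification_commands      host-notify-by-sms-gateway
        }
```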
the current logic is that if a service has "critical => true" then it will add the "sms" contactgroup to existing groups [19:05:10] nope, but we're sure going to want it [19:05:19] * apergos looks around for mor itz :-P [19:05:22] 06Operations, 10DBA, 13Patch-For-Review, 05codfw-rollout: codfw API slaves overloaded during the 2017-04-19 codfw switch - https://phabricator.wikimedia.org/T163351#3195165 (10jcrespo) [19:06:12] 06Operations, 10DBA, 13Patch-For-Review, 05codfw-rollout: codfw API slaves overloaded during the 2017-04-19 codfw switch - https://phabricator.wikimedia.org/T163351#3194385 (10jcrespo) This was and old friend^ we should either index hint, send it to RC or something soon- it is now failing too often. See su... [19:07:08] so if I remove 'sms' from contact_groups for frack hosts, will that have the desired impact? or will 'critical' alerts still get 'sms' auto-added? [19:07:57] (03PS3) 10Chad: Move contribution tracking config to CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342857 (https://phabricator.wikimedia.org/T147479) [19:08:37] RECOVERY - Check Varnish expiry mailbox lag on cp2017 is OK: OK: expiry mailbox lag is 2411 [19:09:50] Jeff_Green: if a service is critical => true and there is no "do_paging: false" in Hiera for it, it will add the sms group [19:10:26] ah, then I guess we need to fix that for starters [19:10:30] removing it from the frack_hosts contact_group will not change that [19:10:38] that's the part i meant above, yea [19:10:53] yeah it took me to now to wrap my head around it :-P [19:11:38] let's first describe the desired end result and then we look at how to change icinga for it [19:12:19] 06Operations, 10netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3195178 (10faidon) The latest from Juniper: ``` Faidon, I just got more information on this Case. The current PR tracking this issue is 1238906. Which is in... [19:13:20] mutante: ok. i don't really see a good option other than removing Ops people from frack alerts [19:14:01] eventually I might be able to identify specific services to monitor that are worth paging Ops about, but right now I don't see it [19:14:32] Jeff_Green: but you'd still want _somebody_ to get pages about them too, i assume [19:14:34] I think I'll also take myself out of core services monitoring and extend my hours [19:15:12] Jeff_Green: you can switch your notification method to email-only, that's easy. [19:15:42] Jeff_Green: but not so easy to remove yourself from core services while keeping SMS for other services [19:15:54] that needs fixing [19:16:00] unless you do the "2 contacts" thing [19:16:05] we're going to need multiple SMS capable groups to make this sane [19:16:08] jgreen and jgreen-fr or something [19:16:24] maybe jgreen-email vs jgreen-sms I guess, that's ok with me [19:16:45] well no, that's not enough actually [19:17:19] then you still need changes to tell icinga which hosts and services are FR [19:17:26] and use the new contactgrops [19:17:28] 06Operations, 10netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3195205 (10akosiaris) >>! In T133387#3195178, @faidon wrote: > The latest from Juniper: > ``` > Faidon, > > I just got more information on this Case. > > The... 
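[A minimal sketch of the paging logic just described: "critical => true" adds the "sms" contactgroup unless Hiera opts the check out with "do_paging: false". This is illustrative only and is not the actual modules/monitoring/manifests/service.pp; parameter names other than "critical", and the defaults shown, are assumptions.]

```puppet
define monitoring::service (
    $description,
    $check_command,
    $contact_group = 'admins',   # assumed default
    $critical      = false
) {
    # "do_paging: false" in Hiera suppresses paging even for critical checks
    $do_paging = hiera('do_paging', true)

    if $critical and $do_paging {
        # critical + pageable: append the ops paging group to whatever was given
        $real_contact_groups = "${contact_group},sms"
    } else {
        $real_contact_groups = $contact_group
    }

    # the resulting check definition is then exported and rendered into the
    # Icinga config by the config-generation tooling; omitted here
}
```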
[19:17:29] but we can do that [19:17:40] just saying this is a bit bigger than it might first seem [19:17:47] yeah [19:17:58] RECOVERY - Check Varnish expiry mailbox lag on cp2011 is OK: OK: expiry mailbox lag is 9365 [19:17:59] what is the new contactgroups? [19:18:46] one that is associated with all services and hosts that are Fundraising, has a set of members we still have to determine, and actually sends out SMS [19:18:56] while it is unrelated to other services/hosts [19:19:05] ok [19:19:21] what is the status of restbase1018? [19:19:26] still alarming on icinga [19:19:38] since 1h [19:19:57] https://phabricator.wikimedia.org/T163292 [19:20:10] ah right is the failed one, thanks mutante [19:20:14] eevand mobrovac working on it [19:20:24] !log krinkle@naos Synchronized php-1.29.0-wmf.20/extensions/WikimediaEvents/extension.json: T162604 (duration: 01m 20s) [19:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:32] T162604: Exception in module-execute in module ext.wikimediaEvents.loggedin: mw.Uri is not a constructor - https://phabricator.wikimedia.org/T162604 [19:20:45] i have ACKed alerts before but it keeps flapping, which removes the ACKs [19:20:56] then i put it into 2 days downtime afair [19:21:03] probably a downtime is better :D [19:21:06] in this case [19:21:07] i think i did [19:21:15] ah already [19:21:15] great [19:21:16] i clicked on the disable checks in icinga for it earlier [19:21:16] :D [19:21:16] and that's why you see on web ui but not here [19:21:24] apparently that did nothing [19:21:52] fyi, puppet is disabled there as well since we don't want RB to be up on that node [19:22:00] it did, mobrovac [19:22:02] !log krinkle@naos Synchronized php-1.29.0-wmf.20/resources/src/mediawiki/mediawiki.js: Ie50bdda229e48b (duration: 00m 58s) [19:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:09] it doesnt check actively anymore now [19:22:31] but not checking also means the status cant change [19:24:22] "disable notifications" would also make it silent but let it keep checking. downtime has the advantage that it expires at some point while disabling things means it's easy to forget turning it back on [19:33:23] mutante: can you explain/show me where the sms contact group gets added when there's a critical alert? [19:34:32] mutante: IMO we should have two new contact groups (fr-tech-ops-sms, fr-tech-sms) which behave like sms does now, for fundraising hosts, but I don't know how we'd do that with icinga [19:35:48] Jeff_Green: modules/monitoring/manifests/service.pp:44 and following [19:36:03] thanks, looking [19:36:57] huh, does this even come into play for nsca-only hosts? [19:38:52] if you use monitoring::service yes :D [19:40:14] akosiaris: quick question. Is ores switched to codfw? [19:40:24] Amir1: of course [19:40:48] It wasn't explicitly in https://wikitech.wikimedia.org/wiki/Switch_Datacenter I was thinking to ask [19:40:50] thanks [19:41:50] Amir1: it is. under https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Traffic_-_Services [19:41:59] along with the others in cache_misc [19:42:20] marked as active/active [19:42:31] oh, it's active/active [19:42:37] nice [19:42:44] can we stay like that? [19:42:50] * Amir1 loves twice capacity [19:43:13] heh, I detect a different definition of active/active [19:43:17] Amir1: that's not twice capacity, but same capacity in high availability :D [19:43:35] volans: ok I guess that would be no. 
so now I just need to figure out how it is that we send only critical alerts to sms [19:44:17] Amir1: for starters, for the duration of the switchover we will not be active/active. That's the point. Take EQIAD off. then after the fallback we can discuss if we want ORES to receive user requests in codfw as well [19:44:18] volans: Is there a place that I can read more? [19:44:34] it should be fine AFAIK [19:45:08] active/active for this means that requests can flow to both DCs without causing any kind of weird issues. As in both DCs can serve requests [19:46:18] Amir1, akosiaris: When Mark brought up active/active in our capex discussions, he was talking about double the capacity. [19:47:02] halfak: for request serving ? yes if we indeed end up sending user requests to codfw as well that would hold true [19:47:13] again, currently and for the next 2 weeks it is not gonna be like that [19:47:19] akosiaris: That's exactly what's in my mind and it looks like double capacity [19:47:24] Gotcha. +1 then [19:47:31] Yeah [19:48:01] Amir1: ah there is the issue of workers. there is no double capacity for that as in some (many?) scores will have to be calculated in both dcs [19:48:20] that's what I wanted to make clear [19:49:12] beyond precaching, it should be roughly double. We've had some conversations about how geographic distribution would make sure that historical score lookups that originate from the same place would hit the same datacenter and therefor the same cache. [19:49:47] yes that's true [19:50:05] but I don't have an estimate if it is gonna be double or not [19:50:15] for large projects like enwiki or dewiki I expect it will not [19:50:20] +1 [19:50:30] for smaller projects that adhere to geographical limitations it will hold true [19:52:40] to clarify my previous point, we have 2 main datacenters to be able to survive with only one, in that sense one DC should always be able to handle the whole load [19:58:52] Jeff_Green: modules/monitoring/manifests/service.pp line 39 - 53 [20:00:02] mutante: that's too far abstracted from what's actually happening in terms of icinga config for me to understand what it's doing [20:00:41] it sets the contact_groups property [20:01:10] in the host{} or service{} clause? [20:01:14] that is exported to puppetdb and later gathered from the naggen2 script to generated the icinga config, this for prod instances [20:01:31] monitoring::service for services [20:01:31] ok [20:01:56] let me check for hosts [20:02:00] frack hosts nsca collection is configured via a flat file in the private repo [20:02:16] for hosts is modules/monitoring/manifests/host.pp [20:02:20] yes I know [20:02:40] so I would just make this change in that flat file right? [20:03:04] what do you want to change exactly? [20:04:13] i want to remove contact_group sms, so Opsen don't get alerted [20:04:41] but when I do that, I think I will also stop getting paged [20:05:00] so then I want to create an equivalent fr-tech-ops-sms contact group for critical alerts [20:06:11] so to create the new contactgroup is modules/nagios_common/files/contactgroups.cfg [20:06:49] how do I make that group only receive critical alerts? [20:09:16] 06Operations, 10fundraising-tech-ops: Revisit paging strategy for frack servers - https://phabricator.wikimedia.org/T163368#3194938 (10Jgreen) My suggestion: - stop sending frack host alerts to Tech Ops pagers - make a new contact group fr-tech-ops-sms that received only critical alerts for frack hosts - stop... [20:11:20] can we do that via the ticket? 
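[On the "new contactgroup" and "only critical alerts" questions above: a sketch of what additions to modules/nagios_common/files/contactgroups.cfg could look like. Group aliases and member names are placeholders. Note that in stock Icinga the "criticals only" part is not a property of the contactgroup itself; it comes from the notification_options (e.g. c,r) on the member contacts and on the service definitions that reference the group.]

```
# hypothetical additions to modules/nagios_common/files/contactgroups.cfg
define contactgroup{
        contactgroup_name       fr-tech-ops-sms
        alias                   fundraising tech-ops pager duty
        members                 jgreen-fr               ; placeholder member list
        }

define contactgroup{
        contactgroup_name       fr-tech-sms
        alias                   fundraising tech pager duty
        members                 frtech1,frtech2         ; members still to be determined
        }
```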
i'm happy to work on it but i would like to see written down what the goals are first and then make puppet changes. I guess the point i was trying to make is it's not trivial enough to spontaneously do it on IRC since we've identified this issue before. [20:11:54] so the default admins group has only one member, irc [20:12:45] while the people have the host-notify-by-sms-gateway and the notify-by-sms-gateway [20:13:06] and the sms group has the people directly [20:13:25] mutante: noted on the ticket [20:14:26] I've gotta run for a bit, family emergency... [20:15:45] good work on the DC switchover guys [20:18:11] Jeff_Green: similar here, but i will definitely look into it later [20:18:20] also gotta run for a bit [20:35:52] sorry to bother you guys, but if anyone has a moment can anyone tell me where i could find some documentation on Jouncebot? [20:38:43] (03PS1) 10Catrope: Set ORES thresholds for enwiki ahead of RCFilters release [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349014 [20:38:58] (03PS6) 10Catrope: Enable RCFilters beta feature on all wikis except wikidatawiki, nlwiki, cswiki and etwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343439 (https://phabricator.wikimedia.org/T144458) [20:39:08] (03PS4) 10Catrope: Enable RCFilters beta feature on all remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347045 (https://phabricator.wikimedia.org/T144458) [20:39:58] Zppix: https://wikitech.wikimedia.org/wiki/Tool:Jouncebot [20:40:06] Special:Search is your friend ;-) [20:40:37] RainbowSprinkles: i meant for params and stuff for code. [20:41:11] Then clone the repo and look at it [20:42:15] RainbowSprinkles: we must be off page here, I mean like documentation like for example i can look up on mediawiki.org and see what exactly $wgUsers means [20:42:36] Well there's nothing like that for jouncebot [20:42:40] Zppix: it's a basic bot, the answer is "read the code" :) [20:42:42] You need to look at DefaultConfig.yaml [20:42:51] okay then thats what i needed to know :) [20:42:53] https://phabricator.wikimedia.org/diffusion/GJOU/browse/master/DefaultConfig.yaml [20:43:05] thanks! [20:47:59] 06Operations, 10ops-codfw, 10DBA, 10netops: db20[7-9][0-9] switch ports configuration - https://phabricator.wikimedia.org/T162944#3195612 (10RobH) row b done [20:58:44] 06Operations, 10ops-eqiad: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163280#3195661 (10Eevans) All Cassandra instances on this host have been decommissioned; It can be taken down for repair at anytime and without any coordination from #services. [20:59:47] (03PS7) 10Catrope: Enable RCFilters beta feature on all wikis except wikidatawiki, nlwiki, cswiki, etwiki and hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343439 (https://phabricator.wikimedia.org/T144458) [21:01:40] andre__, https://phabricator.wikimedia.org/p/Sarise298/ looks like some spammer? [21:01:46] randomly claiming tickets. [21:02:30] they do have some ok edits at metawiki such as https://meta.wikimedia.org/w/index.php?title=Translations:Terms_of_use/11b/sv&diff=prev&oldid=16109630 [21:05:41] if only we had a cluebot ng on phab. [21:07:37] we do, it's called Aklapper :D (joke) [21:12:14] I'm more willing to AGF on this one. [21:12:20] The underlying account dates back to 2015. [21:13:35] its still a good idea to watch the acct though [21:14:13] By all means feel free. 
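[For readers following the existing layout just described (the admins group containing only the irc contact, the sms group listing people directly, and the SMS behaviour coming from per-contact notification commands), a rough sketch; actual member names and contact details live partly in the private repo and are omitted here.]

```
define contactgroup{
        contactgroup_name       admins
        alias                   ops admins
        members                 irc                     ; only member: the IRC notification contact
        }

define contactgroup{
        contactgroup_name       sms
        alias                   ops pager duty
        members                 person1,person2         ; people who get paged, listed directly
        }

# the paging itself is configured on each person's contact entry, e.g.:
#   service_notification_commands   notify-by-sms-gateway
#   host_notification_commands      host-notify-by-sms-gateway
```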
I've already moved on [21:15:34] seems a bit like https://phabricator.wikimedia.org/p/BioPseudo/ though RainbowSprinkles, [21:16:46] I think its more suitable for this to take place in #wikimedia-devtools no? [21:17:33] PROBLEM - Check Varnish expiry mailbox lag on cp2017 is CRITICAL: CRITICAL: expiry mailbox lag is 612860 [21:20:37] [[[DatGuy]]]: I'm pretty sure it's not, outside of the "claiming a task seemingly randomly" [21:20:58] <[[[DatGuy]]]> perhaps. sorry about nick btw, inside joke in another chan :P [21:23:23] 06Operations, 05codfw-rollout: Find a way to verify mediawiki-config IPs ahead of datacenter switchovers - https://phabricator.wikimedia.org/T163354#3194516 (10tstarling) Maybe we need to verify that the configured IP addresses correspond to particular puppet roles? Service aliases in DNS could perhaps be ver... [21:24:25] 06Operations, 05codfw-rollout: Find a way to verify mediawiki-config IPs ahead of datacenter switchovers - https://phabricator.wikimedia.org/T163354#3194516 (10Zppix) Maybe instead of IPs do it via Puppet roles? [21:28:57] 06Operations, 13Patch-For-Review: deploy francium for html/zim dumps - https://phabricator.wikimedia.org/T93113#3195740 (10GWicke) 05Open>03declined Resolving on our end, as @ArielGlenn is now working on setting up dumps. See T133547 for current work. [21:35:18] 06Operations, 10ops-eqiad: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163280#3192172 (10Cmjohnson) If I recall these have special ssds in them correct? [21:37:26] !log krinkle@naos Synchronized php-1.29.0-wmf.20/resources/src/startup.js: I34bbe8edf - Fix js fatal (duration: 01m 20s) [21:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:08] 06Operations, 10ops-eqiad, 15User-fgiunchedi: upgrade memory in prometheus100[34] - https://phabricator.wikimedia.org/T163385#3195780 (10RobH) [21:46:11] 06Operations, 10ops-codfw, 15User-fgiunchedi: upgrade memory in prometheus200[34] - https://phabricator.wikimedia.org/T163386#3195795 (10RobH) [22:02:15] 06Operations, 06Labs: Update documentation for Tools Proxy failover - https://phabricator.wikimedia.org/T163390#3195874 (10chasemp) [22:02:21] 06Operations, 06Labs: Update documentation for Tools Proxy failover - https://phabricator.wikimedia.org/T163390#3195887 (10chasemp) p:05Triage>03Normal [22:05:31] 06Operations, 06Labs: Ensure kubelet is stopped on Tools Proxy hosts - https://phabricator.wikimedia.org/T163391#3195909 (10chasemp) [22:11:02] (03PS1) 10Catrope: Add b/c for ORES config format change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349108 (https://phabricator.wikimedia.org/T162760) [22:11:29] (03PS2) 10Catrope: Set ORES thresholds for enwiki ahead of RCFilters release [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349014 [22:11:32] (03PS8) 10Catrope: Enable RCFilters beta feature on all wikis except wikidatawiki, nlwiki, cswiki, etwiki and hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343439 (https://phabricator.wikimedia.org/T144458) [22:12:09] 06Operations, 06Labs: Ensure kubelet is stopped on Tools Proxy hosts - https://phabricator.wikimedia.org/T163391#3195948 (10chasemp) p:05Triage>03High [22:13:33] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/3: down - Core: asw-esams:xe-0/0/32 (Relined, SMF4303) [10Gbps DF CWDM C59 cwdm1-knams]BR [22:15:49] 06Operations, 06Labs: Update documentation for Tools Proxy failover - 
https://phabricator.wikimedia.org/T163390#3195949 (10chasemp) [22:16:19] 06Operations, 06Labs: Update documentation for Tools Proxy failover - https://phabricator.wikimedia.org/T163390#3195874 (10chasemp) [22:16:51] 06Operations, 06Labs: Update documentation for Tools Proxy failover - https://phabricator.wikimedia.org/T163390#3195874 (10chasemp) [22:18:26] (03PS1) 10Madhuvishy: tools-proxy: Ensure kubelet is stopped on tools proxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/349109 (https://phabricator.wikimedia.org/T163391) [22:22:40] 06Operations, 06Labs: Determinte appropriate proxy_read_timeout setting for Tools Proxy - https://phabricator.wikimedia.org/T163393#3195964 (10chasemp) [22:22:53] 06Operations, 06Labs: Determinte appropriate proxy_read_timeout setting for Tools Proxy - https://phabricator.wikimedia.org/T163393#3195976 (10chasemp) p:05Triage>03Normal [22:26:44] 06Operations, 06Labs: Update documentation for Tools Proxy failover - https://phabricator.wikimedia.org/T163390#3195874 (10madhuvishy) Related - https://phabricator.wikimedia.org/T143639 that documents some of this, and also has been assigned to me for a while [22:37:55] !log OS installation on db2071 [22:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:34] RECOVERY - Check Varnish expiry mailbox lag on cp2017 is OK: OK: expiry mailbox lag is 256 [22:58:58] (03PS2) 10Mattflaschen: Enable GuidedTour on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331986 (https://phabricator.wikimedia.org/T152827) (owner: 10Dereckson) [23:03:01] 06Operations, 06Labs: Update documentation for Tools Proxy failover - https://phabricator.wikimedia.org/T163390#3196059 (10chasemp) [23:03:56] 06Operations, 06Labs: Determinte appropriate proxy_read_timeout setting for Tools Proxy - https://phabricator.wikimedia.org/T163393#3195964 (10madhuvishy) Original task on the timeout increase from 10m to 1 hour - T120335 [23:06:19] 06Operations, 06Labs: Determine appropriate proxy_read_timeout setting for Tools Proxy - https://phabricator.wikimedia.org/T163393#3196073 (10madhuvishy) [23:12:38] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 06Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#3196085 (10GWicke) [23:13:03] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [23:13:21] (03PS3) 10Tim Starling: Use EtcdConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) [23:13:53] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [23:14:22] (03CR) 10Tim Starling: "PS3: don't use MultiConfig" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: 10Tim Starling) [23:15:53] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [23:16:03] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [23:18:53] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [23:20:04] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0] [23:29:04] RECOVERY - Upload HTTP 5xx reqs/min on 
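[Regarding the "Ensure kubelet is stopped on tools proxy nodes" change above: the patch itself is not quoted in this log, but the core of such a change is typically a service resource pinned to stopped/disabled. A minimal sketch under that assumption; the class name and the unit name "kubelet" are assumptions, not taken from the actual Gerrit change 349109.]

```puppet
# illustrative only -- not the contents of the real patch
class toollabs::proxy::stop_kubelet {
    # proxy nodes should not run kubelet; keep the unit stopped and
    # disabled so it does not come back on reboot
    service { 'kubelet':
        ensure => stopped,
        enable => false,
    }
}
```

[The proxy_read_timeout question in T163393 is a separate concern: that is presumably the nginx proxy_read_timeout directive on the proxy itself, and the linked T120335 is where it was previously raised from 10 minutes to an hour.]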
graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:29:34] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 06Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#3196127 (10GWicke) 05Open>03Resolved a:03GWicke Changing status to resolved, as much (but not all) of the requirements discu... [23:30:53] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:32:12] (03CR) 10Tim Starling: [C: 04-1] conftool: add mwconfig object type, define the first couple variables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/347360 (owner: 10Giuseppe Lavagetto) [23:34:12] (03PS4) 10Tim Starling: Use EtcdConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) [23:48:56] (03CR) 10Aaron Schulz: Use EtcdConfig (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: 10Tim Starling) [23:53:05] 06Operations, 06Labs: Ensure we can survive a loss of labservices1001 - https://phabricator.wikimedia.org/T163402#3196191 (10chasemp) [23:53:14] 06Operations, 06Labs: Ensure we can survive a loss of labservices1001 - https://phabricator.wikimedia.org/T163402#3196206 (10chasemp) p:05Triage>03High [23:53:37] 06Operations, 06Labs: Ensure we can survive a loss of labservices1001 - https://phabricator.wikimedia.org/T163402#3196191 (10chasemp) [23:55:15] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3196208 (10chasemp) FYI @andrew labservices1001 will be caught up in this as it lives in [[ https://racktables.wikimedia.org/index.php?page=rack&... [23:59:53] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]