[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170126T0000). [00:00:04] ebernhardson and Krenair: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:12] (03PS4) 10Dzahn: icinga/wikitech-static: add contact_group for https monitor [puppet] - 10https://gerrit.wikimedia.org/r/334220 (https://phabricator.wikimedia.org/T156294) [00:00:21] I can deploy [00:00:31] I was gonna do my own patch [00:00:46] it's not quite as straight-forward as it looks [00:01:00] yea, i took a peek and i'm not comfortable shipping that without spending more time looking :P [00:01:06] go ahead Krenair [00:01:13] "abusive" :p [00:01:14] what about ebernhardson's first? [00:01:23] * ostriches totally isn't touching anything tonight [00:01:27] I'm scared right now [00:01:27] mutante, yeah, sorry, StewardBot reference :) [00:01:41] sure i'll ship mine real quick, its very straightforward [00:01:43] :) it likes it now [00:01:54] there's certain random things it picks up on and says it doesn't like the abusive characters a message contains or whatever [00:02:08] (03CR) 10EBernhardson: [C: 032] Enable deprecation logging in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334206 (https://phabricator.wikimedia.org/T156310) (owner: 10EBernhardson) [00:03:49] gotcha [00:04:20] about moving that "are they in sync" check, the interesting part is _where_ to put it, currently it's in the role. 
of course there is no role or module used by wt-static [00:04:27] the role = openstack [00:04:34] (03CR) 10Alex Monk: [C: 031] icinga/wikitech-static: add contact_group for https monitor [puppet] - 10https://gerrit.wikimedia.org/r/334220 (https://phabricator.wikimedia.org/T156294) (owner: 10Dzahn) [00:04:41] doesn't want ./misc/files/ :) [00:04:46] and stuff [00:05:32] hmm, jenkins-bot says gate pipeline build succeeded, but it didn't merge [00:05:33] yeah... [00:05:42] sec rebasing [00:05:57] ebernhardson, gerrit says merge conflict [00:06:04] which is bizarre, usually jenkins picks up on that [00:06:16] (03PS2) 10EBernhardson: Enable deprecation logging in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334206 (https://phabricator.wikimedia.org/T156310) [00:06:36] (03PS3) 10Alex Monk: Merge filebackend-labs.php and filebackend-production.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298397 [00:06:43] (03CR) 10EBernhardson: [C: 032] Enable deprecation logging in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334206 (https://phabricator.wikimedia.org/T156310) (owner: 10EBernhardson) [00:07:02] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2970912 (10GWicke) @Tgr: That is great news. This means that we already have a means of feeding generic key-value parameters to the thumb machinery. Ar... 
[00:08:19] (03Merged) 10jenkins-bot: Enable deprecation logging in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334206 (https://phabricator.wikimedia.org/T156310) (owner: 10EBernhardson) [00:08:31] (03CR) 10jenkins-bot: Enable deprecation logging in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334206 (https://phabricator.wikimedia.org/T156310) (owner: 10EBernhardson) [00:09:33] !log sync 334206 to mwdebug1002 [00:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:21] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.259 second response time [00:11:33] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Enable deprecation logging with gerrit:334206 (duration: 00m 53s) [00:11:35] there are a lot of calls to wfDeprecated stuff :P [00:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:40] Krenair: all yours [00:12:02] thanks ebernhardson [00:12:21] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.692 second response time [00:12:41] PROBLEM - puppet last run on mw1265 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:12:54] (03PS1) 10Dzahn: switch install_server to carbon for initial rsync of APT repo data [puppet] - 10https://gerrit.wikimedia.org/r/334221 (https://phabricator.wikimedia.org/T84380) [00:14:44] (03PS4) 10Alex Monk: Merge filebackend-labs.php and filebackend-production.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298397 [00:15:11] (03CR) 10Alex Monk: [C: 032] Merge filebackend-labs.php and filebackend-production.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298397 (owner: 10Alex Monk) [00:16:50] (03Merged) 10jenkins-bot: Merge filebackend-labs.php and filebackend-production.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298397 (owner: 10Alex Monk) [00:17:51] (03CR) 10jenkins-bot: Merge filebackend-labs.php and filebackend-production.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298397 (owner: 10Alex Monk) [00:17:57] (03CR) 10Dzahn: [C: 04-1] "awww, why didn't i notice this earlier. install2001 has /dev/md2 mounted at /srv but install1001 does not. 
and both are too small for the " [puppet] - 10https://gerrit.wikimedia.org/r/334221 (https://phabricator.wikimedia.org/T84380) (owner: 10Dzahn) [00:20:01] okay [00:20:14] it seems to result in the same wgFileBackends config for prod [00:20:53] (03CR) 10Dzahn: [C: 04-1] "i need to skip or delete /srv/mirrors.off (1.4T of 1.5T) and install1001 needs to be like install2001 and get the /srv mount" [puppet] - 10https://gerrit.wikimedia.org/r/334221 (https://phabricator.wikimedia.org/T84380) (owner: 10Dzahn) [00:21:24] (03CR) 10Krinkle: [C: 04-1] Swap wmfwiki docroot to wikimediafoundation.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/333974 (owner: 10Chad) [00:21:48] why does scap give a TypeError in beta [00:22:25] hey ostriches [00:22:41] https://phabricator.wikimedia.org/P4810 [00:22:47] the rest is fine [00:23:22] That should be fixed in master and live already hmm [00:23:32] thcipriani: ^ [00:23:43] btw beta still seems to work after my patch [00:23:46] prod looks the same [00:23:51] will sync to prod debug now [00:24:59] (03CR) 10Krinkle: [C: 04-1] "And indeed, depends on not-yet-drafted commit to remove docroot/foundation/logos. right now the new docroot differs from the old one." [puppet] - 10https://gerrit.wikimedia.org/r/333974 (owner: 10Chad) [00:26:55] !log sync 298397 to mwdebug1001 [00:26:56] (03CR) 10Dzahn: [C: 032] switch install_server to carbon for initial rsync of APT repo data [puppet] - 10https://gerrit.wikimedia.org/r/334221 (https://phabricator.wikimedia.org/T84380) (owner: 10Dzahn) [00:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:12] didn't it used to call us master or something? 
[00:27:22] I blame bryan [00:29:06] so my commit should result in the exact same resulting config for prod [00:29:20] !log krenair@tin Synchronized wmf-config: https://gerrit.wikimedia.org/r/298397 (duration: 00m 43s) [00:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:30:44] er [00:30:49] so my commit should result in the exact same config for prod [00:31:03] result..resulting [00:31:10] !log krenair@tin Synchronized wmf-config: https://gerrit.wikimedia.org/r/298397 part 2 (duration: 00m 42s) [00:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:22] well [00:31:33] the logs look ok [00:32:21] couple of commons files look fine [00:32:25] (03CR) 10Volans: "Some additional comment on the tests, see inline." (039 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/288609 (https://phabricator.wikimedia.org/T155823) (owner: 10Giuseppe Lavagetto) [00:33:07] uploads are going through and look fine [00:33:29] ostriches, bd808: I think it's good [00:34:02] sweet [00:34:26] other than the weird scap thing in beta, it was fine in prod [00:34:31] and yeah stashbot is not subservient like adminbot was [00:35:32] adminbot had a ridiculously complicated config to say different things to different users [00:37:08] I remember :) [00:37:49] the "mistress" thing? [00:38:08] Mistress of the networking gear for Leslie Carr? [00:38:16] hehe, yea [00:39:28] { "Andrew": "junior", "RoanKattouw": "Mr. Obvious", "RobH": "RobH", [00:39:28] "notpeter": "notpeter", "domas": "o lord of the trolls, my master, my love. I ca [00:39:28] n't live with out you; oh please log to me some more!", "LeslieCarr": "Mistress [00:39:28] of the network gear.", "Alex": "Alex" } [00:39:45] lol [00:39:51] Krenair: Your beta bug is https://phabricator.wikimedia.org/D543 btw. [00:39:51] juuuniooooor [00:39:59] That's fixed in master, which should be on beta already... 
[00:40:23] bd808: initially that was part of 'morebots' [00:40:27] I had a patch a long time ago to change mine to "wannabe" or something :) [00:40:34] (03PS1) 10Andrew Bogott: Keystone: Assign more threads to the admin API uwsgi handler [puppet] - 10https://gerrit.wikimedia.org/r/334224 (https://phabricator.wikimedia.org/T156297) [00:40:41] RECOVERY - puppet last run on mw1265 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [00:40:42] and that was all done to have Andrew as 'junior' ;-) [00:40:57] domas: yeah. "adminbot" is just the name of deb package that it was put in [00:40:57] then others decided they can edit the file too [00:40:58] :( [00:40:59] I loved Roan's as Mr. Obvious :p [00:41:05] ebernhardson, https://gerrit.wikimedia.org/r/#/c/334223/ [00:41:11] ostriches: did I do that to him? :) [00:41:19] I don't remember [00:41:20] I didn't like having a D/s relationship with the bot [00:41:24] hence i overrode it calling me master. [00:41:24] MaxSem: Thank you for that, I hadn't gotten to it [00:42:50] 07Puppet, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Puppet failure on instance creation - https://phabricator.wikimedia.org/T156297#2971086 (10Andrew) The interesting part in that log is this: hostname: Name or service not known That means that the instance didn't get a DNS record. That see... [00:43:06] hmm... notpeter [00:43:14] (03CR) 10Chad: [C: 04-1] "Good catch inline, will amend. And yeah, still working on removing the last files...2 left that I haven't found on-wiki anywhere yet." [puppet] - 10https://gerrit.wikimedia.org/r/333974 (owner: 10Chad) [00:43:40] pyoungmeister? 
[00:44:11] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 321.04 seconds [00:47:04] (03CR) 10Andrew Bogott: [C: 032] Keystone: Assign more threads to the admin API uwsgi handler [puppet] - 10https://gerrit.wikimedia.org/r/334224 (https://phabricator.wikimedia.org/T156297) (owner: 10Andrew Bogott) [00:47:52] (03CR) 10Krinkle: "Actually, quote.png looked more like https://commons.wikimedia.org/wiki/File:Wiki-white.png given the difference in fonts." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333962 (owner: 10Chad) [00:50:31] PROBLEM - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:53:44] (03PS1) 10Andrew Bogott: Designate sink: Don't use keystone to resolve project_id [puppet] - 10https://gerrit.wikimedia.org/r/334228 (https://phabricator.wikimedia.org/T156297) [00:55:49] (03PS1) 10RobH: setting archiva.w.o to use le 1 or 2 [puppet] - 10https://gerrit.wikimedia.org/r/334229 [00:59:48] (03CR) 10Dzahn: setting archiva.w.o to use le 1 or 2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/334229 (owner: 10RobH) [01:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170126T0100). [01:00:28] 07Puppet, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Puppet failure on instance creation - https://phabricator.wikimedia.org/T156297#2971190 (10Andrew) https://gerrit.wikimedia.org/r/#/c/334224/ (add more threads) doesn't seem to help much... something must be leaking connections. I'll have to dig... 
[01:01:30] (03CR) 10Dzahn: [C: 04-1] "the "include challenge-nginx" should move to https virtual host, then you don't need to worry about the http->https redirect" [puppet] - 10https://gerrit.wikimedia.org/r/334229 (owner: 10RobH) [01:02:34] (03PS1) 10Mattflaschen: Note how to re-generate flow.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334230 [01:03:16] (03PS2) 10Mattflaschen: Note how to re-generate flow.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334230 [01:03:25] (03CR) 10Mattflaschen: [C: 032] "Doc-only change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334230 (owner: 10Mattflaschen) [01:04:21] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.274 second response time [01:05:21] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.646 second response time [01:05:30] (03Merged) 10jenkins-bot: Note how to re-generate flow.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334230 (owner: 10Mattflaschen) [01:05:47] (03CR) 10jenkins-bot: Note how to re-generate flow.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334230 (owner: 10Mattflaschen) [01:07:27] (03CR) 10RobH: "So this compiled just fine for the archiva host: http://puppet-compiler.wmflabs.org/5232/meitnerium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/332709 (owner: 10RobH) [01:10:53] ok this phab update is a big one... 
[01:11:38] :) [01:11:45] !log mattflaschen@tin Synchronized wmf-config/InitialiseSettings.php: No-op documentation change to InitialiseSettings.php (duration: 00m 46s) [01:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:11:53] twentyafterfour: :) good luck [01:11:59] jouncebot: next [01:11:59] In 12 hour(s) and 48 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170126T1400) [01:12:26] logging downtime in icinga [01:13:10] (03CR) 10Dzahn: "this is the change for librenms.archiva is https://gerrit.wikimedia.org/r/#/c/334229/" [puppet] - 10https://gerrit.wikimedia.org/r/332709 (owner: 10RobH) [01:13:37] (03CR) 10RobH: "yeah i had wrong window open disregard my comment!" [puppet] - 10https://gerrit.wikimedia.org/r/332709 (owner: 10RobH) [01:16:17] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2971258 (10Tgr) >>! In T66214#2970912, @GWicke wrote: > Are those key-value parameters already supported by all media handlers, or is support limited t... 
[01:16:18] (03PS2) 10RobH: setting archiva.w.o to use le 1 or 2 [puppet] - 10https://gerrit.wikimedia.org/r/334229 [01:18:31] RECOVERY - puppet last run on ms-be1021 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [01:19:38] (03CR) 10Dzahn: [C: 031] setting archiva.w.o to use le 1 or 2 [puppet] - 10https://gerrit.wikimedia.org/r/334229 (owner: 10RobH) [01:21:24] (03CR) 10RobH: "successful puppet compiler test http://puppet-compiler.wmflabs.org/5233/meitnerium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/334229 (owner: 10RobH) [01:21:45] 06Operations, 10Traffic, 13Patch-For-Review: convert archiva to use Letsencrypt for SSL cert (deadline 2017-05-08) - https://phabricator.wikimedia.org/T154942#2971281 (10RobH) a:03RobH [01:29:11] PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:29:40] 06Operations, 10Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2971305 (10RobH) [01:33:43] (03PS1) 10Andrew Bogott: Revert "Keystone: Assign more threads to the admin API uwsgi handler" [puppet] - 10https://gerrit.wikimedia.org/r/334232 [01:35:01] (03CR) 10Andrew Bogott: [C: 032] Designate sink: Don't use keystone to resolve project_id [puppet] - 10https://gerrit.wikimedia.org/r/334228 (https://phabricator.wikimedia.org/T156297) (owner: 10Andrew Bogott) [01:36:21] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.206 second response time [01:38:21] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.578 second response time [01:42:31] 06Operations, 10ops-codfw: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2971316 
(10RobH) [01:42:34] 06Operations, 10ops-codfw, 10netops: codfw: mc2019-mc2036/switch port configuration - https://phabricator.wikimedia.org/T156212#2971314 (10RobH) 05Open>03Resolved all have had the port description set, enabled, and proper vlan (internal) set. [01:43:05] 07Puppet, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Puppet failure on instance creation - https://phabricator.wikimedia.org/T156297#2971317 (10Andrew) 05Open>03Resolved a:03Andrew 334228 should resolve this issue (although not the general keystone performance problem.) Delete and recreate y... [01:46:11] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 42.49 seconds [01:46:59] (03PS2) 10Andrew Bogott: Revert "Keystone: Assign more threads to the admin API uwsgi handler" [puppet] - 10https://gerrit.wikimedia.org/r/334232 [01:47:13] !log upgrading phabricator, downtime should be minimal but expect the service to be offline for up to a few minutes [01:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:52:04] heh [01:52:10] i was just wondering wtf happened to phab [01:52:21] PROBLEM - https://phabricator.wikimedia.org on phab2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - string focus on bug not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 2165 bytes in 0.010 second response time [01:52:26] sorry it's applying a rather large migration to the database [01:52:30] no worries [01:52:34] wtf I marked it as down in icinga [01:52:45] icinga hates all life [01:52:55] (03CR) 10Andrew Bogott: [C: 032] Revert "Keystone: Assign more threads to the admin API uwsgi handler" [puppet] - 10https://gerrit.wikimedia.org/r/334232 (owner: 10Andrew Bogott) [01:53:51] it clearly shows scheduled downtime hmm [01:56:05] paladox: thanks [01:56:15] you're welcome. [01:56:31] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [01:57:03] andrewbogott, ^ what's going on there? [01:57:42] did you just restart keystone or something? [01:58:11] RECOVERY - puppet last run on ms-be1014 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [01:58:23] Applying schema adjustments... [============================= ] 51.4% [02:02:10] !log phabricator update complete [02:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:02:21] RECOVERY - https://phabricator.wikimedia.org on phab2001 is OK: HTTP OK: HTTP/1.1 200 OK - 26538 bytes in 0.262 second response time [02:02:23] !log l10nupdate@tin LocalisationUpdate failed (1.29.0-wmf.8) at 2017-01-26 02:02:23+00:00 [02:02:23] !log l10nupdate@tin LocalisationUpdate failed (1.29.0-wmf.9) at 2017-01-26 02:02:23+00:00 [02:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:04:41] PROBLEM - MariaDB Slave IO: x1 on dbstore2001 is CRITICAL: CRITICAL slave_io_state could not connect [02:04:41] PROBLEM - MariaDB Slave SQL: x1 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state could not connect [02:04:41] PROBLEM - MariaDB Slave SQL: s6 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state could not connect [02:04:41] PROBLEM - MariaDB Slave SQL: s4 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state could not connect [02:04:41] PROBLEM - MariaDB Slave IO: s4 on dbstore2001 is CRITICAL: CRITICAL slave_io_state could not connect [02:04:42] PROBLEM - MariaDB Slave SQL: m2 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state could not connect [02:04:42] PROBLEM - MariaDB Slave SQL: s2 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state could not connect [02:04:51] PROBLEM - MariaDB Slave IO: s6 on dbstore2001 is CRITICAL: CRITICAL slave_io_state could not connect [02:04:51] 
PROBLEM - MariaDB Slave SQL: s1 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state could not connect [02:04:51] PROBLEM - MariaDB Slave SQL: m3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state could not connect [02:04:51] PROBLEM - MariaDB Slave IO: m2 on dbstore2001 is CRITICAL: CRITICAL slave_io_state could not connect [02:05:11] PROBLEM - MariaDB Slave IO: s7 on dbstore2001 is CRITICAL: CRITICAL slave_io_state could not connect [02:05:11] PROBLEM - MariaDB Slave IO: m3 on dbstore2001 is CRITICAL: CRITICAL slave_io_state could not connect [02:05:11] PROBLEM - MariaDB Slave IO: s3 on dbstore2001 is CRITICAL: CRITICAL slave_io_state could not connect [02:05:11] PROBLEM - MariaDB Slave SQL: s7 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state could not connect [02:05:12] PROBLEM - MariaDB Slave SQL: s3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state could not connect [02:05:21] PROBLEM - MariaDB Slave IO: s2 on dbstore2001 is CRITICAL: CRITICAL slave_io_state could not connect [02:05:21] PROBLEM - MariaDB Slave IO: s5 on dbstore2001 is CRITICAL: CRITICAL slave_io_state could not connect [02:05:21] PROBLEM - MariaDB Slave IO: s1 on dbstore2001 is CRITICAL: CRITICAL slave_io_state could not connect [02:05:31] PROBLEM - MariaDB Slave SQL: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state could not connect [02:06:11] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 325.25 seconds [02:07:06] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Jan 26 02:07:05 UTC 2017 (duration 4m 42s) [02:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:10:41] PROBLEM - MariaDB Slave Lag: m3 on db2012 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 599.66 seconds [02:11:41] PROBLEM - MariaDB Slave Lag: m2 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag could not connect [02:11:41] PROBLEM - MariaDB Slave Lag: s7 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag 
could not connect [02:11:41] PROBLEM - MariaDB Slave Lag: s2 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag could not connect [02:11:41] PROBLEM - MariaDB Slave Lag: s1 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag could not connect [02:11:41] PROBLEM - MariaDB Slave Lag: s3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag could not connect [02:11:51] PROBLEM - MariaDB Slave Lag: s4 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag could not connect [02:12:01] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag could not connect [02:12:11] PROBLEM - MariaDB Slave Lag: m3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag could not connect [02:12:11] PROBLEM - MariaDB Slave Lag: s6 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag could not connect [02:12:21] PROBLEM - MariaDB Slave Lag: x1 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag could not connect [02:17:14] twentyafterfour: re: downtime, you did schedule on iridium but icinga said "heh, there is another one on phab2001" [02:17:21] it seems [02:17:34] oh weird [02:18:09] so icinga thinks they are load-balanced then [02:19:54] twentyafterfour: this is what happens, there is a check that always checks a phabricator.wikimedia.org URL, and this check gets applied on each server using the phabricator puppet role [02:20:29] since it always checks phabricator.wm.org and not a specific backend.. and as long as there is just iridium as the single varnish backend. it effectively checks that one [02:20:59] just that it exists twice in Icinga, one per host [02:21:33] right [02:21:38] we could solve this by adding another of those "if active server" [02:21:44] yeah [02:22:22] or change the check command to actually talk to the backends [02:22:32] but what matters is what the users see.. so... 
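[editor's note] The duplicate-alert behaviour discussed above — the same check_http-style probe of the public phabricator.wikimedia.org URL applied once per host carrying the phabricator puppet role — can be sketched roughly as follows. This is an illustrative Python sketch, not the actual Icinga plugin; the `classify` helper and its names are assumptions for illustration only.

```python
# Rough sketch (NOT the real Icinga check_http plugin) of the probe
# discussed above: every host with the phabricator role checks the
# same public URL for an expected string, so both iridium and
# phab2001 go CRITICAL on the same outage even though only one is
# the active varnish backend.

def classify(status_code: int, body: str, expect: str = "focus on bug") -> str:
    """Map an HTTP response to a Nagios-style service state."""
    # OK only if we got a 200 and the expected string is present,
    # otherwise CRITICAL (matching the alerts seen in this log).
    if status_code == 200 and expect in body:
        return "OK"
    return "CRITICAL"
```

The alerts in the log match this logic: the 503 "Backend fetch failed" page lacks the expected string and goes CRITICAL; the restored page containing it recovers to OK.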
[02:23:13] it also makes sense this way [02:23:14] Krenair: yes, it was keystone getting restarted by puppet [02:24:31] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [02:40:29] (03PS1) 10Papaul: DNS: Add mgmt and production dns for mc2019-mc2036 Bug:T155755 [dns] - 10https://gerrit.wikimedia.org/r/334234 [02:51:59] (03CR) 10Dzahn: [C: 04-1] "analytics1015 and analytics1026 are added, but shouldn't. that's because i removed those today and rebasing" [dns] - 10https://gerrit.wikimedia.org/r/334234 (owner: 10Papaul) [02:52:51] PROBLEM - puppet last run on ms-fe1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:03:21] RECOVERY - MariaDB Slave IO: s7 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional) [03:03:21] RECOVERY - MariaDB Slave SQL: s7 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [03:03:21] RECOVERY - MariaDB Slave Lag: s6 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 3644.11 seconds [03:03:21] RECOVERY - MariaDB Slave SQL: s3 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [03:03:21] RECOVERY - MariaDB Slave IO: s3 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional) [03:03:22] RECOVERY - MariaDB Slave IO: m3 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional) [03:03:22] RECOVERY - MariaDB Slave Lag: m3 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 3755.13 seconds [03:03:23] RECOVERY - MariaDB Slave Lag: x1 on dbstore2001 is OK: OK slave_sql_lag not a slave [03:03:23] RECOVERY - MariaDB Slave IO: s5 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional) [03:03:24] RECOVERY - MariaDB Slave IO: s2 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional) [03:03:24] 
RECOVERY - MariaDB Slave IO: s1 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional) [03:03:31] RECOVERY - MariaDB Slave SQL: s5 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [03:03:41] RECOVERY - MariaDB Slave Lag: s7 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 3668.48 seconds [03:03:41] RECOVERY - MariaDB Slave Lag: s1 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 22726.48 seconds [03:03:41] RECOVERY - MariaDB Slave Lag: m2 on dbstore2001 is OK: OK slave_sql_lag not a slave [03:03:41] RECOVERY - MariaDB Slave Lag: s2 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 3774.49 seconds [03:03:41] RECOVERY - MariaDB Slave IO: x1 on dbstore2001 is OK: OK slave_io_state not a slave [03:03:42] RECOVERY - MariaDB Slave IO: s4 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional) [03:03:42] RECOVERY - MariaDB Slave SQL: s4 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [03:03:43] RECOVERY - MariaDB Slave SQL: s6 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [03:03:43] RECOVERY - MariaDB Slave SQL: m2 on dbstore2001 is OK: OK slave_sql_state not a slave [03:03:44] RECOVERY - MariaDB Slave SQL: s2 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [03:03:51] RECOVERY - MariaDB Slave SQL: x1 on dbstore2001 is OK: OK slave_sql_state not a slave [03:03:51] RECOVERY - MariaDB Slave Lag: s3 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 3671.49 seconds [03:03:51] RECOVERY - MariaDB Slave SQL: m3 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [03:03:51] RECOVERY - MariaDB Slave Lag: s4 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 3730.18 seconds [03:03:51] RECOVERY - MariaDB Slave SQL: s1 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: 
intentional) [03:03:51] RECOVERY - MariaDB Slave IO: s6 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional) [03:03:51] RECOVERY - MariaDB Slave IO: m2 on dbstore2001 is OK: OK slave_io_state not a slave [03:04:01] RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 3834.99 seconds [03:16:27] (03PS1) 10Dzahn: aptrepo: fix rsyncd 'hosts allow' syntax [puppet] - 10https://gerrit.wikimedia.org/r/334237 (https://phabricator.wikimedia.org/T132757) [03:17:23] (03PS2) 10Dzahn: aptrepo: fix rsyncd 'hosts allow' syntax [puppet] - 10https://gerrit.wikimedia.org/r/334237 (https://phabricator.wikimedia.org/T132757) [03:18:53] (03CR) 10Dzahn: [C: 032] aptrepo: fix rsyncd 'hosts allow' syntax [puppet] - 10https://gerrit.wikimedia.org/r/334237 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [03:18:59] (03PS3) 10Dzahn: aptrepo: fix rsyncd 'hosts allow' syntax [puppet] - 10https://gerrit.wikimedia.org/r/334237 (https://phabricator.wikimedia.org/T132757) [03:21:51] RECOVERY - puppet last run on ms-fe1003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [03:26:31] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 741.16 seconds [03:30:03] At Enwiki, for the last ~20 minutes, I'm getting this upon [Save changes]: "We could not process your edit due to a loss of session data. Please try saving your changes again. If it still does not work, try logging out and logging back in." [03:30:05] I've tried logging-out/in, and (logging-out + clearing browser cache+history + logging-in), without improvement. [03:30:41] quiddity, using firefox >= 50 ? [03:30:59] yup. 
50.1.0 [03:31:23] (Also: replicated at Frwiki, but things work ok at mw.o) [03:31:31] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 194.21 seconds [03:32:01] quiddity, https://phabricator.wikimedia.org/T151770 [03:36:24] jynus, perfect, ty! That fixed it. [03:38:21] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.903 second response time [03:39:03] phabricator upgrade brought down all phabricator databases [03:39:21] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.419 second response time [03:50:12] !log rsyncing apt.wikimedia.org data from carbon to install2001 (T84380) [03:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:50:19] T84380: Setup install server in codfw - tftp done, but not apt and other install services - https://phabricator.wikimedia.org/T84380 [03:55:19] (03Abandoned) 10Papaul: DNS: Add mgmt and production dns for mc2019-mc2036 Bug:T155755 [dns] - 10https://gerrit.wikimedia.org/r/334234 (owner: 10Papaul) [03:59:21] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:59:21] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.604 second response time [04:00:21] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.670 second response time [04:28:21] RECOVERY - puppet last run on snapshot1006 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [04:40:45] (03PS1) 10Alex Monk: Interwiki map update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334240 (https://phabricator.wikimedia.org/T156334) [04:42:59] (03CR) 10Chad: [C: 031] Interwiki map update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334240 (https://phabricator.wikimedia.org/T156334) (owner: 10Alex Monk) [04:43:57] (03PS1) 10Dzahn: aptrepo: rsync cron (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/334241 [04:44:58] (03CR) 10jerkins-bot: [V: 04-1] aptrepo: rsync cron (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/334241 (owner: 10Dzahn) [04:53:41] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.109 second response time [04:57:41] (03CR) 10BearND: [C: 04-1] "to flag the question in my previous comment" [puppet] - 10https://gerrit.wikimedia.org/r/333158 (https://phabricator.wikimedia.org/T155504) (owner: 10Ema) [05:04:51] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR [05:04:51] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, 
interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-3/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]BR [05:05:11] PROBLEM - puppet last run on db1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:07:52] (03PS12) 10Juniorsys: geowiki module: Lint changes + modes/umask quoting [puppet] - 10https://gerrit.wikimedia.org/r/332101 (https://phabricator.wikimedia.org/T93645) [05:19:41] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.537 second response time [05:25:01] 06Operations, 10ops-codfw: Codfw: Missing mgmt dns for db2025-db2027 - https://phabricator.wikimedia.org/T156342#2971494 (10Papaul) [05:27:28] 07Puppet, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Puppet failure on instance creation - https://phabricator.wikimedia.org/T156297#2971508 (10DeltaQuad) @Andrew It appears recreation was successful, although the number of notices on the console output is crazy. Is that to be expected? 
[05:34:11] RECOVERY - puppet last run on db1047 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [05:38:21] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 1.039 second response time [05:38:26] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 12 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2971513 (10Nuria) [05:39:21] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.533 second response time [05:58:21] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.940 second response time [05:59:03] icinga-wm: you lie. I see OK [06:00:21] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.373 second response time [06:34:23] (03PS3) 10Jcrespo: site.pp: Change active master for enwiki [puppet] - 10https://gerrit.wikimedia.org/r/334030 (https://phabricator.wikimedia.org/T156008) (owner: 10Marostegui) [06:36:39] (03PS1) 10Marostegui: db-eqiad.php: Change s1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334242 (https://phabricator.wikimedia.org/T156008) [06:37:44] let's depool db1052 now [06:38:46] you send the patch? 
gerrit is suuuuuuper slow for me today as well :( [06:40:21] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 1.299 second response time [06:41:21] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.082 second response time [06:45:20] (03PS1) 10Marostegui: db-eqiad.php: Depool db1052 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334243 (https://phabricator.wikimedia.org/T156008) [06:45:25] jynus: ^ [06:45:37] (03PS1) 10Jcrespo: mariadb: reduce db1052 load before switchover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334244 [06:45:45] oh [06:45:46] haha [06:45:49] lol [06:46:20] deploy one of the 2, with --force [06:46:42] to see how fast we can get it [06:46:48] ok [06:47:20] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1052 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334243 (https://phabricator.wikimedia.org/T156008) (owner: 10Marostegui) [06:47:25] I will rebase the actual switchover [06:47:54] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Depool db1052 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334243 (https://phabricator.wikimedia.org/T156008) (owner: 10Marostegui) [06:48:26] (03Abandoned) 10Jcrespo: mariadb: reduce db1052 load before switchover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334244 (owner: 10Jcrespo) [06:48:53] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1052 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334243 (https://phabricator.wikimedia.org/T156008) (owner: 10Marostegui) [06:49:04] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1052 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334243 (https://phabricator.wikimedia.org/T156008) (owner: 10Marostegui) [06:49:21] pushing the depool [06:49:24] with --force [06:49:45] !log
marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1052 - T156008 (duration: 00m 31s) [06:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:50] T156008: Switchover s1 master db1057 -> db1052 - https://phabricator.wikimedia.org/T156008 [06:49:50] 30 seconds [06:50:55] I have to manually rebase 334242 [06:51:52] it might be easier to abandon it and commit again? [06:52:04] no, please [06:52:11] let's do it right [06:52:16] :) [06:54:15] (03PS2) 10Jcrespo: db-eqiad.php: Change s1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334242 (https://phabricator.wikimedia.org/T156008) (owner: 10Marostegui) [06:54:38] I hate when conflicts mess with my change-ids [06:55:11] marostegui, give it a look already: https://gerrit.wikimedia.org/r/#/c/334242/2/wmf-config/db-eqiad.php [06:55:28] checking [06:55:37] looks good [06:56:36] then when happy, merge and prepare to deploy [06:56:58] can you also deploy the puppet change while I change the topology? [06:57:11] yep going to prepare that [06:57:31] also, we monopolize both mediawiki and puppet deploy around 7 UTC, warning for everybody [06:57:31] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:57:56] that is the only "special thing" happening [06:58:04] https://gerrit.wikimedia.org/r/#/c/334030/ -> give a last look?
[06:58:13] ^_joe_ unless a problem happens [06:58:25] going to merge mediawikiconfig [06:59:07] (03CR) 10Jcrespo: [C: 031] site.pp: Change active master for enwiki [puppet] - 10https://gerrit.wikimedia.org/r/334030 (https://phabricator.wikimedia.org/T156008) (owner: 10Marostegui) [06:59:12] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Change s1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334242 (https://phabricator.wikimedia.org/T156008) (owner: 10Marostegui) [06:59:26] going to merge puppet [06:59:53] marostegui: both are true because this is intermediate? and then later you'll put just one as true? [07:00:23] (03Merged) 10jenkins-bot: db-eqiad.php: Change s1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334242 (https://phabricator.wikimedia.org/T156008) (owner: 10Marostegui) [07:00:33] (03CR) 10jenkins-bot: db-eqiad.php: Change s1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334242 (https://phabricator.wikimedia.org/T156008) (owner: 10Marostegui) [07:00:57] sorry, got distracted [07:01:07] wait for topology change [07:01:13] yeah [07:01:16] no worries - no pushing [07:01:22] (03CR) 10Giuseppe Lavagetto: [C: 031] "small nitpick but LGTM" (031 comment) [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/334163 (owner: 10Ema) [07:01:28] you can merge without deploying both [07:01:38] (03CR) 10Marostegui: [C: 032] site.pp: Change active master for enwiki [puppet] - 10https://gerrit.wikimedia.org/r/334030 (https://phabricator.wikimedia.org/T156008) (owner: 10Marostegui) [07:01:39] yeah I have already merged mediawikiconfig [07:01:43] going to merge puppet [07:02:24] puppet merged [07:03:28] jynus: I think we are good to start changing topology [07:03:42] yes, I am on it [07:03:56] ok - let me know when ready for the read_only ON [07:04:02] I got bitten by a neodymium bug [07:04:09] the mysql client one [07:05:28] you can see progress here: https://tendril.wikimedia.org/tree [07:05:39] hehe yeah i am monitoring it :)
[07:06:09] can someone have a general look at mediawiki errors ? [07:06:31] <_joe_> jynus: will do [07:06:38] thanks _joe_ [07:06:56] I will be doing that in a minute [07:07:08] but this was slower than I expected [07:07:09] <_joe_> jynus: uneventful, for now [07:07:36] yep so far nothing should give errors [07:07:50] this gave errors in the past [07:08:00] and because gtid [07:08:07] it could have a regression [07:08:11] so not 100% secure [07:08:14] <_joe_> we have some DBReplication errors, but it seems not so much higher than before [07:08:14] ah, gtid.. [07:08:18] but should be ok [07:08:54] I wont be changing db1069 or the multi source ones [07:09:00] ok [07:09:03] because no gtid [07:09:26] yes, let's deal with those later [07:10:50] jynus: let's deploy the puppet change already? [07:10:57] agree [07:11:00] ok [07:11:02] deploying [07:11:49] ok, we are done [07:12:03] puppet deployed [07:12:18] https://dbtree.wikimedia.org/ for the folks following this broadcast online [07:12:18] topology looks good [07:12:50] wait [07:12:51] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:12:54] one last change [07:13:06] db1057 master itself [07:13:31] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [07:13:52] volans: can you check db1001 to see if it is related to the last puppet change? just in case? [07:13:58] <_joe_> I will [07:14:03] thanks [07:14:15] sure [07:14:21] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.476 second response time [07:14:25] <_joe_> marostegui: no, go on [07:14:29] cheers!
[07:15:21] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.924 second response time [07:15:30] jynus: when ready let me know for the read_only ON once you've gathered master status, let me know for OFF + deploy [07:16:05] wait [07:16:20] I have not prepared for the circular replication [07:18:28] !log change master of db1052 from db2016 to db1052 [07:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:51] ^ you meant db1057 right? [07:18:52] !log last message was master of db1057 [07:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:55] :-) [07:18:57] :) [07:18:59] better :) [07:20:28] I'm seeing some lag in the replication on tendril [07:20:33] yeah [07:20:41] back to normal now [07:20:43] it is the db2016 events going around [07:21:05] we will do it without replication back [07:21:31] marostegui, make sure you get some coordinates after setting read only [07:21:41] yep [07:21:51] and lets downtime db1057 [07:22:01] https://etherpad.wikimedia.org/p/switchover -> for the coordinates [07:22:43] one last thing [07:22:47] let me downtime db1057 [07:22:48] let's disable slave_pos [07:22:53] on db1052 [07:22:56] I will do that [07:23:02] ok [07:23:23] will it work? [07:23:36] lets do it when everything is in read only [07:23:48] ok [07:24:07] I am ready with that [07:24:15] you are ready for the read only ON? [07:24:21] yes [07:24:26] so let's recap before [07:24:48] ON -> coordinates -> disable db1052 -> coordinates db1052 -> OFF -> Deploy [07:25:11] we have to run "STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=no; START SLAVE;" on db1052 [07:25:19] just before setting it to on [07:25:24] maybe you can do that [07:25:25] yep [07:25:29] I will do that [07:26:02] is heartbeat active already on 52?
[07:26:19] no [07:26:25] lets do that now [07:26:37] it will work in read only [07:27:40] ok [07:27:42] let me start it [07:27:51] puppet run should start it [07:28:02] ok, running puppet [07:28:06] maybe it is already active [07:28:45] started [07:29:00] yes, I see it active [07:29:20] technically, we have a replication outage codfw -> eqiad [07:29:29] but we do not care about that [07:29:52] ok, so we are good to go to read only ON i believe [07:30:01] yes [07:30:14] read only on -> get coordinates -> disable gtid on db1052, get coordinates -> off -> deploy [07:30:33] go [07:30:37] ok [07:30:38] starting [07:31:05] all good [07:31:27] ready to deploy [07:32:01] jynus: all good to deploy? [07:32:27] yes [07:32:32] deploying [07:32:35] was comparing the coordinates [07:32:55] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Change s1 master to db1057 - T156008 (duration: 00m 20s) [07:32:56] deployed [07:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:00] T156008: Switchover s1 master db1057 -> db1052 - https://phabricator.wikimedia.org/T156008 [07:33:18] <_joe_> we had a peak of dbreplication errors around 7:27 FYI [07:33:36] that is expected [07:33:42] <_joe_> yeah I guessed so [07:33:44] but 200 is a good number [07:33:47] <_joe_> just reporting to be sure [07:33:48] really good [07:34:15] mediawiki seems happy and replication, too [07:34:23] so far all the slaves are happy [07:34:36] lets do some checks and set up the codfw -> eqiad replication [07:34:41] and the db1057 replication [07:35:18] and the first check is, can we edit/recentchanges [07:35:21] and tendril [07:35:22] (03PS7) 10Elukey: Add aqs1007 to site.pp and bootstrap aqs1007-a [puppet] - 10https://gerrit.wikimedia.org/r/334035 (https://phabricator.wikimedia.org/T155654) [07:35:30] <_joe_> elukey: back off for now [07:36:05] what time was the read only?
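The sequence recapped a few minutes earlier ("read only on -> get coordinates -> disable gtid on db1052, get coordinates -> off -> deploy") maps roughly onto the following MariaDB statements. This is a hedged reconstruction from the log, not the operators' actual terminal history; the only statement quoted verbatim in the channel is the CHANGE MASTER one.

```sql
-- On the outgoing master, db1057: stop writes and record the "coordinates".
SET GLOBAL read_only = ON;
SHOW MASTER STATUS;   -- note File and Position (into the switchover etherpad)

-- On the incoming master, db1052: drop GTID-based replication
-- (the statement quoted in the log), then record its own coordinates.
STOP SLAVE;
CHANGE MASTER TO MASTER_USE_GTID=no;
START SLAVE;
SHOW MASTER STATUS;

-- Once the topology points at db1052: re-enable writes there, then deploy
-- the db-eqiad.php change so MediaWiki writes to the new master.
SET GLOBAL read_only = OFF;
```

The read-only window measured from recentchanges was roughly 07:30:39 to 07:32:54, consistent with this ordering.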
[07:36:19] | 2017-01-26 07:30:40 | [07:36:23] _joe_ nono I am not ready, not planning to merge, thanks for the heads up [07:37:09] 7:30:39-7:32:54 [07:37:13] based on recentchanges [07:37:18] probably lower than that [07:37:32] so the 2 minute window seems plausible [07:37:47] i am having no troubles editing stuff on wikipedia [07:38:01] <_joe_> yeah I kept editing my talk page [07:38:02] congratulations [07:38:15] I saif wikitext was not that difficult! [07:38:19] :-P [07:38:26] jynus: yes, the deploy finished at 7:32:54 [07:38:30] good work [07:38:50] I actually mean, in case my joke is not understood [07:38:55] XDDD [07:39:02] but we have not finished yet [07:39:03] well done with all the topology changes, very smooth :) [07:39:06] no [07:39:13] let's monitor logstash [07:39:15] we need to swap all the multisource and dbstore delayed [07:39:24] jobs and mainteance stuff typically do not like the master change [07:39:40] oh, and several other non-critical stuff [07:39:51] <_joe_> let's see [07:40:18] <_joe_> checking a random jobrunner [07:40:39] don't like it as in, it is impossible to avoid jobs failing [07:40:41] <_joe_> seems ok [07:40:44] and on next run [07:40:49] they work [07:41:01] because long-running task are not changed instantly [07:41:03] <_joe_> yeah but a job can take up to 5 mins anyways [07:41:08] exactly [07:41:24] <_joe_> videoscaling might suffer more [07:41:28] the only fix to that [07:41:35] is to change mediawiki model [07:41:43] or move it outside of mediawiki itself [07:41:51] RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [07:41:57] <_joe_> yeah everything seems ok now [07:42:04] <_joe_> I'll monitor things during the morning [07:42:07] <_joe_> as everyone should [07:42:21] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on 
http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.916 second response time [07:43:11] marostegui, lets take on the asynchronous tasks [07:43:21] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.852 second response time [07:43:40] yep, I am sending the email so everyone is aware of that this happened and can ping us if there is something weird [07:43:49] oh, thanks [07:43:58] <_joe_> marostegui: everyone should know already, but thanks :) [07:44:16] <_joe_> marostegui: you've been a manager and it shows :D [07:44:18] * _joe_ hides [07:44:25] xdddddddddd [07:44:31] <_joe_> well done btw :) [07:44:32] buh [07:44:49] I do not know if I have been call so many terrible things [07:45:01] like joe just did :-P [07:45:05] *called [07:45:12] <_joe_> jynus: it's an insider joke; only who has been a manager can call someone else a manager [07:45:26] lol [07:45:29] hey, I have been a manager of my own company! [07:45:29] <_joe_> without a knifefight erupting, I mean [07:45:38] <_joe_> jynus: people manager [07:45:41] of 0 employees! [07:45:43] :-P [07:45:46] hahah [07:45:56] did you give yourself good performance bonuses? [07:46:03] every quarter [07:46:18] lol [07:46:30] except last week [07:46:39] where I have to fire my unique employee [07:46:42] *had [07:46:42] * Nemo_bis recalls the times when TimStarling was the only person to make a master switch [07:47:17] to be fair, I think this was a terrible job in the long term [07:47:35] but automatization is so dependent on etcs changes... 
*etcd [07:47:45] jynus: FYI I guess that tendril needs a manual adjustment [07:47:50] yes [07:47:58] not a priority yet [07:48:04] first, the topology [07:48:28] I will fix db1052's master [07:48:44] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/5236" [puppet] - 10https://gerrit.wikimedia.org/r/334035 (https://phabricator.wikimedia.org/T155654) (owner: 10Elukey) [07:48:45] I will leave db1057's master for Manuel, which depends on the coordinates he got [07:48:57] <_joe_> can we start working on puppet? [07:48:57] 06Operations, 10DBA, 10netops, 13Patch-For-Review: Switchover s1 master db1057 -> db1052 - https://phabricator.wikimedia.org/T156008#2971613 (10Marostegui) This has happened already. Times in UTC: Preparation of all the code, topology changes etc: 06:30-07:30 read only on: 07:30:40 do all the necessary c... [07:49:05] <_joe_> or should we wait? [07:49:35] (03PS4) 10Giuseppe Lavagetto: wmflib: add function to calculate htpasswd entries [puppet] - 10https://gerrit.wikimedia.org/r/334125 [07:52:53] _joe_, no blockers on puppet/mediawiki [07:52:57] sorry [07:52:59] <_joe_> jynus: thanks [07:53:18] it was just before the deploy [07:53:21] (03CR) 10Giuseppe Lavagetto: [C: 032] wmflib: add function to calculate htpasswd entries [puppet] - 10https://gerrit.wikimedia.org/r/334125 (owner: 10Giuseppe Lavagetto) [07:53:31] sorry I was blocking you [07:53:36] <_joe_> jynus: yeah just not to mess with your work [07:53:40] <_joe_> no you were not [07:53:52] <_joe_> I can do other things if I'm not deploying code :P [07:53:52] yeah, a rebase takes so much time [07:54:14] which is why we want to get rid of that workflow [07:54:18] so no one has ridiculous times when doing a git fetch but me and moritzm?
marostegui: define ridiculous :) [07:54:40] no, people complained regularly, although [07:54:46] not especially lately [07:55:03] Receiving objects: 100% (58219/58219), 48.34 MiB | 32.00 KiB/s [07:55:11] sometimes the queue gets blocked [07:55:23] there are commands somewhere [07:55:27] to monitor them [07:55:31] RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK [07:58:15] marostegui: for a pull --recurse-submodules it takes 6 and I don't get the receiving object for a fetch as of now [08:04:03] (03PS2) 10Muehlenhoff: Switch app servers in codfw to systemd-timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/332976 (https://phabricator.wikimedia.org/T150257) [08:04:19] (03PS3) 10Giuseppe Lavagetto: profile::etcd::tlsproxy: nginx auth proxy for etcd [puppet] - 10https://gerrit.wikimedia.org/r/334126 (https://phabricator.wikimedia.org/T156009) [08:05:13] (03CR) 10jerkins-bot: [V: 04-1] profile::etcd::tlsproxy: nginx auth proxy for etcd [puppet] - 10https://gerrit.wikimedia.org/r/334126 (https://phabricator.wikimedia.org/T156009) (owner: 10Giuseppe Lavagetto) [08:15:00] (03PS1) 10Muehlenhoff: Switch to using openssl11 as the source package [debs/openssl11] - 10https://gerrit.wikimedia.org/r/334245 [08:19:31] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 2.843 second response time [08:20:21] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.575 second response time [08:31:51] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [08:37:36] (03PS3) 10Jcrespo: Point analytics slaves to the right hosts [dns] - 10https://gerrit.wikimedia.org/r/326913 [08:39:21] (03CR) 10Jcrespo: [C: 032] Point analytics slaves to the
right hosts [dns] - 10https://gerrit.wikimedia.org/r/326913 (owner: 10Jcrespo) [08:41:45] (03PS1) 10Jcrespo: Update s1-master CNAME to point to the new master: db1052 [dns] - 10https://gerrit.wikimedia.org/r/334248 [08:42:35] (03CR) 10Marostegui: [C: 031] Update s1-master CNAME to point to the new master: db1052 [dns] - 10https://gerrit.wikimedia.org/r/334248 (owner: 10Jcrespo) [08:45:49] (03PS2) 10Ema: Pass config file name as a CLI argument [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/334163 [08:46:01] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-3/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]BR [08:46:17] (03CR) 10Jcrespo: [C: 032] Update s1-master CNAME to point to the new master: db1052 [dns] - 10https://gerrit.wikimedia.org/r/334248 (owner: 10Jcrespo) [08:48:07] !log deploying dns CNAME updates due to master switchover [08:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:44] ssh s1-master.eqiad.wmnet [08:48:55] db1052:~$ [08:48:57] !log Change db1069 to replicate from the new s1 master db1052 - T156008 [08:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:01] T156008: Switchover s1 master db1057 -> db1052 - https://phabricator.wikimedia.org/T156008 [08:49:07] oops, wrong window [08:49:33] jynus: works for me :) [08:49:41] PROBLEM - carbon-cache@c service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is failed [08:50:11] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:50:12] (03PS3) 10Ema: Pass config file name as a CLI argument [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/334163 [08:54:06] graphite1003 known ^ looking at it [08:54:41] RECOVERY - carbon-cache@c service on graphite1003 is OK: OK - carbon-cache@c is active [08:55:11] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational [08:55:51] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [08:57:41] !log Change db1047 to replicate from the new s1 master db1052 - T156008 [08:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:45] T156008: Switchover s1 master db1057 -> db1052 - https://phabricator.wikimedia.org/T156008 [09:04:31] !log Change dbstore1002 to replicate from the new s1 master db1052 - T156008 [09:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:35] T156008: Switchover s1 master db1057 -> db1052 - https://phabricator.wikimedia.org/T156008 [09:11:59] 06Operations, 10DBA, 10netops, 13Patch-For-Review: Switchover s1 master db1057 -> db1052 - https://phabricator.wikimedia.org/T156008#2971729 (10Marostegui) recap of the cleanup work: dns changed for s1-master.eqiad.wmnet multisource slaves changed (only pending dbstore1001): db1047, db1069,dbstore1002 rep... 
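The multi-source hosts repointed above (db1047, db1069, dbstore1002) do not use GTID, as noted earlier in the channel, so each is moved with explicit binlog coordinates. A sketch using MariaDB's named-connection syntax; the connection name 's1' and the file/position values are placeholders for illustration, not the real coordinates from this switchover:

```sql
-- On e.g. dbstore1002: move the s1 replication stream to the new master.
STOP SLAVE 's1';
CHANGE MASTER 's1' TO
    MASTER_HOST='db1052.eqiad.wmnet',
    MASTER_LOG_FILE='db1052-bin.000001',  -- placeholder coordinates
    MASTER_LOG_POS=4;                     -- placeholder coordinates
START SLAVE 's1';
SHOW SLAVE 's1' STATUS\G
```

The same pattern would apply to dbstore1001 once its delayed replication catches up, per the remaining item in T156008.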
[09:14:23] (03PS4) 10Giuseppe Lavagetto: profile::etcd::tlsproxy: nginx auth proxy for etcd [puppet] - 10https://gerrit.wikimedia.org/r/334126 (https://phabricator.wikimedia.org/T156009) [09:15:46] (03PS1) 10Gilles: Upgrade to 0.1.33 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/334251 (https://phabricator.wikimedia.org/T151066) [09:17:43] (03PS1) 10Gilles: Revert "Remove broken Thumbor IP throttling from configuration" [puppet] - 10https://gerrit.wikimedia.org/r/334252 [09:20:12] * mark sees the db switch went smooth, well done guys [09:20:21] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 2.102 second response time [09:20:25] let's make this a more routine occurrence again, with more automation too :) [09:21:21] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.844 second response time [09:22:57] (03PS1) 10Marostegui: s1.hosts: db1052 is the new master [software] - 10https://gerrit.wikimedia.org/r/334254 (https://phabricator.wikimedia.org/T156008) [09:23:21] PROBLEM - puppet last run on uranium is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [09:25:21] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.543 second response time [09:26:21] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.542 second response time [09:26:25] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::etcd::tlsproxy: nginx auth proxy for etcd [puppet] - 10https://gerrit.wikimedia.org/r/334126 (https://phabricator.wikimedia.org/T156009) (owner: 10Giuseppe Lavagetto) [09:29:21] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.221 second response time [09:30:21] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 2.548 second response time [09:33:21] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 1.015 second response time [09:34:13] (03CR) 10Jcrespo: [C: 031] s1.hosts: db1052 is the new master [software] - 10https://gerrit.wikimedia.org/r/334254 (https://phabricator.wikimedia.org/T156008) (owner: 10Marostegui) [09:34:21] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.817 second response time [09:34:28] (03CR) 10Jcrespo: [C: 031] "Great catch." 
[software] - 10https://gerrit.wikimedia.org/r/334254 (https://phabricator.wikimedia.org/T156008) (owner: 10Marostegui) [09:35:35] (03PS3) 10Giuseppe Lavagetto: conf2xx: install etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/334127 (https://phabricator.wikimedia.org/T156009) [09:35:38] (03CR) 10Muehlenhoff: [C: 032] Switch to using openssl11 as the source package [debs/openssl11] - 10https://gerrit.wikimedia.org/r/334245 (owner: 10Muehlenhoff) [09:39:04] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Check the size of every cluster in codfw to see if it matches eqiad's capacity - https://phabricator.wikimedia.org/T156023#2971835 (10fgiunchedi) Clusters where eqiad has more **hosts** than codfw. Note that the... [09:39:33] !log Enable semi-sync replication on db1052 (s1 master) - T156008 [09:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:38] T156008: Switchover s1 master db1057 -> db1052 - https://phabricator.wikimedia.org/T156008 [09:40:40] (03PS1) 10Giuseppe Lavagetto: Adding private data for configcluster in codfw [labs/private] - 10https://gerrit.wikimedia.org/r/334255 [09:43:18] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Adding private data for configcluster in codfw [labs/private] - 10https://gerrit.wikimedia.org/r/334255 (owner: 10Giuseppe Lavagetto) [09:46:53] (03PS1) 10Jcrespo: mariadb: Move db1057 to be a regular slave on config after switch [puppet] - 10https://gerrit.wikimedia.org/r/334256 (https://phabricator.wikimedia.org/T156008) [09:48:43] (03CR) 10Muehlenhoff: [C: 04-1] "Some additional tweaks" (032 comments) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/333475 (owner: 10Paladox) [09:51:30] (03CR) 10Jcrespo: [C: 032] mariadb: Move db1057 to be a regular slave on config after switch [puppet] - 10https://gerrit.wikimedia.org/r/334256 (https://phabricator.wikimedia.org/T156008) (owner: 10Jcrespo) [09:51:31] RECOVERY - puppet last run on uranium is 
OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [09:51:52] (03PS4) 10Giuseppe Lavagetto: conf2xx: install etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/334127 (https://phabricator.wikimedia.org/T156009) [09:54:59] !log Disable semi-sync on db1057 old s1 master - https://phabricator.wikimedia.org/T156008 [09:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:11] (03CR) 10Marostegui: [C: 032] s1.hosts: db1052 is the new master [software] - 10https://gerrit.wikimedia.org/r/334254 (https://phabricator.wikimedia.org/T156008) (owner: 10Marostegui) [09:56:36] (03PS1) 10Jcrespo: prometheus-mysql-exporter: Change db1052 to be s1-master [puppet] - 10https://gerrit.wikimedia.org/r/334259 (https://phabricator.wikimedia.org/T156008) [09:57:16] (03Merged) 10jenkins-bot: s1.hosts: db1052 is the new master [software] - 10https://gerrit.wikimedia.org/r/334254 (https://phabricator.wikimedia.org/T156008) (owner: 10Marostegui) [09:58:16] (03CR) 10Jcrespo: [C: 032] prometheus-mysql-exporter: Change db1052 to be s1-master [puppet] - 10https://gerrit.wikimedia.org/r/334259 (https://phabricator.wikimedia.org/T156008) (owner: 10Jcrespo) [09:58:45] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Check the size of every cluster in codfw to see if it matches eqiad's capacity - https://phabricator.wikimedia.org/T156023#2971903 (10fgiunchedi) re: misc, I gave a quick look at both lists of hosts and excludin... 
[10:00:26] (03PS2) 10Filippo Giunchedi: prometheus: add memcached aggregation and additional rules [puppet] - 10https://gerrit.wikimedia.org/r/333915 [10:01:10] 06Operations, 10DBA, 10netops, 13Patch-For-Review: Switchover s1 master db1057 -> db1052 - https://phabricator.wikimedia.org/T156008#2971915 (10jcrespo) only pending: * change dbstore1001 to replicate from db1052 [10:01:24] (03PS5) 10Giuseppe Lavagetto: conf2xx: install etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/334127 (https://phabricator.wikimedia.org/T156009) [10:03:03] (03PS8) 10Elukey: Add aqs1007 to site.pp and bootstrap aqs1007-a [puppet] - 10https://gerrit.wikimedia.org/r/334035 (https://phabricator.wikimedia.org/T155654) [10:04:04] (03CR) 10Filippo Giunchedi: "Do you need to add aqs1007 to conftool now too or as a separate step? I think both would work as hosts start disabled anyway in pybal." [puppet] - 10https://gerrit.wikimedia.org/r/334035 (https://phabricator.wikimedia.org/T155654) (owner: 10Elukey) [10:04:19] (03PS6) 10Giuseppe Lavagetto: conf2xx: install etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/334127 (https://phabricator.wikimedia.org/T156009) [10:04:30] (03CR) 10Elukey: [C: 031] prometheus: add memcached aggregation and additional rules [puppet] - 10https://gerrit.wikimedia.org/r/333915 (owner: 10Filippo Giunchedi) [10:05:52] <_joe_> uhm, gerrit very very slow [10:06:33] <_joe_> now seems better [10:06:33] seems normal speed here [10:06:37] heh [10:06:40] <_joe_> just a fluke, yes [10:07:02] <_joe_> I would bet he went to take out the garbage [10:07:08] snoozing on the job! [10:07:23] (03CR) 10Elukey: "This is a very good point, I didn't think about conftool picking up aqs1007. 
Since it will be added in pybal as disabled I don't see any b" [puppet] - 10https://gerrit.wikimedia.org/r/334035 (https://phabricator.wikimedia.org/T155654) (owner: 10Elukey) [10:08:12] (03PS3) 10Filippo Giunchedi: prometheus: add memcached aggregation and additional rules [puppet] - 10https://gerrit.wikimedia.org/r/333915 [10:10:31] elukey: yeah should be ok to add 1007 to conftool-data in the code review too [10:12:49] (03PS7) 10Giuseppe Lavagetto: conf2xx: install etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/334127 (https://phabricator.wikimedia.org/T156009) [10:13:03] godog: ahhh sorry I thought that it would have been added automagically, but then I remembered conftool-data in hiera :/ [10:13:30] so it might be fine to add the node afterwards [10:13:55] the thing that I want to be super sure of is that the cassandra cluster will not blow up [10:16:18] 06Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T155907#2971958 (10fgiunchedi) a:03Cmjohnson moving to @Cmjohnson for disk replacement [10:17:05] I'm fairly sure it isn't going to, even if the default cassandra instance starts bootstrapping [10:17:48] godog: you are talking about me! There is the elukey factor when applying puppet patches [10:17:54] never underestimate it [10:17:58] :P [10:19:02] heheh fair enough [10:19:20] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: add memcached aggregation and additional rules [puppet] - 10https://gerrit.wikimedia.org/r/333915 (owner: 10Filippo Giunchedi) [10:19:41] NICE --^ [10:19:43] thanks a lot [10:20:22] np! [10:20:28] godog: jokes aside, let me know if the patch looks good to you or if I need to make more changes, I trust your judgement after all your restbase bootstraps :D [10:22:24] elukey: yup, LGTM, do you want to put in the conftool-data change too? 
[10:22:58] godog: I'll do it afterwards, no real pressure now [10:23:17] (03CR) 10Filippo Giunchedi: [C: 031] Add aqs1007 to site.pp and bootstrap aqs1007-a [puppet] - 10https://gerrit.wikimedia.org/r/334035 (https://phabricator.wikimedia.org/T155654) (owner: 10Elukey) [10:23:26] thanks! :) [10:23:32] elukey: 👍 [10:23:56] I'll also wait for Eric's feedback and then I'll start with the work [10:24:13] (also I prefer to have him around when I do the first bootstrap JUST IN CASE) [10:31:03] 06Operations: Pages with an &stable=1 in their URL could not be viewed or edited - https://phabricator.wikimedia.org/T156356#2971987 (10TTO) p:05Triage>03Unbreak! [10:33:11] 06Operations: Pages with an &stable=1 in their URL could not be viewed or edited - https://phabricator.wikimedia.org/T156356#2971975 (10Joe) I can reproduce the problem. Any idea since when is this happening? [10:33:51] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [10:34:05] 06Operations: Pages with an &stable=1 in their URL could not be viewed or edited - https://phabricator.wikimedia.org/T156356#2971975 (10TTO) Pages with pending changes (all these: https://de.wiktionary.org/wiki/Spezial:Seiten_mit_ungesichteten_Versionen) can't be edited or read - a totally blank page is returned... [10:35:43] 06Operations: Pages with an &stable=1 in their URL could not be viewed or edited - https://phabricator.wikimedia.org/T156356#2971975 (10Schnark) Is this {T156310}? At least that task is about a fatal in FlaggedRevs in wmf.9. [10:39:54] 06Operations: Pages with an &stable=1 in their URL could not be viewed or edited - https://phabricator.wikimedia.org/T156356#2972013 (10Schnark) >>! In T156356#2971990, @Joe wrote: > I can reproduce the problem. Any idea since when is this happening? It obviously isn't broken in wmf.8, otherwise de.wikipedia wo... 
[10:41:09] 06Operations: Pages with an &stable=1 in their URL could not be viewed or edited - https://phabricator.wikimedia.org/T156356#2972017 (10Joe) The error is the following: ``` Jan 26 10:36:07 mwdebug1001 hhvm: #012Fatal error: Call to undefined method Revision::getText() in /srv/mediawiki/php-1.29.0-wmf.9/extensio... [10:41:53] jouncebot: next [10:41:53] In 3 hour(s) and 18 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170126T1400) [10:41:53] 06Operations: Pages with an &stable=1 in their URL could not be viewed or edited - https://phabricator.wikimedia.org/T156356#2972018 (10Joe) >>! In T156356#2972013, @Schnark wrote: >>>! In T156356#2971990, @Joe wrote: >> I can reproduce the problem. Any idea since when is this happening? > > It obviously isn't... [10:41:57] <_joe_> hashar: ^^ [10:42:08] <_joe_> we should rollback group 1 wikis to wmf.8 [10:42:20] oh reallly :( [10:42:28] <_joe_> read the ticket [10:43:10] 06Operations, 10MediaWiki-Releasing, 06Release-Engineering-Team: Pages with an &stable=1 in their URL could not be viewed or edited - https://phabricator.wikimedia.org/T156356#2972021 (10Joe) [10:43:25] 06Operations, 10MediaWiki-Releasing, 06Release-Engineering-Team: Pages with an &stable=1 in their URL could not be viewed or edited - https://phabricator.wikimedia.org/T156356#2972022 (10jcrespo) [10:44:47] ^is this the right way to communicate this? [10:45:20] that is just FlaggedRevs and &stable=1 right? [10:45:38] I desperately need a coffee to finish my wake() call [10:45:59] <_joe_> hashar: "just" that, yes [10:46:21] well it's less severe than having restbase completely exploding or the whole wiki yielding white pages :] [10:46:23] <_joe_> but we need to rollback. 
Not in a hurry, but we need to [10:46:32] reading the ticket on phone while brewing coffee [10:46:33] <_joe_> I disagree with the former [10:46:44] and yeah i guess I will just rollback [10:46:44] <_joe_> I agree with the latter [10:46:54] :] [10:47:09] <_joe_> and, thanks. it can wait for your coffee, I think [10:47:45] brb [10:47:59] digesting the task and reading related links / code etc [10:51:19] hashar: interestingly this is the sort of thing Phan would pick out super easily (once we get it running on all wmf deployed extensions) ;) [10:53:18] addshore: yup roger that :] [10:53:31] _joe_: there is another task with a patch by maxsem [10:53:38] gonna repro on beta [10:53:40] cherry pick the patch [10:53:42] confirm it works [10:53:44] thinking ahead, on branch could do a phan run for mediawiki & all extensions that are deployed etc ;) [10:53:45] and deploy that patch [10:53:48] so we save ourselves a rollback [10:54:11] <_joe_> hashar: or, we roll back, we test the patch appropriately, apply it, test it on group 0, and re-roll forward [10:54:12] https://gerrit.wikimedia.org/r/#/c/334223/1/backend/FlaggedRevision.php [10:54:31] _joe_: yeah that saves prod right now [10:54:31] <_joe_> so we minimize the already embarrassing outage playing it safe [10:57:04] (03PS1) 10Juniorsys: Linting fixes (Multiple modules) [puppet] - 10https://gerrit.wikimedia.org/r/334276 (https://phabricator.wikimedia.org/T93645) [10:57:44] (03PS1) 10Hashar: Revert "group1 wikis to 1.29.0-wmf.9" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334277 (https://phabricator.wikimedia.org/T156310) [10:57:47] (03PS1) 10Juniorsys: deployment: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334278 (https://phabricator.wikimedia.org/T93645) [10:58:40] (03PS1) 10Juniorsys: dnsrecursor: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334279 (https://phabricator.wikimedia.org/T93645) [10:58:54] hashar: I +1 what Giuseppe said, better to restore a known working state and then rollout a 
new version afterwards [10:59:04] yeah [10:59:15] 06Operations, 06Labs, 10netops: asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues - https://phabricator.wikimedia.org/T155875#2972065 (10Marostegui) Hi, The pending work of: T156008 shouldn't be a blocker to replace the switch. The switchover was done, and only pending to move dbstore1001 to replicate f... [10:59:28] (03PS1) 10Juniorsys: docker: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334280 (https://phabricator.wikimedia.org/T93645) [10:59:29] that is the whole difference between ops (restore asap) and the lame hacker I pretend to be who tries to monkey patch stuff :D [10:59:46] (03CR) 10Hashar: [C: 032] Revert "group1 wikis to 1.29.0-wmf.9" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334277 (https://phabricator.wikimedia.org/T156310) (owner: 10Hashar) [11:00:02] <_joe_> hashar: to .8, I guess? [11:00:06] (03PS1) 10Juniorsys: elasticsearch: Lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/334281 (https://phabricator.wikimedia.org/T93645) [11:00:15] _joe_: ah yeah the header is confusing [11:00:21] that reverts group1 to wmf.9 [11:00:26] or move them back to wmf.8 [11:00:34] <_joe_> ahah yes [11:00:36] <_joe_> sorry [11:00:46] <_joe_> I somehow missed the quotes, duh [11:01:03] I just reverted the commit mukunda had pushed yesterday [11:01:07] and double checked the diff https://gerrit.wikimedia.org/r/#/c/334277/1/wikiversions.json [11:01:09] (03PS1) 10Juniorsys: etcd: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334282 (https://phabricator.wikimedia.org/T93645) [11:01:46] addshore: most probably we will want phan enabled on all deployed extensions soonish :] not sure how much of a nightmare it is to do so though [11:01:52] (03PS1) 10Juniorsys: eventlogging/eventstreams: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334283 (https://phabricator.wikimedia.org/T93645) [11:02:10] (03CR) 10Marostegui: [C: 031] site.pp, DHCP: remove db1019, db1042 [puppet] - 
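[Editor's note] The wmf.9 fatal being discussed is a classic late-binding failure: core finally removed the long-deprecated Revision::getText(), FlaggedRevs still called it, and nothing broke until the &stable=1 code path actually ran — which is also why addshore notes that static analysis (Phan) would have caught it before deployment. A minimal Python sketch of the same hazard; the class and method names are illustrative, not MediaWiki's real API:

```python
class Revision:
    """Stand-in for a core class after a deprecated accessor was removed."""

    def __init__(self, text):
        self._text = text

    # getText() used to live here; in the new branch it is gone.
    def get_content(self):
        return self._text


def flaggedrevs_render(rev):
    # Extension code still calling the removed method. This only fails
    # when the line executes, so the code loads fine and the bug only
    # surfaces on the affected request path.
    return rev.getText()


rev = Revision("wikitext")
try:
    flaggedrevs_render(rev)
except AttributeError as exc:
    print(f"fatal: {exc}")
```

Since the call site parses and loads without error, only runtime traffic (or a static analyzer walking every call) exposes the removal — exactly how this shipped to group1.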
10https://gerrit.wikimedia.org/r/334010 (https://phabricator.wikimedia.org/T149793) (owner: 10Dzahn) [11:02:41] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.29.0-wmf.9" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334277 (https://phabricator.wikimedia.org/T156310) (owner: 10Hashar) [11:02:51] (03CR) 10jenkins-bot: Revert "group1 wikis to 1.29.0-wmf.9" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334277 (https://phabricator.wikimedia.org/T156310) (owner: 10Hashar) [11:03:45] (03PS1) 10Juniorsys: extdist: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334284 [11:03:46] 06Operations, 10DBA: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2972088 (10Marostegui) a:03Marostegui [11:04:09] patch is on mwdebug1001 [11:04:23] hopefully [11:04:42] (03PS1) 10Juniorsys: gerrit/git/graphite: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334285 (https://phabricator.wikimedia.org/T93645) [11:04:47] <_joe_> hashar: lemme test [11:05:12] <_joe_> still getting the error [11:06:02] guess I screwed up the sync wikiversion [11:06:26] (03PS1) 10Juniorsys: icinga: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334286 (https://phabricator.wikimedia.org/T93645) [11:06:30] <_joe_> let me try one thing [11:06:50] <_joe_> nah, you must have done something wrong [11:06:57] yeah [11:07:01] I did just scap pull [11:07:09] but that does not seem to rebuild the wikiversion on mwdebug1001 bah [11:07:53] (03PS1) 10Juniorsys: jupterhub/keyholder: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334287 (https://phabricator.wikimedia.org/T93645) [11:08:38] (03PS1) 10Juniorsys: jenkins: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334288 (https://phabricator.wikimedia.org/T93645) [11:08:38] brion, https://phabricator.wikimedia.org/T156185 could you have a look please? 
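[Editor's note] For context on the revert being synced here: wikiversions.json maps each wiki's dbname to the MediaWiki branch directory it runs, and the rollback simply flips the group1 entries from wmf.9 back to wmf.8 before the map is compiled to wikiversions.php for the app servers (the compile step is what `scap pull` alone missed on mwdebug1001). A toy sketch with a made-up grouping — the real file covers hundreds of wikis:

```python
# Toy wikiversions map: dbname -> branch directory under /srv/mediawiki.
wikiversions = {
    "testwiki": "php-1.29.0-wmf.9",      # group0 stays on the new branch
    "dewiktionary": "php-1.29.0-wmf.9",  # group1, where the fatal hit
    "enwiki": "php-1.29.0-wmf.8",        # group2 had not moved yet
}

# Illustrative grouping only; the real group1 list is much larger.
GROUP1 = ["dewiktionary"]


def rollback(versions, wikis, to_branch):
    """Return a new map with the given wikis pinned to to_branch."""
    out = dict(versions)
    for wiki in wikis:
        out[wiki] = to_branch
    return out


after = rollback(wikiversions, GROUP1, "php-1.29.0-wmf.8")
print(after["dewiktionary"])  # php-1.29.0-wmf.8
```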
[11:09:22] (03PS1) 10Juniorsys: k8s: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334289 (https://phabricator.wikimedia.org/T93645) [11:09:26] yannf: brion is on the US west coast so most definitely sleeping right now :] [11:10:00] yannf: he receives notifications for all activities on TimedMediaHandler though. So he got the mail [11:10:14] (03PS1) 10Juniorsys: labs modules linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334290 (https://phabricator.wikimedia.org/T93645) [11:10:41] hashar, ok thanks [11:10:47] @cli.command('wikiversions-compile', help=argparse.SUPPRESS) ! [11:11:04] (03PS1) 10Juniorsys: ldap: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334291 (https://phabricator.wikimedia.org/T93645) [11:11:27] <_joe_> hashar: do you need any assistance? else I'd go back to what I was doing before [11:11:30] _joe_: works now [11:11:37] <_joe_> hashar: ok [11:11:39] so scap pull does not refresh wikiversion.php [11:11:40] which is a bug [11:11:54] scap wikiversions-compile <-- that one doesn't show in --help [11:12:16] !log hashar@tin rebuilt wikiversions.php and synchronized wikiversions files: FlaggedRevs is broken in wmf.9 causing blank pages. T156356 T156310 [11:12:23] done [11:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:26] T156356: Pages with an &stable=1 in their URL could not be viewed or edited - https://phabricator.wikimedia.org/T156356 [11:12:26] T156310: Fatal error: Call to undefined method Revision::getText() in extensions/FlaggedRevs/backend/FlaggedRevision.php on line 480 - https://phabricator.wikimedia.org/T156310 [11:13:01] <_joe_> yes, it seems solved [11:13:40] 06Operations, 10MediaWiki-Releasing, 06Release-Engineering-Team, 13Patch-For-Review: Pages with an &stable=1 in their URL could not be viewed or edited - https://phabricator.wikimedia.org/T156356#2972123 (10Joe) @hashar rolled back to wmf.8 and I can confirm the pages I was looking at now render correctly. 
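[Editor's note] The decorator hashar pasted — `@cli.command('wikiversions-compile', help=argparse.SUPPRESS)` — explains why the subcommand is missing from `scap --help`: argparse treats a help value of `argparse.SUPPRESS` as "hide this entry from help output" while leaving it fully usable. A standalone sketch of the mechanism, using a hypothetical `--hidden` flag:

```python
import argparse

parser = argparse.ArgumentParser(prog="demo")
parser.add_argument("--visible", help="shows up in --help")
# argparse.SUPPRESS as the help string removes the option from both
# the usage line and the options list, but it still parses normally.
parser.add_argument("--hidden", help=argparse.SUPPRESS)

help_text = parser.format_help()
print("--hidden" in help_text)                       # False: not listed
print(parser.parse_args(["--hidden", "x"]).hidden)   # x
```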
[11:14:13] 06Operations, 10MediaWiki-Releasing, 06Release-Engineering-Team, 13Patch-For-Review: Pages with an &stable=1 in their URL could not be viewed or edited - https://phabricator.wikimedia.org/T156356#2972125 (10hashar) From a quick talk on IRC, @Joe recommended to rollback immediately to restore production imm... [11:14:27] 06Operations, 10MediaWiki-Releasing, 06Release-Engineering-Team, 13Patch-For-Review: Pages with an &stable=1 in their URL could not be viewed or edited - https://phabricator.wikimedia.org/T156356#2972130 (10hashar) [11:15:22] _joe_: thanks a ton :] [11:15:32] <_joe_> np [11:18:42] and now we have a spam of undefined index in DateFormatter.php bah [11:19:51] (03PS1) 10Muehlenhoff: Add more email addressed [puppet] - 10https://gerrit.wikimedia.org/r/334292 [11:22:51] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.122 second response time [11:25:04] (03PS8) 10Giuseppe Lavagetto: conf2xx: install etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/334127 (https://phabricator.wikimedia.org/T156009) [11:28:27] (03PS1) 10Juniorsys: librenms/locales/logstash/lshell linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334293 (https://phabricator.wikimedia.org/T93645) [11:29:12] (03PS1) 10Juniorsys: lvm/lvs: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334294 (https://phabricator.wikimedia.org/T93645) [11:29:35] lunch time, will test the FlaggedRevs patch on beta if all looks fine [11:29:40] will cherry pick for wmf.9 [11:29:45] and move group1 back to wmf.9 [11:29:51] (03CR) 10Muehlenhoff: [C: 032] Add more email addressed [puppet] - 10https://gerrit.wikimedia.org/r/334292 (owner: 10Muehlenhoff) [11:29:53] (03PS1) 10Juniorsys: Linting changes (multiple) [puppet] - 10https://gerrit.wikimedia.org/r/334295 
(https://phabricator.wikimedia.org/T93645) [11:30:46] (03PS1) 10Juniorsys: monitoring: linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334297 (https://phabricator.wikimedia.org/T93645) [11:31:24] (03PS1) 10Juniorsys: mysql: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334298 (https://phabricator.wikimedia.org/T93645) [11:32:37] <_joe_> hashar: maybe first test a group0 wiki? :P [11:32:43] (03PS1) 10Juniorsys: Linting changes (multiple) [puppet] - 10https://gerrit.wikimedia.org/r/334299 (https://phabricator.wikimedia.org/T93645) [11:33:26] _joe_: yes definitely [11:33:39] (03PS1) 10Juniorsys: ores/otrs/package_builder: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334300 (https://phabricator.wikimedia.org/T93645) [11:34:27] (03PS1) 10Juniorsys: openstack: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334301 (https://phabricator.wikimedia.org/T93645) [11:35:08] (03PS1) 10Juniorsys: phabricator: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334302 (https://phabricator.wikimedia.org/T93645) [11:36:53] (03PS1) 10Juniorsys: profile linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334303 (https://phabricator.wikimedia.org/T93645) [11:38:19] (03PS1) 10Muehlenhoff: Add one further account expiration/extension date [puppet] - 10https://gerrit.wikimedia.org/r/334304 [11:38:29] (03PS1) 10Alexandros Kosiaris: helium jessie DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/334305 [11:39:01] lunch & [11:39:11] (03PS1) 10Juniorsys: prometheus: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334306 (https://phabricator.wikimedia.org/T93645) [11:40:01] (03PS1) 10Juniorsys: puppet/puppet_compiler: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334307 (https://phabricator.wikimedia.org/T93645) [11:40:48] (03PS1) 10Juniorsys: planet/pmacct/programdashboard/pybal lint changes [puppet] - 10https://gerrit.wikimedia.org/r/334308 (https://phabricator.wikimedia.org/T93645) 
[11:41:30] (03PS1) 10Juniorsys: quarry: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334309 (https://phabricator.wikimedia.org/T93645) [11:42:24] (03PS2) 10Muehlenhoff: Add one further account expiration/extension date [puppet] - 10https://gerrit.wikimedia.org/r/334304 [11:43:48] (03PS1) 10Juniorsys: role: Linting changes (backup,bastionhost+others) [puppet] - 10https://gerrit.wikimedia.org/r/334310 (https://phabricator.wikimedia.org/T93645) [11:46:41] (03PS1) 10Juniorsys: redis: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334311 (https://phabricator.wikimedia.org/T93645) [11:48:32] (03CR) 10Muehlenhoff: [C: 032] Add one further account expiration/extension date [puppet] - 10https://gerrit.wikimedia.org/r/334304 (owner: 10Muehlenhoff) [11:49:36] (03PS1) 10Giuseppe Lavagetto: Add SRV records for etcd peer discovery in codfw [dns] - 10https://gerrit.wikimedia.org/r/334312 (https://phabricator.wikimedia.org/T156009) [11:50:42] (03PS1) 10Juniorsys: Linting changes (multiple) [puppet] - 10https://gerrit.wikimedia.org/r/334313 (https://phabricator.wikimedia.org/T93645) [11:50:51] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.537 second response time [11:51:44] (03CR) 10Giuseppe Lavagetto: [C: 032] Add SRV records for etcd peer discovery in codfw [dns] - 10https://gerrit.wikimedia.org/r/334312 (https://phabricator.wikimedia.org/T156009) (owner: 10Giuseppe Lavagetto) [11:54:01] PROBLEM - puppet last run on mc1032 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:00:42] (03PS1) 10Juniorsys: Linting fixes (multiple modules) [puppet] - 10https://gerrit.wikimedia.org/r/334317 (https://phabricator.wikimedia.org/T93645) [12:02:14] (03PS1) 10Juniorsys: graphoid/gridengine/grub/haproxy/hhvm lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/334319 (https://phabricator.wikimedia.org/T93645) [12:03:24] (03PS1) 10Juniorsys: ifttt/imagemagick/initramfs/interface lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/334320 (https://phabricator.wikimedia.org/T93645) [12:04:47] !log installing java security updates on aqs cluster [12:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:43] (03PS1) 10Juniorsys: Puppet style: Use one line per include/require [puppet] - 10https://gerrit.wikimedia.org/r/334322 [12:08:13] (03PS13) 10Juniorsys: geowiki module: Lint changes + modes/umask quoting [puppet] - 10https://gerrit.wikimedia.org/r/332101 (https://phabricator.wikimedia.org/T93645) [12:08:27] (03PS2) 10Juniorsys: Linting fixes (Multiple modules) [puppet] - 10https://gerrit.wikimedia.org/r/334276 (https://phabricator.wikimedia.org/T93645) [12:08:44] (03PS2) 10Juniorsys: deployment: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334278 (https://phabricator.wikimedia.org/T93645) [12:09:07] (03PS2) 10Juniorsys: dnsrecursor: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334279 (https://phabricator.wikimedia.org/T93645) [12:09:19] (03PS2) 10Juniorsys: docker: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334280 (https://phabricator.wikimedia.org/T93645) [12:09:33] (03PS2) 10Juniorsys: elasticsearch: Lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/334281 (https://phabricator.wikimedia.org/T93645) [12:09:46] (03PS2) 10Juniorsys: etcd: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334282 (https://phabricator.wikimedia.org/T93645) [12:10:00] (03PS2) 10Juniorsys: eventlogging/eventstreams: Linting 
fixes [puppet] - 10https://gerrit.wikimedia.org/r/334283 (https://phabricator.wikimedia.org/T93645) [12:10:21] (03PS2) 10Juniorsys: extdist: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334284 [12:10:35] (03PS2) 10Juniorsys: gerrit/git/graphite: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334285 (https://phabricator.wikimedia.org/T93645) [12:10:41] PROBLEM - carbon-cache@c service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is failed [12:10:50] (03PS2) 10Juniorsys: icinga: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334286 (https://phabricator.wikimedia.org/T93645) [12:11:00] (03PS2) 10Juniorsys: jupterhub/keyholder: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334287 (https://phabricator.wikimedia.org/T93645) [12:11:10] (03PS2) 10Juniorsys: jenkins: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334288 (https://phabricator.wikimedia.org/T93645) [12:11:11] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:11:22] (03PS2) 10Juniorsys: k8s: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334289 (https://phabricator.wikimedia.org/T93645) [12:11:37] (03PS2) 10Juniorsys: labs modules linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334290 (https://phabricator.wikimedia.org/T93645) [12:11:48] (03PS2) 10Juniorsys: ldap: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334291 (https://phabricator.wikimedia.org/T93645) [12:12:05] (03PS2) 10Juniorsys: librenms/locales/logstash/lshell linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334293 (https://phabricator.wikimedia.org/T93645) [12:12:20] (03PS2) 10Juniorsys: lvm/lvs: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334294 (https://phabricator.wikimedia.org/T93645) [12:12:30] (03PS2) 10Juniorsys: Linting changes (multiple) [puppet] - 10https://gerrit.wikimedia.org/r/334295 (https://phabricator.wikimedia.org/T93645) [12:12:41] (03PS2) 10Juniorsys: monitoring: linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334297 (https://phabricator.wikimedia.org/T93645) [12:12:53] (03PS2) 10Juniorsys: mysql: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334298 (https://phabricator.wikimedia.org/T93645) [12:13:05] (03PS2) 10Juniorsys: Linting changes (multiple) [puppet] - 10https://gerrit.wikimedia.org/r/334299 (https://phabricator.wikimedia.org/T93645) [12:13:59] (03PS2) 10Juniorsys: ores/otrs/package_builder: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334300 (https://phabricator.wikimedia.org/T93645) [12:15:01] (03PS2) 10Juniorsys: openstack: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334301 (https://phabricator.wikimedia.org/T93645) [12:15:34] (03PS2) 10Juniorsys: phabricator: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334302 (https://phabricator.wikimedia.org/T93645) [12:15:43] (03PS2) 10Juniorsys: profile linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334303 
(https://phabricator.wikimedia.org/T93645) [12:15:58] (03PS2) 10Juniorsys: prometheus: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334306 (https://phabricator.wikimedia.org/T93645) [12:16:08] (03PS2) 10Juniorsys: puppet/puppet_compiler: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334307 (https://phabricator.wikimedia.org/T93645) [12:16:20] (03PS2) 10Juniorsys: planet/pmacct/programdashboard/pybal lint changes [puppet] - 10https://gerrit.wikimedia.org/r/334308 (https://phabricator.wikimedia.org/T93645) [12:16:29] (03PS2) 10Juniorsys: quarry: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334309 (https://phabricator.wikimedia.org/T93645) [12:16:40] (03PS2) 10Juniorsys: role: Linting changes (backup,bastionhost+others) [puppet] - 10https://gerrit.wikimedia.org/r/334310 (https://phabricator.wikimedia.org/T93645) [12:16:50] (03PS2) 10Juniorsys: redis: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334311 (https://phabricator.wikimedia.org/T93645) [12:17:00] (03PS2) 10Juniorsys: Linting changes (multiple) [puppet] - 10https://gerrit.wikimedia.org/r/334313 (https://phabricator.wikimedia.org/T93645) [12:17:04] !log hashar@tin Synchronized php-1.29.0-wmf.9/extensions/FlaggedRevs/backend/FlaggedRevision.php: Fix fatal in prod caused by deprecated function removal T156310 (duration: 00m 41s) [12:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:09] T156310: Fatal error: Call to undefined method Revision::getText() in extensions/FlaggedRevs/backend/FlaggedRevision.php on line 480 - https://phabricator.wikimedia.org/T156310 [12:17:12] (03PS2) 10Juniorsys: Linting fixes (multiple modules) [puppet] - 10https://gerrit.wikimedia.org/r/334317 (https://phabricator.wikimedia.org/T93645) [12:17:22] (03PS2) 10Juniorsys: graphoid/gridengine/grub/haproxy/hhvm lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/334319 (https://phabricator.wikimedia.org/T93645) [12:17:30] (03PS2) 10Juniorsys: 
ifttt/imagemagick/initramfs/interface lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/334320 (https://phabricator.wikimedia.org/T93645) [12:17:31] PROBLEM - very high load average likely xfs on ms-be1013 is CRITICAL: Return code of 255 is out of bounds [12:17:31] PROBLEM - swift-container-updater on ms-be1013 is CRITICAL: Return code of 255 is out of bounds [12:17:31] PROBLEM - Disk space on ms-be1013 is CRITICAL: Return code of 255 is out of bounds [12:17:31] PROBLEM - swift-object-replicator on ms-be1013 is CRITICAL: Return code of 255 is out of bounds [12:17:31] PROBLEM - swift-container-auditor on ms-be1013 is CRITICAL: Return code of 255 is out of bounds [12:17:31] PROBLEM - swift-account-auditor on ms-be1013 is CRITICAL: Return code of 255 is out of bounds [12:17:32] PROBLEM - swift-object-auditor on ms-be1013 is CRITICAL: Return code of 255 is out of bounds [12:17:32] PROBLEM - DPKG on ms-be1013 is CRITICAL: Return code of 255 is out of bounds [12:17:33] PROBLEM - configured eth on ms-be1013 is CRITICAL: Return code of 255 is out of bounds [12:17:41] PROBLEM - swift-object-updater on ms-be1013 is CRITICAL: Return code of 255 is out of bounds [12:17:41] PROBLEM - swift-container-server on ms-be1013 is CRITICAL: Return code of 255 is out of bounds [12:17:41] PROBLEM - MD RAID on ms-be1013 is CRITICAL: Return code of 255 is out of bounds [12:17:51] PROBLEM - dhclient process on ms-be1013 is CRITICAL: Return code of 255 is out of bounds [12:17:51] PROBLEM - swift-object-server on ms-be1013 is CRITICAL: Return code of 255 is out of bounds [12:17:51] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1013 is CRITICAL: Return code of 255 is out of bounds [12:17:51] PROBLEM - swift-account-server on ms-be1013 is CRITICAL: Return code of 255 is out of bounds [12:18:01] PROBLEM - swift-account-replicator on ms-be1013 is CRITICAL: Return code of 255 is out of bounds [12:18:11] PROBLEM - Check size of conntrack table on ms-be1013 
is CRITICAL: Return code of 255 is out of bounds [12:18:11] PROBLEM - SSH on ms-be1013 is CRITICAL: connect to address 10.64.48.30 and port 22: Connection refused [12:18:11] PROBLEM - swift-account-reaper on ms-be1013 is CRITICAL: Return code of 255 is out of bounds [12:18:11] PROBLEM - salt-minion processes on ms-be1013 is CRITICAL: Return code of 255 is out of bounds [12:18:11] <_joe_> uhm looks like it needs to be rebooted [12:18:11] PROBLEM - swift-container-replicator on ms-be1013 is CRITICAL: Return code of 255 is out of bounds [12:23:01] RECOVERY - puppet last run on mc1032 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [12:23:10] (03PS1) 10Hashar: group1 wikis back to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334328 (https://phabricator.wikimedia.org/T156310) [12:24:11] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational [12:24:41] RECOVERY - carbon-cache@c service on graphite1003 is OK: OK - carbon-cache@c is active [12:28:38] !log upgrading java on maps cluster, rolling restart of maps cluster in codfw [12:28:42] 06Operations, 10ops-codfw: Codfw: Missing mgmt dns for db2025-db2027 - https://phabricator.wikimedia.org/T156342#2972279 (10Marostegui) Hi, So what I have seen: db2025, db2026 and db2027 are not present in any of our files. So they look decommissioned (found no trace of them on puppet or mediawiki changelog... [12:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:56] !log restarting the aqs1004-a casandra instance to pick up the new openjdk [12:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:05] "casandra" [12:33:41] PROBLEM - cassandra-a CQL 10.64.0.126:9042 on aqs1004 is CRITICAL: connect to address 10.64.0.126 and port 9042: Connection refused [12:34:01] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:34:37] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Wikimedia-Multiple-active-datacenters: Assess SCB@CODFW preparedness for the DC switchover - https://phabricator.wikimedia.org/T156361#2972283 (10akosiaris) [12:34:41] RECOVERY - cassandra-a CQL 10.64.0.126:9042 on aqs1004 is OK: TCP OK - 0.000 second response time on 10.64.0.126 port 9042 [12:34:54] sorry I thought it wouldn't have alerted [12:35:00] it took longer than expected [12:35:00] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Wikimedia-Multiple-active-datacenters: Assess SCB@CODFW preparedness for the DC switchover - https://phabricator.wikimedia.org/T156361#2972300 (10akosiaris) [12:35:02] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare and improve the datacenter switchover procedure - https://phabricator.wikimedia.org/T154658#2972301 (10akosiaris) [12:38:15] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Wikimedia-Multiple-active-datacenters: Assess SCB@CODFW preparedness for the DC switchover - https://phabricator.wikimedia.org/T156361#2972303 (10akosiaris) p:05Triage>03Normal [12:49:08] (03CR) 10Hashar: [C: 032] group1 wikis back to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334328 (https://phabricator.wikimedia.org/T156310) (owner: 10Hashar) [12:49:37] _joe_: (for info) FlaggedRevs is fixed for wmf.9 (tested on testwiki) so I am moving group1 back to wmf.9 [12:49:52] <_joe_> hashar: thanks for handling this [12:50:18] (03Merged) 10jenkins-bot: group1 wikis back to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334328 (https://phabricator.wikimedia.org/T156310) (owner: 10Hashar) [12:50:26] (03CR) 10jenkins-bot: group1 wikis back to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334328 (https://phabricator.wikimedia.org/T156310) (owner: 10Hashar) [12:53:14] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1054" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/334144 [12:54:36] !log restarting the aqs1004-b casandra instance to pick up the new openjdk (last test before complete rollout) [12:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:31] PROBLEM - cassandra-b CQL 10.64.0.127:9042 on aqs1004 is CRITICAL: connect to address 10.64.0.127 and port 9042: Connection refused [12:58:31] RECOVERY - cassandra-b CQL 10.64.0.127:9042 on aqs1004 is OK: TCP OK - 0.000 second response time on 10.64.0.127 port 9042 [12:59:19] (03CR) 10Marostegui: [C: 032] "Server ready and warm to go back to the pool" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334144 (owner: 10Marostegui) [13:00:58] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1054" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334144 (owner: 10Marostegui) [13:01:24] hashar: can I deploy db-eqiad.php? [13:02:01] RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [13:02:08] !log reboot ms-be1013 [13:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:07] marostegui: in a sec [13:03:18] sorry it is taking time :( [13:03:18] hashar: ok :) [13:03:23] hashar: no worries, not in a rush [13:03:41] PROBLEM - Host ms-be1013 is DOWN: PING CRITICAL - Packet loss = 100% [13:04:08] testing [13:05:12] PROBLEM - kartotherian endpoints health on maps2001 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200) [13:05:50] !log hashar@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 back to 1.29.0-wmf.9 T156310 [13:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:54] T156310: Fatal error: Call to undefined method Revision::getText() in 
extensions/FlaggedRevs/backend/FlaggedRevision.php on line 480 - https://phabricator.wikimedia.org/T156310 [13:06:46] marostegui: all good now. Sorry for the delay [13:07:01] hashar: thanks! no worries, again, i was not in a rush :) [13:07:11] RECOVERY - kartotherian endpoints health on maps2001 is OK: All endpoints are healthy [13:07:57] grmrblblblb Empty regular expression in /srv/mediawiki/php-1.29.0-wmf.9/includes/parser/DateFormatter.php on line 200 [13:08:04] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1054 - T156225 (duration: 00m 40s) [13:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:08] T156225: Move db1054 to A3 - https://phabricator.wikimedia.org/T156225 [13:11:41] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1054" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334144 (owner: 10Marostegui) [13:12:47] (03PS1) 10Filippo Giunchedi: prometheus: fix memcached aggregation rules syntax [puppet] - 10https://gerrit.wikimedia.org/r/334329 [13:21:41] PROBLEM - puppet last run on mc1025 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:27:51] RECOVERY - MD RAID on ms-be1013 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [13:27:51] RECOVERY - dhclient process on ms-be1013 is OK: PROCS OK: 0 processes with command name dhclient [13:27:51] RECOVERY - swift-object-server on ms-be1013 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [13:27:51] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1013 is OK: OK ferm input default policy is set [13:28:01] RECOVERY - swift-account-replicator on ms-be1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [13:28:01] RECOVERY - swift-account-server on ms-be1013 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [13:28:01] RECOVERY - Host ms-be1013 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [13:28:11] RECOVERY - Check size of conntrack table on ms-be1013 is OK: OK: nf_conntrack is 7 % full [13:28:11] RECOVERY - SSH on ms-be1013 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [13:28:11] RECOVERY - swift-account-reaper on ms-be1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [13:28:11] RECOVERY - salt-minion processes on ms-be1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:28:11] RECOVERY - swift-container-replicator on ms-be1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [13:28:41] RECOVERY - very high load average likely xfs on ms-be1013 is OK: OK - load average: 16.87, 5.13, 1.78 [13:28:41] RECOVERY - Disk space on ms-be1013 is OK: DISK OK [13:28:41] RECOVERY - swift-object-replicator on ms-be1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [13:28:41] RECOVERY - swift-container-auditor on ms-be1013 is OK: PROCS OK: 1 process 
with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [13:28:41] RECOVERY - swift-container-updater on ms-be1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [13:28:41] RECOVERY - swift-account-auditor on ms-be1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [13:28:42] RECOVERY - configured eth on ms-be1013 is OK: OK - interfaces up [13:28:42] RECOVERY - swift-object-auditor on ms-be1013 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [13:28:43] RECOVERY - DPKG on ms-be1013 is OK: All packages OK [13:28:43] RECOVERY - swift-container-server on ms-be1013 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [13:28:44] RECOVERY - swift-object-updater on ms-be1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [13:29:12] 06Operations, 10ops-eqiad: ms-be1015 idrac not working, no more sessions - https://phabricator.wikimedia.org/T104161#2972424 (10fgiunchedi) I've come across this error again, now documented on wikitech how to fix it: https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_PowerEdge_RN20_Gen8#T... [13:32:00] !log rolling restart of maps cluster in eqiad [13:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:11] PROBLEM - puppet last run on mw1287 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:35:41] RECOVERY - MegaRAID on ms-be1013 is OK: OK: optimal, 13 logical, 13 physical [13:38:06] 06Operations, 10media-storage: refresh swift hardware in codfw/eqiad - https://phabricator.wikimedia.org/T148647#2972445 (10Cmjohnson) @fgiunchedi We have limited space in row A but could accommodate 3 10G servers, row C8 could also add 3 Row D we will have 2 10G racks and could split them 3 and 3 (once the... 
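The `PROCS OK: N processes with regex args ^/usr/bin/python ...` recoveries above come from check_procs-style process checks run against each swift daemon. A hedged sketch of what one such NRPE command definition might look like (the command name, plugin path, and thresholds here are assumptions, not the actual puppet-managed config):

```
# hypothetical NRPE command definition for the swift-object-updater check
command[check_swift_object_updater]=/usr/lib/nagios/plugins/check_procs -c 1:1 --ereg-argument-array='^/usr/bin/python /usr/bin/swift-object-updater'
```

The `-c 1:1` range marks the check CRITICAL unless exactly one matching process is running, which matches the "1 process" wording in the recoveries.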
[13:38:51] 06Operations, 10media-storage: refresh swift hardware in codfw/eqiad - https://phabricator.wikimedia.org/T148647#2972453 (10Cmjohnson) per irc godog cmjohnson1: ok, I'd say then 3x per row 10G where possible and 1G otherwise [13:46:06] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: fix memcached aggregation rules syntax [puppet] - 10https://gerrit.wikimedia.org/r/334329 (owner: 10Filippo Giunchedi) [13:50:41] RECOVERY - puppet last run on mc1025 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [13:50:42] PROBLEM - puppet last run on druid1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:53:07] !log restarting cassandra on aqs100[56] to complete the openjdk update [13:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:15] hashar: should I do eu swat? or do you want to? [13:54:32] !log delete labs 'instances' graphite tree for data >30d, graphite low on disk space [13:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:43] (03PS2) 10Zfilipin: IP Cap Lift for Edit-a-Thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334134 (https://phabricator.wikimedia.org/T156258) (owner: 10Urbanecm) [13:59:28] (03PS1) 10Filippo Giunchedi: graphite: keep labs instance data for 30d [puppet] - 10https://gerrit.wikimedia.org/r/334342 (https://phabricator.wikimedia.org/T143405) [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170126T1400). [14:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
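godog's pair of actions above (pruning the labs 'instances' graphite tree, then capping its retention at 30 days via Gerrit 334342) both concern carbon's retention policy, which lives in storage-schemas.conf. A hedged sketch of the kind of stanza involved (the section name, file path, metric pattern, and resolutions are assumptions, not the actual change):

```
# hypothetical /etc/carbon/storage-schemas.conf stanza
[labs_instances]
pattern = ^instances\.
retentions = 1m:30d
```

First match wins in that file, so a stanza like this has to sit before any broader catch-all pattern.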
[14:00:11] RECOVERY - puppet last run on mw1287 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [14:00:20] I can swat today! [14:00:36] Urbanecm: around for eu swat? [14:01:26] I'm not feeling well... May I ask somebody to handle it? [14:02:02] (03CR) 10Andrew Bogott: [C: 031] graphite: keep labs instance data for 30d [puppet] - 10https://gerrit.wikimedia.org/r/334342 (https://phabricator.wikimedia.org/T143405) (owner: 10Filippo Giunchedi) [14:02:10] zeljkof ping [14:02:49] Urbanecm: well, there is nothing for you to do, right? I will deploy the throttling changes and that is all [14:02:54] sounds good? [14:03:01] Yes. [14:03:07] Thanks. [14:05:53] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334134 (https://phabricator.wikimedia.org/T156258) (owner: 10Urbanecm) [14:06:31] will merge both changes, and deploy them together, since both only touch throttle.php [14:06:38] (03PS4) 10Zfilipin: [throttle] Her Girl Friday + Lenny Unconference / Editathon in NYC, 2017-01-28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334156 (https://phabricator.wikimedia.org/T156278) (owner: 10Urbanecm) [14:07:59] (03Merged) 10jenkins-bot: IP Cap Lift for Edit-a-Thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334134 (https://phabricator.wikimedia.org/T156258) (owner: 10Urbanecm) [14:08:08] (03CR) 10jenkins-bot: IP Cap Lift for Edit-a-Thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334134 (https://phabricator.wikimedia.org/T156258) (owner: 10Urbanecm) [14:08:30] (03Abandoned) 10Marostegui: Revert "site.pp: Enable RBR on db1072" [puppet] - 10https://gerrit.wikimedia.org/r/333948 (owner: 10Marostegui) [14:08:31] 06Operations, 10IDS-extension, 10Wikimedia-Extension-setup, 07I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2972556 (10Shoichi) Excuse me, because our new year is coming at 1/28, everyone in Taiwan is very busy in each business.
It took a lot of days. >_< We h... [14:08:40] (03PS1) 10Elukey: Create a prometheus rule to calculate Memcached get hit ratio [puppet] - 10https://gerrit.wikimedia.org/r/334344 [14:08:53] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334156 (https://phabricator.wikimedia.org/T156278) (owner: 10Urbanecm) [14:09:00] (03CR) 10jerkins-bot: [V: 04-1] [throttle] Her Girl Friday + Lenny Unconference / Editathon in NYC, 2017-01-28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334156 (https://phabricator.wikimedia.org/T156278) (owner: 10Urbanecm) [14:10:06] Urbanecm: argh [14:10:07] (03CR) 10Alexandros Kosiaris: [C: 032] Linting changes (multiple) [puppet] - 10https://gerrit.wikimedia.org/r/334313 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [14:10:09] (03CR) 10Elukey: "this is probably horribly wrong but I tried to follow https://prometheus.io/docs/querying/rules/ and what we currently have in the conf fi" [puppet] - 10https://gerrit.wikimedia.org/r/334344 (owner: 10Elukey) [14:10:13] (03PS3) 10Alexandros Kosiaris: Linting changes (multiple) [puppet] - 10https://gerrit.wikimedia.org/r/334313 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [14:10:16] looks like there is a conflict with 334156 [14:10:17] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Linting changes (multiple) [puppet] - 10https://gerrit.wikimedia.org/r/334313 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [14:10:51] and the task says the IPs are not yet confirmed? 
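elukey's draft recording rule above computes a memcached get hit ratio, and he notes himself that the PromQL may be "horribly wrong". Whatever the final rule syntax, the quantity it records is simply rate(hits) / rate(gets) over a sampling window. A hedged Python illustration of that arithmetic (counter values and the function itself are made up for illustration, not taken from the exporter):

```python
def get_hit_ratio(hits_before, hits_after, gets_before, gets_after):
    """Hit ratio over one sampling window: delta(hits) / delta(gets)."""
    delta_gets = gets_after - gets_before
    if delta_gets <= 0:
        return None  # no gets in the window; the ratio is undefined
    return (hits_after - hits_before) / delta_gets

# 500 additional hits out of 1000 additional gets over the window
print(get_hit_ratio(900, 1400, 1000, 2000))  # 0.5
```

Expressing this as a recording rule just moves the same division into the Prometheus server so dashboards can query the precomputed ratio.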
[14:12:01] oh, the problem is that both 334134 and 334156 change the same lines in throttle.php [14:12:14] ok, will rebase manually [14:15:46] (03PS5) 10Zfilipin: [throttle] Her Girl Friday + Lenny Unconference / Editathon in NYC, 2017-01-28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334156 (https://phabricator.wikimedia.org/T156278) (owner: 10Urbanecm) [14:18:41] RECOVERY - puppet last run on druid1001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [14:21:01] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 520 (expecting: 200) [14:21:01] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 520 (expecting: 200) [14:21:01] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 520 (expecting: 200) [14:21:02] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 520 (expecting: 200) [14:21:02] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 520 (expecting: 200) [14:21:31] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334156 (https://phabricator.wikimedia.org/T156278) (owner: 10Urbanecm) [14:22:46] (03Merged) 10jenkins-bot: [throttle] Her Girl Friday + Lenny Unconference / Editathon in NYC, 2017-01-28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334156 (https://phabricator.wikimedia.org/T156278) (owner: 10Urbanecm) [14:22:55] (03CR) 10jenkins-bot: [throttle] Her 
Girl Friday + Lenny Unconference / Editathon in NYC, 2017-01-28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334156 (https://phabricator.wikimedia.org/T156278) (owner: 10Urbanecm) [14:23:53] anybody working on scb? [14:25:01] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [14:25:01] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [14:25:01] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [14:25:01] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [14:25:02] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [14:26:01] errors on the host are not super good [14:28:50] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:334134|IP Cap Lift for Edit-a-Thon (T156258)]] [[gerrit:334156|[throttle] Her Girl Friday + Lenny Unconference / Editathon in NYC, 2017-01-28 (T156278)]] (duration: 00m 41s) [14:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:54] T156278: Her Girl Friday + Lenny Unconference / Editathon in NYC, 2017-01-28 - throttle rules - https://phabricator.wikimedia.org/T156278 [14:28:55] T156258: IP Cap Lift for Edit-a-Thon on 2017-02-09 - https://phabricator.wikimedia.org/T156258 [14:29:07] Urbanecm: all done [14:29:13] !log finished with eu swat [14:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:56] (03CR) 10Hashar: [C: 031] jenkins: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334288 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [14:32:59] (03CR) 10Hashar: [C: 031] gerrit/git/graphite: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334285 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [14:33:41] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:34:42] (03PS4) 10Alexandros Kosiaris: osm: Add a prometheus textfile exporter [puppet] - 10https://gerrit.wikimedia.org/r/331623 [14:36:41] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [14:38:05] (03CR) 10Alexandros Kosiaris: [C: 032] "Comment incorporated, merging per previous LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/331623 (owner: 10Alexandros Kosiaris) [14:38:13] (03PS5) 10Alexandros Kosiaris: osm: Add a prometheus textfile exporter [puppet] - 10https://gerrit.wikimedia.org/r/331623 [14:38:17] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] osm: Add a prometheus textfile exporter [puppet] - 10https://gerrit.wikimedia.org/r/331623 (owner: 10Alexandros Kosiaris) [14:39:42] 06Operations, 06Operations-Software-Development, 10Pybal, 10Traffic: Unhandled pybal error causing services to be depooled in etcd but not in lvs - https://phabricator.wikimedia.org/T134893#2972682 (10ema) >>! In T134893#2950312, @Volans wrote: [...] > Jan 12 13:32:19 lvs2003 pybal[23011]: [pybal] ERROR:... 
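The two SWAT patches deployed above both append time-boxed account-creation exceptions to wmf-config/throttle.php, which is why they conflicted on the same lines. A hedged Python sketch of the matching logic such an entry implies (field names are modeled loosely on the config; the IPs, dates, and default cap are made up, and this is not the actual MediaWiki code):

```python
from datetime import datetime
from ipaddress import ip_address, ip_network

def throttle_value(exceptions, ip, wiki, when, default=6):
    """Return the account-creation cap that applies to ip/wiki at `when`."""
    for e in exceptions:
        if not e["from"] <= when <= e["to"]:
            continue  # outside the event window
        if e.get("dbname") and wiki not in e["dbname"]:
            continue  # exception scoped to other wikis
        if any(ip_address(ip) in ip_network(r) for r in e["range"]):
            return e["value"]
    return default  # normal cap when no exception matches

# One illustrative entry for an edit-a-thon window
exceptions = [{
    "from": datetime(2017, 1, 28), "to": datetime(2017, 1, 29),
    "range": ["203.0.113.0/24"], "dbname": ["enwiki"], "value": 50,
}]
print(throttle_value(exceptions, "203.0.113.7", "enwiki", datetime(2017, 1, 28, 12)))  # 50
```

Because every event adds its entry at the same place in the file, two pending throttle patches routinely need the manual rebase zeljkof performed here.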
[14:40:12] (03PS1) 10Muehlenhoff: Update to 1.1.0d [debs/openssl11] - 10https://gerrit.wikimedia.org/r/334350 [14:42:54] (03CR) 10Rush: [C: 031] graphite: keep labs instance data for 30d [puppet] - 10https://gerrit.wikimedia.org/r/334342 (https://phabricator.wikimedia.org/T143405) (owner: 10Filippo Giunchedi) [14:49:15] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2972713 (10jcrespo) [14:50:13] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2972739 (10jcrespo) p:05Triage>03High [14:51:11] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2972713 (10jcrespo) I will backup current dataset and recover the last backup into db1048. 
[14:52:41] !log stopping mysql on db1048 T156373 [14:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:46] T156373: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373 [14:52:57] (03PS1) 10Andrew Bogott: Horizon: In mitaka, remove some instance-creation workflow steps [puppet] - 10https://gerrit.wikimedia.org/r/334353 [14:54:53] (03CR) 10Alexandros Kosiaris: [C: 032] helium jessie DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/334305 (owner: 10Alexandros Kosiaris) [14:54:58] (03PS2) 10Alexandros Kosiaris: helium jessie DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/334305 [14:55:00] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] helium jessie DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/334305 (owner: 10Alexandros Kosiaris) [14:55:40] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2972746 (10Marostegui) From looking at db2012 (uses GTID). When starting slave it always tries to start at the sa... 
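On the GTID observation above (a replica repeatedly trying to resume from the same position): a hedged sketch of the kind of MariaDB session used to inspect, and if necessary override, where a crashed replica resumes. The position value is a placeholder, and every statement here is illustrative rather than the procedure actually run on db2012:

```sql
SHOW SLAVE STATUS\G              -- Using_Gtid / Gtid_IO_Pos show where it will resume
SELECT @@GLOBAL.gtid_slave_pos;  -- the replica's recorded GTID position
STOP SLAVE;
SET GLOBAL gtid_slave_pos = '0-123456-7890';  -- placeholder position
START SLAVE;
```

Forcing the position past a broken transaction trades consistency for availability, which is why the team instead restored db1048 from backup.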
[14:55:41] PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [14:56:11] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [14:56:43] (03PS2) 10Andrew Bogott: Horizon: In mitaka, remove some instance-creation workflow steps [puppet] - 10https://gerrit.wikimedia.org/r/334353 [14:57:41] ^ jynus my thought is this is from the log line a few above :) [14:58:21] ah, yeah, you can ignore [14:58:36] I shut down the broken server [14:58:43] let me ack [14:59:48] ACKNOWLEDGEMENT - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Jcrespo T156373 [14:59:48] ACKNOWLEDGEMENT - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Jcrespo T156373 [15:01:32] 06Operations, 10Traffic: Select site vendor for Asia Cache Datacenter - https://phabricator.wikimedia.org/T156030#2972767 (10BBlack) [15:02:00] I'm so happy that this is moving [15:02:08] (03CR) 10Andrew Bogott: [C: 032] Horizon: In mitaka, remove some instance-creation workflow steps [puppet] - 10https://gerrit.wikimedia.org/r/334353 (owner: 10Andrew Bogott) [15:02:41] PROBLEM - carbon-cache@c service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is failed [15:03:11] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:03:45] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2972776 (10BBlack) [15:03:48] 06Operations, 10Traffic, 10Wikimedia-Blog, 07HTTPS: make blog links from wmfwiki front page use HTTPS links - https://phabricator.wikimedia.org/T104728#2972777 (10BBlack) [15:03:51] 06Operations, 10Traffic, 10Wikimedia-Blog, 07HTTPS: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2972773 (10BBlack) 05Open>03Resolved a:03BBlack Confirmed correct current operation: 1) All HTTP access seems to redirect to HTTPS 2) All HTTPS requests send response header: `str... [15:03:58] 06Operations, 10Traffic, 10Wikimedia-Blog, 07HTTPS: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2972778 (10BBlack) [15:05:41] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 50 seconds ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/share/openstack-dashboard/openstack_dashboard/dashboards/project/static/dashboard/project/workflow/launch-instance/launch-instance-workflow.service.js] [15:06:29] !log installing gnupg updates from jessie point update [15:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:41] RECOVERY - carbon-cache@c service on graphite1003 is OK: OK - carbon-cache@c is active [15:07:11] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational [15:07:45] (03PS2) 10Filippo Giunchedi: graphite: keep labs instance data for 30d [puppet] - 10https://gerrit.wikimedia.org/r/334342 (https://phabricator.wikimedia.org/T143405) [15:09:02] (03PS1) 10Andrew Bogott: Horizon: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/334358 [15:09:32] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] graphite: keep labs instance data for 30d [puppet] - 10https://gerrit.wikimedia.org/r/334342 (https://phabricator.wikimedia.org/T143405) (owner: 10Filippo Giunchedi) [15:09:55] 06Operations, 10Traffic, 10Wikimedia-Shop, 07HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559#2972787 (10BBlack) Any updates here? What we're asking for here is a modern HTTPS-only configuration. I'd think an e-commerce vendor would be all about that... [15:10:56] 06Operations, 10Traffic, 10fundraising-tech-ops: Fix nits in Fundraising HTTPS/HSTS configs in wikimedia.org domain - https://phabricator.wikimedia.org/T137161#2972789 (10BBlack) What about benefactorevents / eventdonations? 
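The HTTPS/HSTS tickets above (T132521, T105905, T128559, T137161) all ask the same thing of each one-off site: redirect all HTTP traffic to HTTPS and emit a Strict-Transport-Security header. A hedged Apache-style sketch of that pattern (hostname and directive values are illustrative; the real per-site configs live elsewhere):

```
# hypothetical vhost fragment: force HTTPS, then pin it with HSTS
<VirtualHost *:80>
    Redirect permanent / https://example.wikimedia.org/
</VirtualHost>
<VirtualHost *:443>
    Header always set Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
</VirtualHost>
```

The blog verification steps quoted above are exactly the two halves of this pattern: every HTTP request redirects, and every HTTPS response carries the header.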
[15:13:08] (03PS2) 10Andrew Bogott: Horizon: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/334358 [15:13:28] (03CR) 10Muehlenhoff: [C: 032] Update to 1.1.0d [debs/openssl11] - 10https://gerrit.wikimedia.org/r/334350 (owner: 10Muehlenhoff) [15:14:09] (03PS26) 10BBlack: cache_misc app_directors/req_handling split [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717) [15:14:11] (03PS26) 10BBlack: cache_misc req_handling: sort entries [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717) [15:14:13] (03PS27) 10BBlack: cache_misc req_handling: subpaths, cache policy, defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) [15:14:15] (03PS11) 10BBlack: cache_misc: stream.wm.o subpathing for eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/327550 (https://phabricator.wikimedia.org/T143925) [15:14:23] jouncebot: next [15:14:23] In 1 hour(s) and 45 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170126T1700) [15:15:03] !log puppet disabled on cache_misc for merging complicated stuff [15:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:16] !log T156242 add /dev/sdb partitions to mdadm devices [15:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:20] T156242: Degraded RAID on mw2256 - https://phabricator.wikimedia.org/T156242 [15:16:15] (03CR) 10BBlack: [V: 032 C: 032] cache_misc app_directors/req_handling split [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [15:17:34] hi madhuvishy can you help to fix a mediawiki bug in a private wiki? 
Opening Special:RecentChanges gives the error: Call to undefined method LBFactory::singleton() [15:18:11] (03PS1) 10Muehlenhoff: Handle new libssl1.1 symbols introduced in 1.1.0d [debs/openssl11] - 10https://gerrit.wikimedia.org/r/334361 [15:20:01] PROBLEM - puppet last run on multatuli is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:21:30] (03PS1) 10BBlack: post-merge fixup for 8888fea7 [puppet] - 10https://gerrit.wikimedia.org/r/334362 [15:21:42] (03CR) 10BBlack: [V: 032 C: 032] post-merge fixup for 8888fea7 [puppet] - 10https://gerrit.wikimedia.org/r/334362 (owner: 10BBlack) [15:22:41] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 31 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [15:25:32] (03PS27) 10BBlack: cache_misc req_handling: sort entries [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717) [15:26:15] (03CR) 10BBlack: [V: 032 C: 032] cache_misc req_handling: sort entries [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [15:27:30] (03CR) 10Andrew Bogott: [C: 032] Horizon: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/334358 (owner: 10Andrew Bogott) [15:27:36] (03PS3) 10Andrew Bogott: Horizon: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/334358 [15:27:54] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2972833 (10Marostegui) This is the original crash: ``` 2017-01-26 02:00:22 7fa9cebf6700 InnoDB: FTS Optimize Rem... 
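On the `Call to undefined method LBFactory::singleton()` fatal asked about above: that static accessor was removed from newer MediaWiki cores in favour of the service container, so older code (typically an outdated extension) that still calls it breaks after a core upgrade. A hedged sketch of the usual replacement (class and method names follow MediaWiki core, but treat this as illustrative, not a patch for the wiki in question):

```php
use MediaWiki\MediaWikiServices;

// old (removed): $factory = LBFactory::singleton();
$factory = MediaWikiServices::getInstance()->getDBLoadBalancerFactory();
$dbr = $factory->getMainLB()->getConnection( DB_REPLICA );
```

The fix for a private wiki hitting this is normally to update the extension to a branch matching the core version rather than to patch core.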
[15:27:55] (03Abandoned) 10BBlack: cache_misc req_handling: subpaths and defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300655 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [15:28:19] (03PS28) 10BBlack: cache_misc req_handling: subpaths, cache policy, defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) [15:29:17] (03CR) 10BBlack: [V: 032 C: 032] cache_misc req_handling: subpaths, cache policy, defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [15:30:41] RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [15:32:17] (03PS12) 10BBlack: cache_misc: stream.wm.o subpathing for eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/327550 (https://phabricator.wikimedia.org/T143925) [15:32:24] (03CR) 10BBlack: [V: 032 C: 032] cache_misc: stream.wm.o subpathing for eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/327550 (https://phabricator.wikimedia.org/T143925) (owner: 10BBlack) [15:32:41] (03PS1) 10Filippo Giunchedi: graphite: add Restart / RestartSec for graphite daemons [puppet] - 10https://gerrit.wikimedia.org/r/334364 (https://phabricator.wikimedia.org/T155876) [15:33:35] (03PS4) 10Andrew Bogott: Horizon: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/334358 [15:34:09] \o/... 
rcstream may be deprecated sooner rather than later :-) [15:35:28] \o/ [15:35:58] !log installing gnupg2 updates from jessie point update [15:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:42] (03PS1) 10Andrew Bogott: New labs instances: Fail more obviously if we have DNS issues [puppet] - 10https://gerrit.wikimedia.org/r/334365 [15:37:02] (03PS9) 10Giuseppe Lavagetto: conf2xx: install etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/334127 (https://phabricator.wikimedia.org/T156009) [15:37:29] !log bounce uwsgi on graphite1003 with less workers - T155872 [15:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:33] T155872: graphite1003 short of available RAM - https://phabricator.wikimedia.org/T155872 [15:38:03] (03PS2) 10Muehlenhoff: Handle new libssl1.1 symbols introduced in 1.1.0d [debs/openssl11] - 10https://gerrit.wikimedia.org/r/334361 [15:41:41] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:42:02] (03PS10) 10Giuseppe Lavagetto: conf2xx: install etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/334127 (https://phabricator.wikimedia.org/T156009) [15:43:24] !log cache_misc puppet re-enabled and up to date [15:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:11] (03PS11) 10Giuseppe Lavagetto: conf2xx: install etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/334127 (https://phabricator.wikimedia.org/T156009) [15:44:27] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] conf2xx: install etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/334127 (https://phabricator.wikimedia.org/T156009) (owner: 10Giuseppe Lavagetto) [15:48:01] RECOVERY - puppet last run on multatuli is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:49:01] PROBLEM - puppet last run on conf2001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [15:54:20] (03PS1) 10Papaul: DNS: Add mgmt and production dns for mc2019-mc2036 Bug:T155755 [dns] - 10https://gerrit.wikimedia.org/r/334366 [15:54:29] (03PS1) 10Alexandros Kosiaris: osm prometheus: Fix bug with hardcoded state file [puppet] - 10https://gerrit.wikimedia.org/r/334367 [15:54:49] (03PS3) 10Muehlenhoff: Handle new libssl1.1 symbols introduced in 1.1.0d [debs/openssl11] - 10https://gerrit.wikimedia.org/r/334361 [15:55:25] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] osm prometheus: Fix bug with hardcoded state file [puppet] - 10https://gerrit.wikimedia.org/r/334367 (owner: 10Alexandros Kosiaris) [15:55:29] (03PS1) 10Ema: etcd.py: log a warning on empty responses from etcd [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/334369 (https://phabricator.wikimedia.org/T134893) [15:55:32] (03PS2) 10Alexandros Kosiaris: osm prometheus: Fix bug with hardcoded state file [puppet] - 10https://gerrit.wikimedia.org/r/334367 [15:55:43] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] osm prometheus: Fix bug with hardcoded state file [puppet] - 10https://gerrit.wikimedia.org/r/334367 (owner: 10Alexandros Kosiaris) [15:59:21] 06Operations, 10Traffic: convert stream.wikimedia.org from GS to LE certificate - https://phabricator.wikimedia.org/T155524#2972927 (10BBlack) stream.wikimedia.org is part of cache_misc now, so if we have an expiring certificate here, I don't think we need to replace it. 
[16:00:48] (03PS1) 10Giuseppe Lavagetto: profile::etcd::tlsproxy: depend on nginx-full, not nginx [puppet] - 10https://gerrit.wikimedia.org/r/334370 [16:02:20] 06Operations, 10ops-codfw: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2972975 (10Papaul) [16:02:54] (03PS2) 10Giuseppe Lavagetto: profile::etcd::tlsproxy: depend on nginx-full, not nginx [puppet] - 10https://gerrit.wikimedia.org/r/334370 [16:03:13] 06Operations, 06Analytics-Kanban, 10EventBus, 10Traffic, and 2 others: Productionize and deploy Public EventStreams - https://phabricator.wikimedia.org/T143925#2972976 (10BBlack) cache_misc for this are all implemented and live now. The [[ https://github.com/wikimedia/operations-puppet/blob/production/mod... [16:03:33] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::etcd::tlsproxy: depend on nginx-full, not nginx [puppet] - 10https://gerrit.wikimedia.org/r/334370 (owner: 10Giuseppe Lavagetto) [16:03:45] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2972979 (10Paladox) Looking at the above ^^, upstream have created a new phabricator_search table. So we no longe... [16:05:40] (03CR) 10Muehlenhoff: [C: 032] Handle new libssl1.1 symbols introduced in 1.1.0d [debs/openssl11] - 10https://gerrit.wikimedia.org/r/334361 (owner: 10Muehlenhoff) [16:07:44] 06Operations, 06Analytics-Kanban, 10EventBus, 10Traffic, and 2 others: Productionize and deploy Public EventStreams - https://phabricator.wikimedia.org/T143925#2972982 (10Ottomata) YESSSSSSSSSSSSSSSSS awesome! Thank you! [16:08:11] PROBLEM - Check systemd state on conf2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:09:22] jynus and marostegui hi, I believe you have found a bug in phabricator's search. 
See my comment at https://phabricator.wikimedia.org/T156373#2972979 [16:09:50] paladox: I was about to reply [16:09:53] just finishing my last test [16:09:54] oh [16:09:57] ok [16:12:53] (03PS1) 10Giuseppe Lavagetto: profile::etcd::tlsproxy: fix template [puppet] - 10https://gerrit.wikimedia.org/r/334372 [16:13:37] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::etcd::tlsproxy: fix template [puppet] - 10https://gerrit.wikimedia.org/r/334372 (owner: 10Giuseppe Lavagetto) [16:14:11] 06Operations, 10ops-codfw, 15User-Elukey: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2972992 (10elukey) [16:14:30] 06Operations, 10netops: pfws not on librenms - https://phabricator.wikimedia.org/T156381#2973000 (10ema) [16:16:01] RECOVERY - puppet last run on conf2001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [16:16:11] RECOVERY - Check systemd state on conf2001 is OK: OK - running: The system is fully operational [16:16:45] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2973018 (10Marostegui) >>! In T156373#2972979, @Paladox wrote: > Looking at the above ^^, upstream have created a... [16:20:37] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2973027 (10Paladox) >>! In T156373#2973018, @Marostegui wrote: >>>! In T156373#2972979, @Paladox wrote: >> Lookin... [16:26:24] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2973052 (10Paladox) I found this bug report https://jira.mariadb.org/browse/MDEV-11233 that looks very similar to... 
[16:29:00] 06Operations, 10Traffic: convert stream.wikimedia.org from GS to LE certificate - https://phabricator.wikimedia.org/T155524#2973059 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/334207/ [16:30:41] PROBLEM - puppet last run on labsdb1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:31:38] (03PS2) 10Ema: Text VCL: consolidate mobile hostname rewrite regex [puppet] - 10https://gerrit.wikimedia.org/r/333158 (https://phabricator.wikimedia.org/T155504) [16:34:31] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors [16:34:54] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2973070 (10Marostegui) >>! In T156373#2973052, @Paladox wrote: > I found this bug report https://jira.mariadb.org... [16:36:16] (03CR) 10Eevans: [C: 031] "You'll need the ferm rules in place on the other hosts before the bootstrap of 1007-a will succeed, so order the puppet updates accordingl" [puppet] - 10https://gerrit.wikimedia.org/r/334035 (https://phabricator.wikimedia.org/T155654) (owner: 10Elukey) [16:36:34] (03PS9) 10BryanDavis: Provision MediaWiki-Vagrant on Jessie hosts [puppet] - 10https://gerrit.wikimedia.org/r/245920 (https://phabricator.wikimedia.org/T154340) [16:36:55] (03CR) 10jerkins-bot: [V: 04-1] Provision MediaWiki-Vagrant on Jessie hosts [puppet] - 10https://gerrit.wikimedia.org/r/245920 (https://phabricator.wikimedia.org/T154340) (owner: 10BryanDavis) [16:36:59] (03PS9) 10Elukey: Add aqs1007 to site.pp and bootstrap aqs1007-a [puppet] - 10https://gerrit.wikimedia.org/r/334035 (https://phabricator.wikimedia.org/T155654) [16:37:38] (03PS10) 10BryanDavis: Provision MediaWiki-Vagrant on Jessie hosts [puppet] - 10https://gerrit.wikimedia.org/r/245920 (https://phabricator.wikimedia.org/T154340) [16:38:14] (03PS1) 
10BBlack: labtesthorizon: bugfix cache_misc routing [puppet] - 10https://gerrit.wikimedia.org/r/334375 [16:38:24] (03CR) 10BBlack: [V: 032 C: 032] labtesthorizon: bugfix cache_misc routing [puppet] - 10https://gerrit.wikimedia.org/r/334375 (owner: 10BBlack) [16:41:54] (03PS11) 10BryanDavis: Provision MediaWiki-Vagrant on Jessie hosts [puppet] - 10https://gerrit.wikimedia.org/r/245920 (https://phabricator.wikimedia.org/T154340) [16:43:22] 06Operations, 10Analytics, 10netops, 13Patch-For-Review: Open temporary access from analytics vlan to new-labsdb one - https://phabricator.wikimedia.org/T155487#2973097 (10Nuria) [16:43:23] (03PS10) 10Elukey: Add aqs1007 to site.pp and bootstrap aqs1007-a [puppet] - 10https://gerrit.wikimedia.org/r/334035 (https://phabricator.wikimedia.org/T155654) [16:46:11] (03CR) 10Ottomata: [C: 031] ssl: delete stream.wikimedia.org cert [puppet] - 10https://gerrit.wikimedia.org/r/334207 (owner: 10Dzahn) [16:47:32] (03CR) 10BBlack: [C: 031] "Thanks for fixing up all the initial errors!" [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack) [16:50:00] 06Operations, 10Analytics, 10netops, 13Patch-For-Review: Open temporary access from analytics vlan to new-labsdb one - https://phabricator.wikimedia.org/T155487#2973154 (10elukey) 05Open>03Resolved [16:56:29] 06Operations, 10netops: pfws not on librenms - https://phabricator.wikimedia.org/T156381#2973000 (10faidon) pfw1 & pfw2 are members of each pair. They act as one control plane, so there is no reason (or way!) to add pfw1 and pfw2 separately. LibreNMS lists pfw-eqiad as "pfw1-eqiad" as before I fixed it there w... 
[16:57:26] (03PS1) 10Giuseppe Lavagetto: profile::etcd: listen on localhost for clients if a TLS proxy is present [puppet] - 10https://gerrit.wikimedia.org/r/334376 [16:58:41] RECOVERY - puppet last run on labsdb1007 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:58:46] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2973202 (10Marostegui) The alter table worked on 10.1.21. So to recap: 10.0.23 -> works 10.0.28 -> crashes 10.0... [17:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170126T1700). [17:00:04] ostriches and Reedy: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [17:14:49] ostriches, theres a mariadb connector https://github.com/MariaDB/mariadb-connector-j [17:14:51] 06Operations, 06Release-Engineering-Team, 13Patch-For-Review: Pages with an &stable=1 in their URL could not be viewed or edited - https://phabricator.wikimedia.org/T156356#2973299 (10greg) [17:15:07] i just found it lol, i wonder what differences it has [17:15:49] (03PS2) 10Giuseppe Lavagetto: profile::etcd: listen on localhost for clients if a TLS proxy is present [puppet] - 10https://gerrit.wikimedia.org/r/334376 [17:16:29] ostriches we should switch to it :), it is compatible with mysql too [17:16:46] (03PS5) 10Dzahn: icinga/wikitech-static: add contact_group for https monitor [puppet] - 10https://gerrit.wikimedia.org/r/334220 (https://phabricator.wikimedia.org/T156294) [17:17:36] (03PS6) 10Dzahn: icinga/wikitech-static: add contact_group for https monitor [puppet] - 10https://gerrit.wikimedia.org/r/334220 (https://phabricator.wikimedia.org/T156294) [17:21:46] (03PS3) 10Giuseppe Lavagetto: profile::etcd: listen 
on localhost for clients if a TLS proxy is present [puppet] - 10https://gerrit.wikimedia.org/r/334376 [17:22:04] (03CR) 10Dzahn: [C: 032] icinga/wikitech-static: add contact_group for https monitor [puppet] - 10https://gerrit.wikimedia.org/r/334220 (https://phabricator.wikimedia.org/T156294) (owner: 10Dzahn) [17:24:20] (03PS4) 10Giuseppe Lavagetto: profile::etcd: listen on localhost for clients if a TLS proxy is present [puppet] - 10https://gerrit.wikimedia.org/r/334376 [17:24:36] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::etcd: listen on localhost for clients if a TLS proxy is present [puppet] - 10https://gerrit.wikimedia.org/r/334376 (owner: 10Giuseppe Lavagetto) [17:25:17] Hi bblack ema! How's it going? I'd like to bug you once again w/ a Varnish question, whenever you have a sec... Question is: Is the correct way to trigger a Varnish purge of some URLs just to send them in the constructor of CdnCacheUpdate, then call doUpdate()? Need it to work with a few different combinations of possible URL params, but not purge all cache values for the page... Also not sure what [17:25:19] to do with the protocol [17:26:07] This is for a feature in CentralNotice to refresh the cache for specific CN banners, which are loaded via Special:BannerLoader on meta [17:27:18] thx in advance! [17:28:11] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:29:24] AndyRussG: honestly, I don't know, that's more of a MediaWiki question [17:29:46] AndyRussG: (what happens on CdnCacheUpdate in terms of input URLs or keys and output PURGE URLs) [17:30:17] Hmmm K [17:30:49] AndyRussG: I know it's normally used for articles, and there are hooks to transform/copy it (e.g. purging /wiki/Foo on desktop hostname has a hook that also does it on mobile hostname) [17:31:09] AndyRussG: so I'm not sure if it's just a generic URL mechanism, or specific to article names and such... 
[17:31:20] Hmm [17:31:23] I guess I should git blame to see who worked on that code... or any suggestions on specifically whom to ping? [17:31:45] !log elukey@tin Starting deploy [analytics/aqs/deploy@5917fd4]: (no message) [17:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:58] AndyRussG: I'd look at the code, personally [17:32:07] Yeah [17:33:02] https://github.com/wikimedia/mediawiki/blob/aeedfb8526e9d221553e430437a7572a6da2ba65/tests/phpunit/includes/deferred/CdnCacheUpdateTest.php [17:33:10] ^ unit tests showing off some use of it [17:33:46] Oh cool [17:33:58] Yeah I was digging into the code but wasn't clear on a few things [17:34:10] !log elukey@tin Finished deploy [analytics/aqs/deploy@5917fd4]: (no message) (duration: 02m 25s) [17:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:53] Same looking at maintenance/purgeList.php... But that test seems to have answers, at least about params! [17:35:11] 06Operations, 10ops-eqiad, 10hardware-requests, 13Patch-For-Review: Decommission labsdb1002 - https://phabricator.wikimedia.org/T146455#2973356 (10Cmjohnson) labsdb1002 is one of the remaining Cisco UCS servers. [17:37:39] (03CR) 10Elukey: [C: 032] Add aqs1007 to site.pp and bootstrap aqs1007-a [puppet] - 10https://gerrit.wikimedia.org/r/334035 (https://phabricator.wikimedia.org/T155654) (owner: 10Elukey) [17:37:45] (03PS11) 10Elukey: Add aqs1007 to site.pp and bootstrap aqs1007-a [puppet] - 10https://gerrit.wikimedia.org/r/334035 (https://phabricator.wikimedia.org/T155654) [17:37:54] Hmmm now I'm not sure what that test's doing (seems the results from update2 aren't checked...) 
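[Editor's note on the purge discussion above: at the HTTP level, what a Varnish purge ultimately boils down to is a request with the PURGE method for the exact URL to invalidate, query string included, which is why each combination of URL params needs its own purge and why CdnCacheUpdate accepts a flat list of fully-formed URLs. A minimal sketch of what such a request looks like; the helper name is hypothetical, and WMF production purging at the time went over a multicast HTCP mechanism rather than direct HTTP PURGE, so treat this as the generic Varnish picture only:]

```python
from urllib.parse import urlparse

def build_purge_request(url):
    """Build the raw HTTP request that a PURGE of `url` would look like
    when sent to a Varnish frontend. Hypothetical helper for illustration;
    not MediaWiki's or WMF's actual purge path."""
    parts = urlparse(url)
    path = parts.path or "/"
    if parts.query:
        # Purges match the exact URL, query string and all: purging
        # Special:BannerLoader?banner=A does not touch ?banner=B.
        path += "?" + parts.query
    return (
        f"PURGE {path} HTTP/1.1\r\n"
        f"Host: {parts.netloc}\r\n"
        f"Connection: close\r\n"
        f"\r\n"
    )

req = build_purge_request(
    "https://meta.wikimedia.org/wiki/Special:BannerLoader?banner=Example&uselang=en")
print(req.splitlines()[0])
```

[Note that the scheme is dropped when the request line is built: the cache object is typically keyed on Host plus path rather than protocol, which may be one answer to the "what to do with the protocol" question above, though that depends on the VCL hashing in use, so it is an assumption here.]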
[17:39:14] (03PS1) 10Giuseppe Lavagetto: profile::etcd::tlsproxy: minor fixes [puppet] - 10https://gerrit.wikimedia.org/r/334385 [17:39:24] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2972713 (10greg) From the description, have the phab statistics crons been disabled? [17:40:39] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::etcd::tlsproxy: minor fixes [puppet] - 10https://gerrit.wikimedia.org/r/334385 (owner: 10Giuseppe Lavagetto) [17:42:19] (03CR) 10Elukey: [V: 032 C: 032] Add aqs1007 to site.pp and bootstrap aqs1007-a [puppet] - 10https://gerrit.wikimedia.org/r/334035 (https://phabricator.wikimedia.org/T155654) (owner: 10Elukey) [17:42:27] (03PS12) 10Elukey: Add aqs1007 to site.pp and bootstrap aqs1007-a [puppet] - 10https://gerrit.wikimedia.org/r/334035 (https://phabricator.wikimedia.org/T155654) [17:42:41] (03CR) 10Mobrovac: [C: 031] Text VCL: consolidate mobile hostname rewrite regex [puppet] - 10https://gerrit.wikimedia.org/r/333158 (https://phabricator.wikimedia.org/T155504) (owner: 10Ema) [17:44:21] (03CR) 10Elukey: [V: 032 C: 032] Add aqs1007 to site.pp and bootstrap aqs1007-a [puppet] - 10https://gerrit.wikimedia.org/r/334035 (https://phabricator.wikimedia.org/T155654) (owner: 10Elukey) [17:44:50] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2973390 (10Paladox) >>! In T156373#2973202, @Marostegui wrote: > The alter table worked on 10.1.21. > > So to re... 
[17:46:43] !log stopping pybal on lvs1001/lvs1002/lvs1003 [17:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:01] PROBLEM - PyBal backends health check on lvs1001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [17:49:01] PROBLEM - PyBal backends health check on lvs1002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [17:49:11] PROBLEM - pybal on lvs1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [17:49:21] PROBLEM - pybal on lvs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [17:49:41] PROBLEM - pybal on lvs1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [17:49:41] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [17:51:01] PROBLEM - Host labstore1004 is DOWN: PING CRITICAL - Packet loss = 100% [17:51:11] PROBLEM - Host analytics1031 is DOWN: PING CRITICAL - Packet loss = 100% [17:51:14] !log replacing asw-c2-eqiad [17:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:25] these are all expected [17:51:31] PROBLEM - Host analytics1029 is DOWN: PING CRITICAL - Packet loss = 100% [17:51:31] PROBLEM - Host analytics1030 is DOWN: PING CRITICAL - Packet loss = 100% [17:51:31] PROBLEM - Host analytics1028 is DOWN: PING CRITICAL - Packet loss = 100% [17:51:32] ping me explicitly for anything unexpected [17:51:41] PROBLEM - Host es1015 is DOWN: PING CRITICAL - Packet loss = 100% [17:51:51] PROBLEM - Host db1060 is DOWN: PING CRITICAL - Packet loss = 100% [17:51:51] PROBLEM - Host db1088 is DOWN: PING CRITICAL - Packet loss = 100% [17:51:51] PROBLEM - Host db1055 is DOWN: PING CRITICAL - Packet loss = 100% [17:51:51] PROBLEM - Host db1056 is DOWN: PING CRITICAL - Packet loss = 100% [17:51:51] PROBLEM - Host 
db1059 is DOWN: PING CRITICAL - Packet loss = 100% [17:51:52] PROBLEM - Host db1057 is DOWN: PING CRITICAL - Packet loss = 100% [17:51:52] PROBLEM - Host db1087 is DOWN: PING CRITICAL - Packet loss = 100% [17:51:53] PROBLEM - Host es1016 is DOWN: PING CRITICAL - Packet loss = 100% [17:51:53] PROBLEM - puppet last run on aqs1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:01] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100% [17:52:01] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100% [17:52:01] PROBLEM - Host ms-be1007 is DOWN: PING CRITICAL - Packet loss = 100% [17:52:01] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 228, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/0/2: down - Core: asw-c-eqiad:xe-2/0/0 {#3458} [10Gbps DF]BR [17:52:02] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 208, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/0/2: down - Core: asw-c-eqiad:xe-2/1/2 {#3464} [10Gbps DF]BR [17:52:11] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:11] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:11] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:11] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:11] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:12] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:12] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:52:13] (03PS1) 10Giuseppe Lavagetto: profile::etcd::tlsproxy: add etcd-like 401 response [puppet] - 10https://gerrit.wikimedia.org/r/334387 [17:52:20] <_joe_> ouch [17:52:21] PROBLEM - configured eth on lvs1003 is CRITICAL: eth2 reporting no carrier. [17:52:21] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:26] <_joe_> what happened to those machines? [17:52:41] PROBLEM - configured eth on lvs1002 is CRITICAL: eth2 reporting no carrier. [17:52:47] <_joe_> anyways, later [17:53:01] PROBLEM - configured eth on lvs1001 is CRITICAL: eth2 reporting no carrier. [17:53:19] 5xx on the rise btw [17:53:23] which machines? [17:53:31] <_joe_> scb* [17:53:39] good question [17:53:40] <_joe_> it's just mobileapps though [17:53:41] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [17:53:50] why is this causing 500s :( [17:54:02] ? [17:54:03] hmmm [17:54:04] <_joe_> enwiki is pretty slow for me now [17:54:11] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [17:54:20] mostly api afaics [17:54:28] well mobileapps all went out [17:54:41] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:54:42] RECOVERY - puppet last run on aqs1007 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [17:54:51] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [17:54:52] scb100[1234] [17:55:03] <_joe_> bblack: yes [17:55:11] if the mw api is out, then that's expected [17:55:11] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:55:20] same ^ [17:55:21] <_joe_> so api is not out at all [17:55:25] <_joe_> lost some traffic [17:55:30] wth? 
[17:55:50] mobileapps all endpoints healthy on scb1001 [17:56:01] says socket timeout [17:56:06] <_joe_> mobrovac: uhm sounds like icinga then? [17:56:10] <_joe_> https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=API%2520application%2520servers%2520eqiad&tab=m&vn=&hide-hf=false [17:56:15] <_joe_> or maybe it's just slow [17:56:16] checking in codfw now [17:56:16] doesn't look like icinga [17:56:25] why would it be icinga [17:56:35] <_joe_> mobrovac: it might be slower than 10 seconds to check? [17:56:41] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [17:56:43] yes, it is [17:56:52] so why is it slow? [17:57:11] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [17:57:15] <_joe_> bblack: I have no idea, but when navigating some uncached pages the site was slow as well [17:57:26] well I'm sure, but the question is the root [17:57:31] <_joe_> yes [17:57:41] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:57:41] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [17:57:43] !log boostrapping aqs1007-a cassandra instance [17:57:45] something didn't like the es/db downtimes? [17:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:05] or hanging up sending stuff to dead analytics endpoints? [17:58:11] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:58:11] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:58:13] <_joe_> bblack: see the sudden drop in traffic to the API [17:58:15] no, these aren't analytics endpoints [17:58:25] just hadoop workers [17:58:31] <_joe_> I'm not sure how we can explain that [17:58:52] maybe I should start killing random racks every week or so :) [17:59:09] if only we had a clear and pretty one-page picture of which services call which services for dependencies! [17:59:11] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [17:59:11] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [17:59:11] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [17:59:11] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:59:11] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:59:23] most influx of new errors in logstash is from dbconnection and dbreplication channels btw [17:59:26] lol restbase [17:59:28] RB is obviously in the loop somewhere, but who knows cause/effect [17:59:41] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1057.eqiad.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on db1057.eqiad.wmnet (110 Connection timed out) [17:59:49] jynus: ^^^ [17:59:51] <_joe_> ok look at the network of the scb cluster [17:59:57] <_joe_> https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Service%2520Cluster%2520B%2520eqiad&tab=m&vn=&hide-hf=false [18:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170126T1800). Please do the needful. 
[18:00:11] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:00:11] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [18:00:11] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [18:00:12] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:00:14] <_joe_> https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Service%2520Cluster%2520B%2520codfw&tab=m&vn=&hide-hf=false [18:00:18] Nothing for ORES today. But soooon. :) [18:00:18] <_joe_> both eqiad and codfw [18:00:47] <_joe_> ok most of the drop is from CP, it seems [18:00:48] mobileapps responds to requests, but very slowly [18:00:53] <_joe_> as in CP-generated [18:01:00] yeah doesn't look like mw liked that, lots of "server not replicating?" [18:01:07] <_joe_> mobrovac: we don't log upstream request timing anywhere? [18:01:14] <_joe_> godog: uh? [18:01:17] <_joe_> jynus: ^^ [18:01:24] _joe_: upstream req timing? [18:01:24] dbstore1001's alert is expected [18:01:27] _joe_: I'm looking at logstash [18:01:41] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [1000.0] [18:01:41] <_joe_> godog: can you share a link? 
[18:01:41] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [18:02:01] RECOVERY - MD RAID on mw2256 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [18:02:11] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [18:02:11] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [18:02:20] _joe_: yep https://logstash.wikimedia.org/goto/acba8ab83ede04761cf451cba485c7bb [18:02:24] (03PS1) 10Ottomata: Enable eventbus RCFeed in production and deployment-prep beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334389 (https://phabricator.wikimedia.org/T152030) [18:02:41] 06Operations, 10ops-codfw: Degraded RAID on mw2256 - https://phabricator.wikimedia.org/T156242#2973461 (10akosiaris) 05Open>03Resolved a:03akosiaris Added the second disk to the array and it has resynced and recovered. Resolving for now [18:03:11] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1000.0] [18:03:11] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:03:11] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:03:28] sooo the DB barked? [18:03:46] some databases are unreachable due to the ongoing asw-c2-eqiad replacement [18:03:51] but that was expected and should have been ok [18:03:52] I have nothing for you [18:04:10] I think we should explicitly depool the affected dbs from mediawiki for now [18:04:10] jynus: what does that mean? [18:04:11] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [18:04:11] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [18:04:11] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:04:11] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:04:14] <_joe_> jynus: are those servers depooled from mediawiki? [18:04:14] can you confirm it is mediawiki and not varnish? [18:04:18] no [18:04:26] this has nothing to do with varnish [18:04:27] <_joe_> ok so that it's the problem [18:04:29] (03PS2) 10RobH: DNS: Add mgmt and production dns for mc2019-mc2036 Bug:T155755 [dns] - 10https://gerrit.wikimedia.org/r/334366 (owner: 10Papaul) [18:04:30] <_joe_> jynus: it's mediaiwki [18:04:32] why is that a problem _joe_? [18:05:04] <_joe_> paravoid: because it's probably what is slowing everything down [18:05:11] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:05:11] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [18:05:11] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:05:32] there are no varnishes whatsoever on C2, fwiw, so we can rule that out for sure [18:05:52] <_joe_> is db1055 in that row? [18:06:00] also fwiw, I'm not actively troubleshooting this, I need to be looking at the switch [18:06:02] yes [18:06:05] <_joe_> I see 500K errors of type "DBReplication" for it [18:06:11] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [18:06:11] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:06:12] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:06:12] PROBLEM - restbase endpoints health on restbase-dev1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:06:12] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:06:12] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:06:12] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:06:13] <_joe_> paravoid: yes, I'm talking to everyone else [18:06:13] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:06:13] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:06:23] <_joe_> so I'm pretty sure if we depool those servers from mw [18:06:27] <_joe_> we're going to be ok [18:06:29] db1051 - db1060 are on C2 yes [18:06:29] (03CR) 10RobH: [C: 032] DNS: Add mgmt and production dns for mc2019-mc2036 Bug:T155755 [dns] - 10https://gerrit.wikimedia.org/r/334366 (owner: 10Papaul) [18:06:37] ok, let's depool them then [18:06:40] nah [18:06:40] <_joe_> should I? [18:06:45] the switch is going to be back any moment now [18:06:56] 503-wise this seems to be affecting only /w/api.php [18:06:57] although they should be anyway depooled by mediawiki, no ? [18:07:02] but I explicitly did not want to do that [18:07:11] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [18:07:11] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [18:07:11] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [18:07:11] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [18:07:11] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [18:07:12] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:07:12] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:07:23] * mobrovac enjoys watching paravoid playing it cool [18:07:41] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [18:07:49] <_joe_> akosiaris: define depooled [18:07:55] (03PS1) 10Jcrespo: Depool db1055, 56, 57, 59 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334391 [18:08:04] the switch is booting [18:08:11] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [18:08:11] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [18:08:11] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [18:08:11] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [18:08:16] _joe_: yeah point taken.. it's the timeout for connecting to one on each request that's the problem, clearly [18:08:18] <_joe_> paravoid: damn, I would've loved to test this theory [18:08:24] so unless something is not wired right, it will be up RSN [18:08:28] I can power it off if you want :P [18:08:34] <_joe_> nah [18:08:40] <_joe_> let's end this outage [18:08:41] what is the point of mediawiki load balancer [18:08:42] <_joe_> :) [18:08:49] <_joe_> jynus: don't ask me [18:08:54] if it doesn't depool stuff [18:09:02] it's supposed to [18:09:03] <_joe_> jynus: even if it does, it's per-request [18:09:08] <_joe_> it's php [18:09:12] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:09:12] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:09:12] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:09:12] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:09:12] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:09:18] <_joe_> or does it store its data somewhere? [18:09:22] <_joe_> maybe apc? [18:09:27] I expected that [18:09:33] store data on auto-depooled db servers? [18:09:36] <_joe_> I never looked at that part [18:09:37] slave lag is in memcache [18:09:43] <_joe_> oh joy [18:09:44] if everything suffers a timeout [18:09:50] it's as bad as it being down [18:09:53] PROBLEM - Host graphite1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:09:58] which makes it useless [18:09:58] <_joe_> wat? [18:10:01] PROBLEM - Host cp1054 is DOWN: PING CRITICAL - Packet loss = 100% [18:10:01] PROBLEM - Host cp1099 is DOWN: PING CRITICAL - Packet loss = 100% [18:10:01] PROBLEM - Host cp1053 is DOWN: PING CRITICAL - Packet loss = 100% [18:10:05] <_joe_> paravoid: ^^ [18:10:06] uh? [18:10:11] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:10:11] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [18:10:12] PROBLEM - Host cp1048 is DOWN: PING CRITICAL - Packet loss = 100% [18:10:12] PROBLEM - Host cp1051 is DOWN: PING CRITICAL - Packet loss = 100% [18:10:12] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: /data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle = None: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [18:10:12] PROBLEM - Host cp1050 is DOWN: PING CRITICAL - Packet loss = 100% [18:10:12] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 
(expecting: 200) [18:10:13] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:10:13] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:10:14] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:10:14] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:10:15] so I go on with the patch? [18:10:17] well [18:10:17] It looks like you guys have much bigger problems right now, but for the record it appears OTRS is having difficulties [18:10:18] <_joe_> shiiit [18:10:22] RECOVERY - Host cp1048 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [18:10:22] RECOVERY - Host cp1050 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [18:10:22] RECOVERY - Host cp1051 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [18:10:22] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [18:10:22] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:10:31] RECOVERY - Host cp1053 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [18:10:31] RECOVERY - Host graphite1001 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [18:10:31] RECOVERY - Host cp1054 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [18:10:32] something bounced there - all of row C? [18:10:33] same switch [18:10:33] <_joe_> ok just a fluctuation it seems :P [18:10:51] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:10:51] PROBLEM - puppet last run on lvs1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:10:52] RECOVERY - Host cp1099 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [18:10:52] PROBLEM - Host radon is DOWN: PING CRITICAL - Packet loss = 100% [18:11:02] <_joe_> radon down? [18:11:04] (03PS1) 10Ottomata: Configure recentchange stream endpoint in EventStreams [puppet] - 10https://gerrit.wikimedia.org/r/334393 (https://phabricator.wikimedia.org/T143925) [18:11:11] PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [18:11:11] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [18:11:11] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [18:11:11] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [18:11:12] PROBLEM - restbase endpoints health on cerium is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [18:11:12] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [18:11:12] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [18:11:13] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [18:11:13] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy [18:11:14] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [18:11:14] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [18:11:15] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [18:11:15] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy 
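[Editor's note: the exchange above — MediaWiki's load balancer depools per request, in PHP, with replica lag parked in memcache — explains why an unreachable-but-unlagged host hurts every request. A rough stdlib sketch of that model follows; the function and host names are invented for illustration and this is not MediaWiki's actual LoadBalancer code:]

```python
import random
import time

# Hypothetical sketch of per-request replica selection: lag values live in a
# shared cache (memcached in production), but the *connect* attempt still
# happens inside every request, so an unreachable-but-not-lagged host costs
# each request a full connect timeout before it is skipped.

def choose_replica(replicas, lag_cache, connect, max_lag=30):
    """Return the first usable replica, skipping hosts whose cached lag is
    too high; hosts with no failure recorded in the cache are still dialled."""
    candidates = [h for h in replicas if lag_cache.get(h, 0) <= max_lag]
    random.shuffle(candidates)
    for host in candidates:
        if connect(host):          # per-request probe: this is the slow path
            return host
    return None

# Simulated environment: db1055 is lagged (known via the cache, skipped for
# free), db1056 is unreachable (no cache entry, so every request pays the
# timeout), db1060 is healthy.
lag_cache = {"db1055": 900}

def fake_connect(host, down={"db1056"}):
    if host in down:
        time.sleep(0.05)           # stand-in for a multi-second TCP timeout
        return False
    return True

picked = choose_replica(["db1055", "db1056", "db1060"], lag_cache, fake_connect)
print(picked)  # always db1060: db1055 skipped via cache, db1056 fails live
```

[The sketch shows the asymmetry discussed above: lag-based depooling is cheap because the state is shared, while dead-host detection is paid per request.]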
[18:11:16] RECOVERY - restbase endpoints health on restbase-dev1003 is OK: All endpoints are healthy [18:11:21] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [18:11:22] <_joe_> seriously, what's happening there? [18:11:29] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::etcd::tlsproxy: add etcd-like 401 response [puppet] - 10https://gerrit.wikimedia.org/r/334387 (owner: 10Giuseppe Lavagetto) [18:11:31] RECOVERY - Host radon is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [18:11:39] go ahead with depooling the databases in mw-config [18:11:50] the switch is up, weird stuff are happening though [18:11:50] heh [18:11:52] asw-c2 not in mixed mode? [18:11:53] so let's do that for now [18:12:02] (03CR) 10jerkins-bot: [V: 04-1] Configure recentchange stream endpoint in EventStreams [puppet] - 10https://gerrit.wikimedia.org/r/334393 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [18:12:03] <_joe_> jynus: ^^ [18:12:11] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [18:12:11] RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy [18:12:11] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [18:12:18] mark: yeah, I'm guessing zeroize reset that :( [18:12:20] radon and cps are not on that switch and bounced [18:12:28] (03CR) 10Jcrespo: [C: 032] Depool db1055, 56, 57, 59 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334391 (owner: 10Jcrespo) [18:12:30] so the rest of C switches must've bounced for a bit? [18:12:51] PROBLEM - puppet last run on mw1196 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. 
Failed resources (up to 3 shown): File[/etc/apache2/conf-available/50-server-status.conf],File[/etc/ganglia/conf.d/hhvm_mem.pyconf],File[/etc/ssh/userkeys/pybal-check] [18:12:55] bblack: the uplinks to cr*eqiad did, I don't know why [18:12:56] _joe_, in fact, in the past [18:13:01] PROBLEM - salt-minion processes on cp3012 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [18:13:02] Jan 26 18:10:38 asw-c-eqiad /kernel: : port status changed [18:13:02] Jan 26 18:10:39 asw-c-eqiad /kernel: : port status changed [18:13:02] Jan 26 18:10:41 asw-c-eqiad /kernel: KERN_LACP_INTF_STATE_CHANGE: lacp_update_state_userspace: new state is 0x30 cifd xe-8/0/36 [18:13:05] Jan 26 18:10:42 asw-c-eqiad /kernel: KERN_LACP_INTF_STATE_CHANGE: lacp_update_state_userspace: new state is 0x30 cifd xe-8/0/38 [18:13:07] ok that explains it [18:13:08] when a server went down, we had no issues [18:13:08] (03PS2) 10Ottomata: Configure recentchange stream endpoint in EventStreams [puppet] - 10https://gerrit.wikimedia.org/r/334393 (https://phabricator.wikimedia.org/T143925) [18:13:11] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [18:13:22] <_joe_> jynus: this is a bit different though [18:13:28] <_joe_> there is no route to the host [18:13:37] which means slow fail, not fast [18:13:40] so it lacks a fast timeout [18:13:51] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1000.0] [18:14:04] !log rebooting newly provisioned asw-c2-eqiad to enable mixed mode [18:14:04] <_joe_> I guess we have a timeout that is not tuned for that [18:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:11] (03Merged) 10jenkins-bot: Depool db1055, 56, 57, 59 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334391 (owner: 10Jcrespo) [18:14:26] "tuned for that" [18:14:27] (03PS3) 
10Ottomata: Configure recentchange stream endpoint in EventStreams [puppet] - 10https://gerrit.wikimedia.org/r/334393 (https://phabricator.wikimedia.org/T143925) [18:14:27] well you have to be careful with that rabbithole, of replacing TCP with fast timeouts+retries [18:14:28] you mean hosts being down? [18:14:38] what is it tuned for then? :) [18:14:43] it can cause worse problems than it appears to solve [18:14:52] (03PS4) 10Ottomata: Configure recentchange stream endpoint in EventStreams [puppet] - 10https://gerrit.wikimedia.org/r/334393 (https://phabricator.wikimedia.org/T143925) [18:14:58] <_joe_> bblack: I think we went that way [18:15:07] <_joe_> for the databases specifically [18:15:21] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:15:24] mediawiki went that way [18:15:30] you know I want to go my way [18:15:32] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1055, 56, 57, 59 (duration: 00m 54s) [18:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:41] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [18:15:42] or the highway :P [18:15:49] hmm codfw ? [18:16:02] or is it a slave for row C ? [18:16:11] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [18:16:12] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [18:16:12] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [18:16:13] no rdbs on C2 [18:16:21] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [18:16:21] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [18:16:30] <_joe_> ok this is interesting [18:16:35] <_joe_> paravoid: is the switch up? 
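[Editor's note: bblack's warning above — replacing TCP's patience with aggressive application-level timeouts and retries can make things worse — is usually resolved with a bounded, backed-off retry. A minimal sketch, with illustrative parameters only:]

```python
import random

# Bounded retry with exponential backoff and jitter: the usual compromise
# between "slow fail" (waiting out a full TCP timeout) and the retry storm
# you get from uncapped fast retries. Parameters are illustrative.

def retry_with_backoff(op, attempts=3, base=0.1, cap=2.0, sleep=None):
    """Run op() up to `attempts` times, sleeping base*2^i (jittered, capped)
    between failures. Returns op()'s value or raises the last error."""
    sleep = sleep or (lambda s: None)   # injectable so tests need not wait
    last = None
    for i in range(attempts):
        try:
            return op()
        except OSError as e:
            last = e
            delay = min(cap, base * (2 ** i)) * (1 + random.random())
            sleep(delay)
    raise last

# The amplification arithmetic behind the "rabbithole": 1000 clients each
# retrying 5 times against a dead backend generate 5000 connection attempts,
# not 1000 — which is how fast retries can deepen an outage.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("connect timed out")
    return "ok"

print(retry_with_backoff(flaky))  # "ok" on the third attempt
```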
[18:16:41] yes [18:16:41] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3099417 keys, up 87 days 9 hours - replication_delay is 0 [18:16:51] things seem better, but not sure if because connectivity or the depool [18:16:56] <_joe_> because the traffic to API is going up [18:16:57] but also jynus merged his commit [18:17:01] RECOVERY - configured eth on lvs1001 is OK: OK - interfaces up [18:17:07] <_joe_> yeah we cannot know what worked [18:17:11] RECOVERY - Juniper alarms on asw-c-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [18:17:11] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [18:17:11] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [18:17:11] RECOVERY - Host analytics1030 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [18:17:12] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [18:17:12] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [18:17:12] https://grafana.wikimedia.org/dashboard/db/api-summary?from=now-1h&to=now [18:17:13] I can reboot it again :P [18:17:19] (03CR) 10jenkins-bot: Depool db1055, 56, 57, 59 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334391 (owner: 10Jcrespo) [18:17:21] RECOVERY - configured eth on lvs1003 is OK: OK - interfaces up [18:17:21] RECOVERY - Host db1057 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [18:17:21] RECOVERY - Host labstore1004 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [18:17:21] RECOVERY - Host db1055 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [18:17:21] RECOVERY - Host analytics1029 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [18:17:21] RECOVERY - Host analytics1028 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [18:17:22] 
RECOVERY - Host db1056 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [18:17:22] RECOVERY - Host db1059 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [18:17:23] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [18:17:23] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [18:17:24] RECOVERY - Host db1087 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [18:17:24] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [18:17:25] RECOVERY - Host db1088 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [18:17:25] RECOVERY - Host es1015 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [18:17:35] <_joe_> paravoid: let's try in a few minutes, when api is recovered? [18:17:39] <_joe_> just out of curiosity [18:17:41] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [18:17:41] RECOVERY - configured eth on lvs1002 is OK: OK - interfaces up [18:18:05] there's nothing else on that switch that would explain an API slowdown anyway [18:18:10] to be fair, I think we had a miscommunication, if you had told me we were to have extended downtime [18:18:11] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [18:18:23] I would have depooled in advance [18:18:27] just to be safe [18:18:39] I thought only new glitches could happen [18:18:58] in any case, the model is not very good [18:19:21] <_joe_> api is fully recovered now [18:19:21] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:19:23] we do not want flopping in the first place [18:19:26] I said it'd be 30 minutes at some point (and it was almost exactly 30 minutes :) [18:19:29] but in any case [18:19:38] I wouldn't like to depool things in advance anyway [18:19:41] PROBLEM - puppet last run on db1057 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:19:50] this kind of thing may happen _at any point_ [18:19:51] PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:19:51] PROBLEM - puppet last run on ms-be1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:19:51] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:19:51] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:19:52] PROBLEM - puppet last run on db1055 is CRITICAL: CRITICAL: Puppet has 26 failures. Last run 18 minutes ago with 26 failures. Failed resources (up to 3 shown) [18:19:52] PROBLEM - puppet last run on ms-be1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:19:52] PROBLEM - puppet last run on ms-be1007 is CRITICAL: CRITICAL: Puppet has 13 failures. Last run 24 minutes ago with 13 failures. Failed resources (up to 3 shown) [18:19:54] and last for longer than that [18:19:54] <_joe_> paravoid: let's try to reboot it again in 5? [18:20:01] PROBLEM - puppet last run on db1087 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:20:11] PROBLEM - puppet last run on db1088 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:20:11] PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:20:11] PROBLEM - puppet last run on db1060 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:20:12] not sure what the point will be [18:20:21] PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:20:21] PROBLEM - puppet last run on es1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:20:23] those are just puppet run failures [18:20:32] (03PS5) 10Ottomata: Configure recentchange stream endpoint in EventStreams [puppet] - 10https://gerrit.wikimedia.org/r/334393 (https://phabricator.wikimedia.org/T143925) [18:20:34] yeah those are expected [18:20:37] <_joe_> to see if, with the servers depooled, API would suffer or not [18:20:41] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [18:20:41] (03CR) 10Ottomata: "looks good: https://puppet-compiler.wmflabs.org/5246/scb1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/334393 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [18:20:46] <_joe_> if it does not, we have to fix something in mediawiki [18:20:51] RECOVERY - puppet last run on db1055 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [18:20:51] I am ok with that [18:20:51] RECOVERY - puppet last run on ms-be1007 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [18:20:53] test [18:20:54] <_joe_> and how it pools/depools databases [18:20:56] better test now [18:21:01] than when it is too late [18:21:29] <_joe_> I agree with paravoid that we should expect not to need to depool the servers from mediawiki-config [18:21:36] oh, I agree too [18:21:36] <_joe_> and that it's healthy to verify what happens [18:21:43] do not get me wrong [18:21:58] I just didn't know this was going to happen [18:22:06] <_joe_> me neither [18:22:24] <_joe_> I'm looking at how we perform the db connections [18:22:31] well, depooling 
is the right thing to do when you know something will be in some kind of extended maintenance, or is staying dead after a failure. [18:22:33] <_joe_> and indeed we try to do a fast reconnect [18:22:44] :-( [18:22:48] but still, software should normally auto-depool / workaround issues automagically, too. [18:22:51] 06Operations, 06DC-Ops: Information missing from racktables - https://phabricator.wikimedia.org/T150651#2973497 (10akosiaris) Added dummy serials to ``` atlas-codfw, atlas-eqiad, atlas-ulsfo, br1-knams, cp3001, cp3002, dataset1001-array1, db1027, frdb1001, indium, msw-c1-eqiad, msw1-ulsfo, msw2-ulsfo, mw1010... [18:22:55] _joe_, to the same server? [18:23:01] RECOVERY - salt-minion processes on cp3012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:23:10] I thought it was to the "pool" [18:23:11] (03CR) 10Ottomata: [C: 032] Configure recentchange stream endpoint in EventStreams [puppet] - 10https://gerrit.wikimedia.org/r/334393 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [18:23:48] 06Operations, 10ops-eqiad: Decommission old asw-c2-eqiad - https://phabricator.wikimedia.org/T156398#2973502 (10Cmjohnson) [18:24:18] 06Operations, 06DC-Ops: Information missing from racktables - https://phabricator.wikimedia.org/T150651#2973516 (10akosiaris) [18:24:33] <_joe_> anyways, we can also live without this test [18:24:51] _joe_: what test do you want to do? 
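[Editor's note: bblack's point above — depool manually for planned maintenance, but let software route around unplanned failures automatically — is what a circuit breaker provides. A minimal sketch; the class, thresholds, and clock injection are invented for illustration, not anything deployed here:]

```python
import time

# Minimal circuit-breaker sketch: after `threshold` consecutive failures the
# host is treated as auto-depooled ("open") for `cooldown` seconds, so callers
# fail fast instead of each paying a full connect timeout. Illustrative only.

class CircuitBreaker:
    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable for deterministic testing
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            self.opened_at = None               # half-open: one probe allowed
            self.failures = self.threshold - 1  # a single failure re-opens
            return True
        return False

    def record(self, ok):
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()

now = [0.0]
cb = CircuitBreaker(threshold=2, cooldown=5.0, clock=lambda: now[0])
cb.record(False); cb.record(False)   # two failures -> breaker opens
print(cb.allow())                    # False: fail fast, no timeout paid
now[0] = 6.0
print(cb.allow())                    # True: cooldown elapsed, probe again
```

[The half-open probe is the part that avoids the "flopping" jynus objects to: one request tests the host after the cooldown, not the whole fleet at once.]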
[18:24:51] RECOVERY - puppet last run on analytics1028 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [18:24:51] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [18:25:10] I'm reluctant to reboot asw-c2, as this seems to have caused mayhem across all of asw-c-eqiad [18:25:21] RECOVERY - puppet last run on db1056 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [18:25:25] he he [18:25:26] not during the whole window, just at the end when it booted [18:25:28] <_joe_> try to turn off the switch again, and see if we still have issues when those dbs are unreachable [18:25:30] and we lost cp* and radon etc. [18:25:42] but I can turn the interfaces to db1051-db1060 as down [18:25:54] of course that's kind of pointless if they're depooled in mediawiki-config, isn't it? [18:25:56] that seems more sensible to debug what happened and how to address it [18:26:04] pool them and down their interfaces [18:26:14] <_joe_> that's a possibility too [18:26:29] but we kinda know what will happen [18:26:32] but really, do we need that test to know the logic/implementation has issues? [18:26:40] yeah exactly [18:26:41] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:26:41] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:26:42] we've seen the result, we know it's bad, we can read the code [18:26:47] that ^ [18:26:51] unless you want to debug further [18:26:56] in which case, take mwdebug or some other host [18:27:00] pool those dbs back [18:27:06] <_joe_> bblack: I'm already doing that, given the depool/reboot were at the same time [18:27:06] why it works when a single server goes down? 
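[Editor's note: the test being weighed here — repool the dbs and admin-down their switch interfaces — exercises the slow-fail path _joe_ described earlier ("no route to the host ... slow fail, not fast"): connects to a blackholed address hang for the full connect timeout, unlike a refused connection, which fails instantly. A stdlib sketch of timing that, using a reserved TEST-NET address as a stand-in; actual behaviour depends on the local network, so this is a drill, not a measurement:]

```python
import socket
import time

# Connecting to a blackholed host demonstrates the slow-fail mode: connect()
# hangs until the timeout rather than returning "connection refused".
# 203.0.113.1 is TEST-NET-3 (RFC 5737), reserved and normally unrouted.

def timed_connect(host, port, timeout=0.5):
    start = time.monotonic()
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        outcome = "connected"
    except OSError as e:
        outcome = type(e).__name__
    return outcome, time.monotonic() - start

outcome, elapsed = timed_connect("203.0.113.1", 3306)
print(outcome, round(elapsed, 2))
```

[With a generous default timeout (many clients use 3–10 s or more), every request to such a host stalls for the full duration — the per-request cost seen during this outage.]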
[18:27:07] and I'll turn them down on the switch [18:27:11] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:27:14] or even try pooling random IPs instead :) [18:27:22] <_joe_> paravoid: that's a good idea, maybe tomorrow? [18:27:26] <_joe_> I'm a bit tired [18:27:28] sure [18:27:32] as FYI the analytics1001 Hadoop master node got impacted and failed over to analytics1002 [18:27:40] you can even try pooling 10.0.0.1 as a database [18:27:41] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] [18:27:41] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [18:27:41] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:27:42] (failure to contact ZK) [18:27:51] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:28:11] RECOVERY - puppet last run on db1060 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [18:28:28] <_joe_> paravoid: my only doubt was, we depooled and rebooted at the same time, I'm pretty sure the depool was decisive, but yeah, I'll try that instead [18:28:39] an1001 is behind asw-c4-eqiad [18:28:41] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:28:51] RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [18:28:54] <_joe_> testing appropriately with mwdebug will require some care [18:28:57] elukey: yeah, for a brief moment the whole stack got upset [18:29:07] <_joe_> as I'd need to generate an appropriate amount of traffic [18:29:27] about one and a half minute [18:29:38] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: During Phabricator upgrade 
on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2973525 (10Marostegui) >>! In T156373#2973370, @greg wrote: > From the description, have the phab statistics cron... [18:29:46] ganeti hosts appear to be in private1-c-eqiad and otrs (mendelevium on ganeti01.svc) appearing to become unreachable would be explained by the network issues in that row, right? [18:29:56] Krenair: that's correct [18:30:07] I did see an error at one point about the daemon not running [18:30:10] which was strange [18:30:38] it sorted itself out though [18:30:52] a solid proxy that handles flopping rationally [18:30:53] <_joe_> conf1* etcd had to elect a leader [18:30:56] _joe_: there was literally nothing else on that switch other than db105* that would explain this [18:30:59] and lots and lots of testing [18:30:59] <_joe_> during that flapping [18:31:17] <_joe_> so ema, bblack : I'd check pybals in eqiad/esams [18:31:29] for? [18:31:41] PROBLEM - Hadoop NodeManager on analytics1028 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:31:42] Krenair: I think the daemon state (running/not running) is somehow written to the database. 
That's so the multiple appservers know (if you have multiple) since you only need 1 daemon running [18:31:43] RECOVERY - puppet last run on db1057 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [18:31:43] PROBLEM - Hadoop NodeManager on analytics1030 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:31:48] ah [18:31:51] PROBLEM - Hadoop NodeManager on analytics1029 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:32:00] yeah database connectivity could've been lost [18:32:11] !log starting pybal on lvs1001/lvs1002/lvs1003 [18:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:23] <_joe_> paravoid: whenever they lose connection to etcd, there is a limited chance that they'll stop listening to etcd changes [18:32:28] <_joe_> ema is working on it [18:32:33] _joe_: yeah I don't know how to know that, except to test depools [18:32:41] RECOVERY - pybal on lvs1003 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [18:32:41] PROBLEM - Hadoop NodeManager on analytics1031 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:32:43] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [18:32:46] <_joe_> bblack: looking at logs usually shows that [18:32:53] 'usually' :) [18:33:01] RECOVERY - PyBal backends health check on lvs1001 is OK: PYBAL OK - All pools are healthy [18:33:01] RECOVERY - PyBal backends health check on lvs1002 is OK: PYBAL OK - All pools are healthy [18:33:04] we've seen a bunch of different forms of fallout, sometimes it's silent [18:33:11] RECOVERY - pybal on lvs1002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [18:33:11] RECOVERY - puppet last run on labstore1004 is OK: OK: Puppet 
is currently enabled, last run 17 seconds ago with 0 failures [18:33:20] * _joe_ off [18:33:21] RECOVERY - pybal on lvs1001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [18:33:40] 07Puppet, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Puppet failure on instance creation - https://phabricator.wikimedia.org/T156297#2973537 (10Andrew) I'm not sure I know what you mean by the 'console output'. If you're talking about the system log then, yes, it always tells you a lot. [18:33:54] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2973539 (10greg) I don't have the rights, I believe @jcrespo did it (with verification from @mmodell) last time w... [18:34:28] (03PS2) 10Andrew Bogott: New labs instances: Fail more obviously if we have DNS issues [puppet] - 10https://gerrit.wikimedia.org/r/334365 [18:34:30] (03PS1) 10Andrew Bogott: Horizon: Remove more steps from the mitaka instance creation workflow [puppet] - 10https://gerrit.wikimedia.org/r/334399 [18:34:32] I am taking care of the hadoop nodes [18:34:38] (03CR) 10Fjalapeno: [C: 031] Text VCL: consolidate mobile hostname rewrite regex [puppet] - 10https://gerrit.wikimedia.org/r/333158 (https://phabricator.wikimedia.org/T155504) (owner: 10Ema) [18:34:41] RECOVERY - Hadoop NodeManager on analytics1028 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:35:02] delayed alarms that I set this week, the daemons went down before [18:35:14] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2973540 (10Marostegui) >>! In T156373#2973390, @Paladox wrote: > > Maybe we should find the patch that fixed it... 
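[Editor's note: _joe_'s PyBal/etcd concern above — a watcher that, after losing its etcd connection, silently stops receiving changes — is a classic watch-loop bug. A generic reconnecting sketch follows; it is stdlib-only with a simulated event source, not PyBal's actual implementation, and all names are invented:]

```python
# A watch loop that re-establishes its subscription from the last seen index
# after every connection failure, instead of going silent. A reconnect counter
# gives monitoring something to alert on if the watcher truly wedges.

def run_watch(open_watch, apply_change, max_reconnects=5):
    """Consume (index, change) events from open_watch(since_index),
    reconnecting on error; returns the number of reconnects performed."""
    index = 0
    reconnects = 0
    while reconnects <= max_reconnects:
        try:
            for idx, change in open_watch(index):
                apply_change(change)
                index = idx           # resume point survives the reconnect
            return reconnects         # event source exhausted cleanly
        except ConnectionError:
            reconnects += 1           # re-open instead of going silent
    raise RuntimeError("watch wedged: too many reconnects")

# Simulated etcd: yields two events, drops the connection, then two more
# once the watch is re-opened from index 2.
def flaky_watch(since):
    events = [(1, "pool mw1196"), (2, "depool db1055"),
              (3, "pool db1055"), (4, "depool mw1196")]
    for idx, ev in events:
        if idx <= since:
            continue
        if idx == 3 and since < 2:    # first pass dies before event 3
            raise ConnectionError("etcd went away")
        yield idx, ev

seen = []
reconnects = run_watch(flaky_watch, seen.append)
print(reconnects, seen)   # 1 reconnect, all four changes applied exactly once
```

[Resuming from the last applied index is what prevents both the silent-gap failure mode and duplicate application of changes after a blip.]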
[18:36:01] RECOVERY - puppet last run on db1087 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[18:36:16] !log restarting Yarn node managers on an102[89] and an103[01], impacted by the switch restart
[18:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:41] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[18:37:02] PROBLEM - swift eqiad-prod container availability on graphite1001 is CRITICAL: CRITICAL: 65.00% of data under the critical threshold [88.0]
[18:37:21] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[18:37:38] (03CR) 10Andrew Bogott: [C: 032] Horizon: Remove more steps from the mitaka instance creation workflow [puppet] - 10https://gerrit.wikimedia.org/r/334399 (owner: 10Andrew Bogott)
[18:37:41] RECOVERY - Hadoop NodeManager on analytics1031 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[18:37:43] RECOVERY - Hadoop NodeManager on analytics1030 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[18:37:51] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[18:38:21] RECOVERY - puppet last run on es1015 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[18:38:51] RECOVERY - Hadoop NodeManager on analytics1029 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[18:38:53] RECOVERY - puppet last run on lvs1006 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[18:39:37] (03PS1) 10Brion VIBBER: Increase video transcode max time from 8 to 16 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334401 (https://phabricator.wikimedia.org/T156185)
[18:39:41] RECOVERY - puppet last run on mw1196 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[18:40:01] (03PS1) 10Ottomata: Fix multiline yaml parsing in eventstreams stream description [puppet] - 10https://gerrit.wikimedia.org/r/334402
[18:42:11] RECOVERY - puppet last run on db1088 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[18:43:51] RECOVERY - puppet last run on ms-be1006 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[18:45:01] PROBLEM - swift eqiad-prod container availability on graphite1001 is CRITICAL: CRITICAL: 48.15% of data under the critical threshold [88.0]
[18:47:50] (03CR) 10Ottomata: [V: 032 C: 032] Fix multiline yaml parsing in eventstreams stream description [puppet] - 10https://gerrit.wikimedia.org/r/334402 (owner: 10Ottomata)
[18:49:11] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2973625 (10mobrovac) 05Open>03Resolved All of the services but Maps have been upgraded to Node 6, so I'm declaring victory here. Thanks to everyone that helped! The Maps...
[18:49:47] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2973641 (10Paladox) @Marostegui hi, i was talking about doing it upstream.
[18:50:01] RECOVERY - swift eqiad-prod container availability on graphite1001 is OK: OK: Less than 1.00% under the threshold [92.0]
[18:52:54] !log otto@tin Starting deploy [eventstreams/deploy@f1a1866]: (no message)
[18:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:11] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:56:09] !log otto@tin Finished deploy [eventstreams/deploy@f1a1866]: (no message) (duration: 03m 16s)
[18:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:01] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:01] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:01] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:01] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:01] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:01] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:02] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:02] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:03] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:03] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:04] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:04] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:05] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:09] (03PS1) 10Faidon Liambotis: Remove Krenair's access [puppet] - 10https://gerrit.wikimedia.org/r/334406
[18:57:21] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:21] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:21] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:21] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:31] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:31] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:31] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:58:12] (03CR) 10Faidon Liambotis: [C: 032] Remove Krenair's access [puppet] - 10https://gerrit.wikimedia.org/r/334406 (owner: 10Faidon Liambotis)
[18:58:28] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2973700 (10akosiaris) 05Resolved>03Open Actually, there's etherpad left. I 'll do the upgrade tomorrow though :-). Reopening in the meantime
[18:59:07] (03CR) 10Faidon Liambotis: [V: 032 C: 032] Remove Krenair's access [puppet] - 10https://gerrit.wikimedia.org/r/334406 (owner: 10Faidon Liambotis)
[18:59:51] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[19:00:01] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170126T1900).
[19:00:10] jouncebot: Go away
[19:00:11] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[19:00:11] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[19:00:11] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:00:11] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[19:00:12] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[19:00:12] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[19:00:12] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[19:00:21] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[19:00:21] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[19:00:21] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[19:00:29] heh
[19:00:51] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[19:00:51] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[19:00:51] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[19:00:51] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[19:00:51] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[19:00:52] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[19:00:52] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[19:00:53] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[19:00:53] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[19:07:31] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:07:31] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:07:31] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:07:31] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:07:31] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:07:31] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:07:31] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:08:01] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:08:01] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:08:01] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:08:01] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:08:01] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:08:01] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:08:01] (03PS5) 10Dzahn: site.pp, DHCP: remove db1019, db1042 [puppet] - 10https://gerrit.wikimedia.org/r/334010 (https://phabricator.wikimedia.org/T149793)
[19:08:02] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:08:02] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:08:03] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:08:03] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:08:12] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:08:17] mmm
[19:08:21] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:08:21] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:09:26] not sure what is going on
[19:09:29] another backup?
[19:09:35] a crash?
[19:10:01] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[19:10:11] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[19:10:11] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[19:10:21] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[19:10:21] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[19:10:22] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[19:10:22] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[19:10:22] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[19:10:22] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[19:10:22] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[19:10:51] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[19:10:51] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[19:10:51] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[19:10:51] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[19:10:51] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[19:10:51] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[19:10:52] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[19:10:52] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[19:10:53] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[19:10:53] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[19:11:28] maybe a hiccup after the network issue?
[19:13:11] !log restore analytics1001 as RM and HDFS masters
[19:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:36] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2972713 (10epriestley) Not sure this is helpful, but the `phabricator_search.search_documentfield` table is just...
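The dbstore1001 alert storm above alternates between CHECK_NRPE socket timeouts and OK lines derived from MariaDB's SHOW SLAVE STATUS. A minimal sketch of the OK/CRITICAL decision such a check makes is below; the function name and exact logic are hypothetical, and only the field names and message wording follow the log output above (note how a thread stopped with no error is reported "No, (no error: intentional)" rather than CRITICAL):

```python
def check_slave(status: dict, kind: str) -> str:
    """Return an Icinga-style status line for one replication thread.

    `status` is a row from SHOW SLAVE STATUS as a dict (empty if the
    host is not a replica); `kind` is "io" or "sql".
    """
    if not status:
        return f"OK slave_{kind}_state not a slave"
    key = "Slave_IO_Running" if kind == "io" else "Slave_SQL_Running"
    if status.get(key, "No") == "Yes":
        return f"OK slave_{kind}_state {key}: Yes"
    if not status.get("Last_Error"):
        # Stopped on purpose (e.g. maintenance), as in the
        # "No, (no error: intentional)" recoveries in the log.
        return f"OK slave_{kind}_state {key}: No, (no error: intentional)"
    return f"CRITICAL slave_{kind}_state {key}: No, {status['Last_Error']}"

print(check_slave({"Slave_IO_Running": "Yes"}, "io"))
print(check_slave({}, "sql"))
print(check_slave({"Slave_SQL_Running": "No"}, "sql"))
```

The CHECK_NRPE timeouts in the log sit a layer above this: the plugin never answered within 10 seconds, so Icinga reported the transport failure rather than any replication state.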
[19:21:11] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[19:22:27] (03Abandoned) 10RobH: update icinga cert check to letsencrypt for librenms [puppet] - 10https://gerrit.wikimedia.org/r/332709 (owner: 10RobH)
[19:22:45] (03Abandoned) 10RobH: lost a PDU tower in ulsfo 1.22 [dns] - 10https://gerrit.wikimedia.org/r/333931 (owner: 10RobH)
[19:22:53] (03Abandoned) 10RobH: Revert "adding ssl monitoring for wikitech-static" [puppet] - 10https://gerrit.wikimedia.org/r/334179 (owner: 10RobH)
[19:23:08] jynus: need help?
[19:25:00] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2973770 (10jcrespo) @epriestley we are already doing that in parallel. But if it happens to be the...
[19:25:59] volans, where?
[19:26:10] dbstore1001?
[19:26:16] yeah
[19:26:19] no, it was just a hiccup
[19:26:53] ok, just checking, I'm a bit late with all the backlogs
[19:27:11] do not worry too much about the dbs
[19:27:18] I appreciate it, don't take me wrong
[19:27:32] but you are 1000x more valuable with the automatization stuff
[19:27:33] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2973783 (10Paladox) @jcrespo and @Marostegui i found the patch https://github.com/MariaDB/server/co...
[19:27:37] and we depend on that a lot
[19:28:09] (03CR) 10Eevans: Enable Prometheus JMX exporter on Cassandra nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/332535 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans)
[19:28:11] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[19:28:13] jynus: thanks! but since I'm on the other timezone for just another few days, happy to help taking over things, you had pretty long days recently
[19:28:27] I thank you for that
[19:28:30] (03PS12) 10Eevans: Enable Prometheus JMX exporter on Cassandra nodes [puppet] - 10https://gerrit.wikimedia.org/r/332535 (https://phabricator.wikimedia.org/T155120)
[19:28:31] not really
[19:33:16] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2973799 (10GWicke)
[19:34:00] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2973802 (10Paladox) I got some merge conflicts when cherry picking to the 10.0 branch ~/server$ gi...
[19:36:24] (03PS1) 10Faidon Liambotis: More Krenair access removals [puppet] - 10https://gerrit.wikimedia.org/r/334419
[19:37:23] (03CR) 10Faidon Liambotis: [V: 032 C: 032] More Krenair access removals [puppet] - 10https://gerrit.wikimedia.org/r/334419 (owner: 10Faidon Liambotis)
[19:37:45] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2973806 (10epriestley) @jcrespo Ah, sorry, I hadn't actually read the linked bug. As another possi...
[19:39:32] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2973808 (10Paladox) Actually, it's very easy to fix merge conflicts.
[19:40:28] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2973809 (10Paladox) remove this ``` -<<<<<<< HEAD - /* One variable length column, word...
[19:52:41] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2973904 (10jcrespo) As usual, @epriestley, I thank you a lot for the support (thank you so much!),...
[19:57:05] (03CR) 10BBlack: [C: 031] Text VCL: consolidate mobile hostname rewrite regex [puppet] - 10https://gerrit.wikimedia.org/r/333158 (https://phabricator.wikimedia.org/T155504) (owner: 10Ema)
[20:00:05] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170126T2000). Please do the needful.
[20:00:26] chu chu
[20:03:41] (03CR) 10BearND: [C: 031] Text VCL: consolidate mobile hostname rewrite regex [puppet] - 10https://gerrit.wikimedia.org/r/333158 (https://phabricator.wikimedia.org/T155504) (owner: 10Ema)
[20:03:46] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2973962 (10jcrespo) https://jira.mariadb.org/browse/MDEV-11918
[20:04:39] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2973965 (10Paladox) @jcrespo and @marostegui I've back ported the fix here https://github.com/Maria...
[20:12:44] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2973994 (10jcrespo) >>! In T156373#2973965, @Paladox wrote: > @jcrespo and @marostegui I've back po...
[20:13:10] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2973995 (10Paladox) Ok, your welcome :)
[20:16:52] (03PS1) 10Volans: Icinga: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/334423 (https://phabricator.wikimedia.org/T156294)
[20:20:15] (03CR) 10jerkins-bot: [V: 04-1] Icinga: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/334423 (https://phabricator.wikimedia.org/T156294) (owner: 10Volans)
[20:21:09] (03PS2) 10Volans: Icinga: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/334423 (https://phabricator.wikimedia.org/T156294)
[20:22:41] (03PS1) 10MarcoAurelio: Define category collation for olo.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334424 (https://phabricator.wikimedia.org/T146612)
[20:23:56] (03CR) 10Volans: [C: 032] Icinga: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/334423 (https://phabricator.wikimedia.org/T156294) (owner: 10Volans)
[20:26:11] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[20:30:19] (03CR) 10Ema: "Thanks for the reviews, I'll merge this tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/333158 (https://phabricator.wikimedia.org/T155504) (owner: 10Ema)
[20:30:47] !log refreshing logins on wikitech
[20:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:35] (03PS1) 10Volans: etcd: add missing group definition for codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/334426 (https://phabricator.wikimedia.org/T156009)
[20:43:01] PROBLEM - puppet last run on druid1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:44:27] (03CR) 10Volans: [C: 032] etcd: add missing group definition for codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/334426 (https://phabricator.wikimedia.org/T156009) (owner: 10Volans)
[20:49:16] (03PS1) 10Volans: Icinga: fix contact group membership [puppet] - 10https://gerrit.wikimedia.org/r/334428 (https://phabricator.wikimedia.org/T156294)
[20:50:26] (03CR) 10Volans: [C: 032] Icinga: fix contact group membership [puppet] - 10https://gerrit.wikimedia.org/r/334428 (https://phabricator.wikimedia.org/T156294) (owner: 10Volans)
[20:54:09] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct
[20:54:42] finally
[20:55:19] PROBLEM - puppet last run on db1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:01:02] (03CR) 10Chad: "Dangit. Maybe I should just upload all of the logos somewhere then so we're doubly sure they're still somewhere." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333962 (owner: 10Chad)
[21:02:59] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 616 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3111249 keys, up 87 days 12 hours - replication_delay is 616
[21:03:59] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 645 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3111097 keys, up 87 days 12 hours - replication_delay is 645
[21:04:59] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3101977 keys, up 87 days 12 hours - replication_delay is 0
[21:05:59] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3101804 keys, up 87 days 12 hours - replication_delay is 0
[21:11:09] RECOVERY - puppet last run on druid1003 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[21:23:19] RECOVERY - puppet last run on db1011 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[21:32:09] PROBLEM - puppet last run on mw1244 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:33:09] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.444 second response time
[21:49:32] (03CR) 10Fjalapeno: [C: 031] "Ema thanks for pushing this through!" [puppet] - 10https://gerrit.wikimedia.org/r/333158 (https://phabricator.wikimedia.org/T155504) (owner: 10Ema)
[21:51:11] 06Operations, 10Traffic, 06Wikipedia-iOS-App-Backlog, 10iOS-app-feature-Links, 13Patch-For-Review: Fix universal link support in iOS when the OS requests the site association file from m.wikipedia.org - https://phabricator.wikimedia.org/T155504#2974262 (10Fjalapeno) @JMinor @JoeWalsh the fix for this is...
[21:51:49] PROBLEM - puppet last run on mw1274 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:51:53] (03CR) 10Dereckson: "Note to test this at deployment time: check a category page on mwdebug1002, they produce reliable fatal errors when the collation code is " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334424 (https://phabricator.wikimedia.org/T146612) (owner: 10MarcoAurelio)
[21:53:13] (03PS1) 10Madhuvishy: [WIP] toolschecker: Split each check into a separate uwsgi application [puppet] - 10https://gerrit.wikimedia.org/r/334433
[21:53:20] anybody know what's up with deploy train? should wmf.9 have gone out yet, or am I mistaken?
[21:53:32] Running a little late today
[21:54:12] (03CR) 10MarcoAurelio: [C: 031] "Looks good technically speaking." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332329 (https://phabricator.wikimedia.org/T152296) (owner: 10Urbanecm)
[21:54:22] (03CR) 10jerkins-bot: [V: 04-1] [WIP] toolschecker: Split each check into a separate uwsgi application [puppet] - 10https://gerrit.wikimedia.org/r/334433 (owner: 10Madhuvishy)
[21:56:52] (03CR) 10MarcoAurelio: [C: 031] "Please also note the discussion in progress at T151408 about bots with sysop rights." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332329 (https://phabricator.wikimedia.org/T152296) (owner: 10Urbanecm)
[21:59:48] (03PS2) 10Madhuvishy: [WIP] toolschecker: Split each check into a separate uwsgi application [puppet] - 10https://gerrit.wikimedia.org/r/334433
[22:00:15] ottomata: still trying to decide if it's safe to roll forward
[22:00:31] There are blocking issues, e.g. https://phabricator.wikimedia.org/T156364
[22:00:48] (03CR) 10jerkins-bot: [V: 04-1] [WIP] toolschecker: Split each check into a separate uwsgi application [puppet] - 10https://gerrit.wikimedia.org/r/334433 (owner: 10Madhuvishy)
[22:00:57] uhh huh!
[22:00:57] ok
[22:00:58] thanks
[22:01:09] RECOVERY - puppet last run on mw1244 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[22:01:10] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.121 second response time
[22:04:46] !log mobrovac@tin Starting deploy [trending-edits/deploy@e0e32bb]: Bump replay time to 6h for T156411
[22:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:51] T156411: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411
[22:06:28] !log mobrovac@tin Finished deploy [trending-edits/deploy@e0e32bb]: Bump replay time to 6h for T156411 (duration: 01m 42s)
[22:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:06:52] (03PS1) 10Ottomata: Add hardsync shell script [puppet] - 10https://gerrit.wikimedia.org/r/334435 (https://phabricator.wikimedia.org/T125854)
[22:19:49] RECOVERY - puppet last run on mw1274 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[22:21:19] (03PS3) 10Madhuvishy: [WIP] toolschecker: Split each check into a separate uwsgi application [puppet] - 10https://gerrit.wikimedia.org/r/334433
[22:22:15] (03CR) 10Dzahn: "eh.. how come it's "members" for the other groups above but not for this one?" [puppet] - 10https://gerrit.wikimedia.org/r/334428 (https://phabricator.wikimedia.org/T156294) (owner: 10Volans)
[22:22:17] (03PS4) 10Madhuvishy: [WIP] toolschecker: Split each check into a separate uwsgi application [puppet] - 10https://gerrit.wikimedia.org/r/334433
[22:23:32] (03CR) 10Volans: "Because admins is a contactgroup itself, not a contact ;)" [puppet] - 10https://gerrit.wikimedia.org/r/334428 (https://phabricator.wikimedia.org/T156294) (owner: 10Volans)
[22:25:28] (03CR) 10Dzahn: "ah :) thanks for the fix" [puppet] - 10https://gerrit.wikimedia.org/r/334428 (https://phabricator.wikimedia.org/T156294) (owner: 10Volans)
[22:33:49] (03PS6) 10Dzahn: site.pp, DHCP: remove db1019, db1042 [puppet] - 10https://gerrit.wikimedia.org/r/334010 (https://phabricator.wikimedia.org/T149793)
[22:37:31] Fatal error: File not found: /srv/mediawiki/docroot/foundation/w/../multiversion/MWMultiVersion.php in /srv/mediawiki/docroot/foundation/w/robots.php on line 2
[22:37:40] Fatal error: File not found: /srv/mediawiki/docroot/wikivoyage.org/w/../multiversion/MWMultiVersion.php in /srv/mediawiki/docroot/wikivoyage.org/w/favicon.php on line 2
[22:37:56] Fatal error: Class undefined: Orgs in /srv/mediawiki/php-1.29.0-wmf.8/extensions/EducationProgram/includes/api/ApiRefreshEducation.php on line 44
[22:39:03] twentyafterfour: on tin? or elsewhere?
[22:39:26] mw1177
[22:39:32] that was hhvm cache issues yesterday I think. ostriches ran into it somewhere
[22:39:33] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2974386 (10GWicke)
[22:39:39] mw1181
[22:40:02] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2781285 (10GWicke)
[22:40:02] these are low frequency fatals normally go under the radar
[22:40:04] mw1181, mw1272, mw1212, mw1174
[22:40:17] Spotted those yesterday, hhvm restart on those nodes fixes it
[22:40:33] But I'm worried this is a wider problem waiting to rear its head, I had planned to work on it today but other things got in the way
[22:40:45] It's those 4 nodes right now though
[22:41:09] mutante: Can we get an hhvm restart on those 4 apaches? ^
[22:43:44] must be symlink resolution problem...
[22:44:16] Yeah
[22:44:33] (03CR) 10Dzahn: [C: 032] site.pp, DHCP: remove db1019, db1042 [puppet] - 10https://gerrit.wikimedia.org/r/334010 (https://phabricator.wikimedia.org/T149793) (owner: 10Dzahn)
[22:44:39] ostriches: yep
[22:44:41] Sometimes HHVM doesn't resolve it where we want/expect, and it gets cached as such
[22:45:00] Funny it only seems to hit weird less common entries
[22:45:06] favicon, robots, extract2
[22:45:16] I would expect api/index/load would blow up much louder
[22:46:00] Ultimate solution is to remove some of the symlink weirdness so things are more straightforward.
[22:46:12] !log mw1181, mw1272, mw1212, mw1174 - service hhvm restart
[22:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:46:19] on 2 of them it was fast, 2 of them it takes forever
[22:47:06] mw1269 Fatal error: File not found: /srv/mediawiki/docroot/wikisource.org/w/../multiversion/MWMultiVersion.php in /srv/mediawiki/docroot/wikisource.org/w/api.php on line 2
[22:47:12] 3/4 done
[22:47:16] mw1255 File not found: /srv/mediawiki/docroot/mediawiki/w/../multiversion/MWMultiVersion.php in /srv/mediawiki/docroot/mediawiki/w/extract2.php on line 5
[22:47:16] 4/4 done
[22:47:28] mw1188 File not found: /srv/mediawiki/docroot/wikiversity.org/w/../multiversion/MWMultiVersion.php in /srv/mediawiki/docroot/wikiversity.org/w/favicon.php on line 2
[22:47:40] Son of a....
[22:48:00] mw1277 Fatal error: unknown exception
[22:48:11] mw1241 File not found: /srv/mediawiki/docroot/wikisource.org/w/../multiversion/MWMultiVersion.php in /srv/mediawiki/docroot/wikisource.org/w/api.php on line 2
[22:48:42] maybe we should put a symlink to multiversion in /srv/mediawiki/docroot/*/w/
[22:49:03] because the only way it's finding it depends on .. resolving to the actual parent and not the docroot dir
[22:49:35] twentyafterfour: Yeah, which seems to Usually Work
[22:49:39] Except when it's not
[22:49:47] I'm afraid of it not'ing everywhere real quick
[22:49:55] something somewhere is resolving .. by removing path segments instead of examining the filesystem?
[22:50:12] twentyafterfour: I'm not a huge fan of exposing multiversion in the docroot...
[22:50:23] hmm good point
[22:50:47] use absolute paths instead of ..?
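The question at [22:49:55] — whether something resolves `..` by dropping path segments rather than consulting the filesystem — is exactly the difference between lexical and physical path resolution through a symlink, and it reproduces outside HHVM. A small self-contained illustration (the directory names are invented, standing in for the docroot layout in the fatals above): `docroot/w` is a symlink into `real/`, so `docroot/w/../multiversion` exists physically but not lexically.

```python
import os
import tempfile

# Layout (hypothetical): real/w and real/multiversion exist;
# docroot/w is a symlink to real/w, and docroot/multiversion does NOT exist.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "real", "w"))
os.makedirs(os.path.join(base, "real", "multiversion"))
os.makedirs(os.path.join(base, "docroot"))
os.symlink(os.path.join(base, "real", "w"), os.path.join(base, "docroot", "w"))

p = os.path.join(base, "docroot", "w", "..", "multiversion")
lexical = os.path.normpath(p)    # drops "w/.." textually -> docroot/multiversion
physical = os.path.realpath(p)   # walks the symlink first -> real/multiversion

print(os.path.exists(lexical))   # False: the lexical path misses the real tree
print(os.path.exists(physical))  # True: resolving through the filesystem works
```

This is the failure mode the `realpath()` suggestion later in the log addresses: any component that normalizes `..` textually lands on the nonexistent `docroot/multiversion`, matching the "File not found" fatals above.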
[22:52:17] Problem with absolute is that you end up using mediawiki instead of -staging in mediawiki-staging
[22:53:10] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2974435 (10GWicke) @Tgr and I discussed some more details in the office today. - General syntax: We both see general consensus around using query stri...
[22:53:38] !log analytics1015, analytics1026 - puppet node clean (again?) - again having problems to remove decom'ed nodes from Icinga
[22:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:54:34] looks like https://github.com/facebook/hhvm/issues/2083
[22:56:24] (03PS5) 10Madhuvishy: [WIP] toolschecker: Split each check into a separate uwsgi application [puppet] - 10https://gerrit.wikimedia.org/r/334433
[22:56:55] !log analytics1015, analytics1026 - puppet node clean (again?) - again having problems to remove decom'ed nodes from Icinga (T147313)
[22:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:59] T147313: Decommission analytics1026 and analytics1015 - https://phabricator.wikimedia.org/T147313
[22:58:43] !log db1019, db1042 - revoke puppet certs, delete salt keys, schedule icinga downtime, stop services (T149793, T146265)
[22:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:48] T149793: Decommission db1042 - https://phabricator.wikimedia.org/T149793
[22:58:49] T146265: db1019: Decommission - https://phabricator.wikimedia.org/T146265
[22:59:23] ostriches: maybe we should call realpath() to resolve the symlinks explicitly?
[23:00:34] it's behaving as if some code is using basename() to resolve the parent dir
[23:00:48] I mean dirname()
[23:04:58] twentyafterfour: that might work
[23:05:59] But I still want to see if we can simplify the symlinks. They're super confusing
[23:11:49] yeah
[23:12:21] (03PS3) 10RobH: setting archiva.w.o to use le 1 or 2 [puppet] - 10https://gerrit.wikimedia.org/r/334229
[23:13:40] (03CR) 10RobH: [C: 032] setting archiva.w.o to use le 1 or 2 [puppet] - 10https://gerrit.wikimedia.org/r/334229 (owner: 10RobH)
[23:14:00] !log going to try to convert archiva.wikimedia.org from GS to LE cert. will require rehup of nginx
[23:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:51] (03CR) 10Dzahn: "I want to run all linting changes in compiler before merging them, but since this touches multiple modules that means having to find a rel" [puppet] - 10https://gerrit.wikimedia.org/r/334295 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys)
[23:17:58] ostriches: standard-docroot already has absolute symlinks: w -> /srv/mediawiki/w
[23:18:21] (03CR) 10Dzahn: "these are nicer to merge, the number of lines doesn't matter that much, but the number of modules does" [puppet] - 10https://gerrit.wikimedia.org/r/334302 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys)
[23:18:38] so using relative includes is probably pointless
[23:18:41] twentyafterfour: Yeah, the inconsistency drives me insane
[23:18:49] and it's probably what's confusing hhvm
[23:19:47] I mean why even have separate docroot symlinks, can't we just configure the webserver and hhvm to look at the standard docroot for all of those sites and eliminate all the mess?
[23:20:48] Well, some wikis *need* alternative docroots.
[23:20:54] The ones that don't could probably do that, yes. That would remove 1 layer of dumb
[23:21:11] (03CR) 10Dzahn: [C: 032] "no-op http://puppet-compiler.wmflabs.org/5248/" [puppet] - 10https://gerrit.wikimedia.org/r/334302 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys)
[23:21:41] (03PS3) 10Dzahn: phabricator: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334302 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys)
[23:21:56] I guess the idea in theory is that it allows deployers to adjust the document root without a puppet change, but honestly we should think long and hard before creating custom docroots
[23:22:00] That stuff sticks around forever
[23:23:05] We could maybe handle the "non-wiki docroot stuff" in another way like rewrites.
[23:23:12] Just a thought
[23:26:05] twentyafterfour: My other thought is that maybe we can move w/ into standard-docroot/w directly and get rid of the top-level w/
[23:26:12] Unless other things need that...
[23:26:31] Hard to search this stuff because so many of these symlinks break on your localhost :\
[23:27:19] yeah
[23:27:56] The nice thing about symlinking foo.org -> standard-docroot though is it potentially lets us template-ize the apache config in a way that allows for both custom legacy docroots & the standard one
[23:28:26] I mean you could do it other ways and still template I guess, just seems most straightforward
[23:28:47] (03PS1) 10RobH: update archiva.w.o to use LE cert 2 of 2 (planned) [puppet] - 10https://gerrit.wikimedia.org/r/334445
[23:30:20] chad@notsexy /a/ops/mediawiki-config/docroot (master)$ find . -type l | wc -l
[23:30:20] 158
[23:30:25] twentyafterfour: Sadfaces ^ :(
[23:30:36] Oh, it's worse than that
[23:30:44] 211
[23:31:01] 158 in the docroot alone
[23:31:27] yeah
[23:31:29] it's nuts
[23:31:39] I don't think we need most of those symlinks
[23:31:40] (03PS2) 10RobH: update archiva.w.o to use LE cert 2 of 2 (planned) [puppet] - 10https://gerrit.wikimedia.org/r/334445
[23:31:50] Now granted, most of those are actually the noc/conf stuff
[23:31:54] Which aren't...as bad
[23:32:02] It's the directory symlinks where it gets confusing
[23:32:20] is it all clear for me to go ahead with the train? No remaining blockers
[23:32:20] (03CR) 10RobH: [C: 032] update archiva.w.o to use LE cert 2 of 2 (planned) [puppet] - 10https://gerrit.wikimedia.org/r/334445 (owner: 10RobH)
[23:32:57] twentyafterfour: I think we're fine
[23:33:29] (03PS1) 1020after4: all wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334446
[23:33:31] (03CR) 1020after4: [C: 032] all wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334446 (owner: 1020after4)
[23:33:51] (03PS3) 10Dzahn: icinga: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334286 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys)
[23:33:58] !log archiva.w.o maint done, uses new LE cert.
[23:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:34:35] robh: cool, so we are basically done with the original list, right
[23:34:42] i think so
[23:34:44] :)
[23:34:58] (03CR) 10Dzahn: [C: 032] "no-op http://puppet-compiler.wmflabs.org/5250/" [puppet] - 10https://gerrit.wikimedia.org/r/334286 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys)
[23:35:00] (03Merged) 10jenkins-bot: all wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334446 (owner: 1020after4)
[23:35:09] PROBLEM - puppet last run on notebook1001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [23:35:13] (03CR) 10jenkins-bot: all wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334446 (owner: 1020after4) [23:37:01] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.29.0-wmf.9 [23:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:12] ok this is weird, the error came back: [23:38:13] Warning: Empty regular expression in /srv/mediawiki/php-1.29.0-wmf.9/includes/parser/DateFormatter.php on line 200 [23:38:25] (03PS3) 10Dzahn: gerrit/git/graphite: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334285 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [23:38:55] (03PS1) 10RobH: archiva.w.o converted to le [puppet] - 10https://gerrit.wikimedia.org/r/334448 [23:39:09] twentyafterfour: Funny, I don't think any of the www.*.org symlinks to wwwportal are used [23:39:13] I think those all use wwwportal directly [23:39:27] Yep called it [23:39:57] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/5251/" [puppet] - 10https://gerrit.wikimedia.org/r/334285 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [23:40:07] (03CR) 10RobH: [C: 032] archiva.w.o converted to le [puppet] - 10https://gerrit.wikimedia.org/r/334448 (owner: 10RobH) [23:40:15] (03PS2) 10RobH: archiva.w.o converted to le [puppet] - 10https://gerrit.wikimedia.org/r/334448 [23:40:18] (03CR) 10RobH: [V: 032 C: 032] archiva.w.o converted to le [puppet] - 10https://gerrit.wikimedia.org/r/334448 (owner: 10RobH) [23:40:56] 06Operations, 10Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2974687 (10RobH) [23:40:59] 06Operations, 10Traffic, 13Patch-For-Review: convert archiva to use Letsencrypt for SSL cert (deadline 2017-05-08) - https://phabricator.wikimedia.org/T154942#2974685 (10RobH) 05Open>03Resolved Conversion done, serving the new LE 
cert. I've removed the old certificate/key off the host, and out of puppe... [23:41:06] (03PS1) 10Chad: Beta: Just use standard docroot directly for most sites [puppet] - 10https://gerrit.wikimedia.org/r/334449 [23:41:10] 06Operations, 10Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2240497 (10RobH) [23:41:26] mutante: though i still need to fix lists via the python script and adding file permission support [23:41:37] and then we'll likely add mail servers to the LE overall task i guess... [23:41:38] robh: ah, right [23:41:46] it's grown in scope to mean all non-wildcard [23:41:50] where i know it wasn't the original intent but meh [23:41:56] there are more internal ones, but i dunno [23:42:01] for example CI [23:42:07] they are not for end users [23:42:20] those are just internal domains as well right, wmnet? [23:42:38] or are they wikimedia.org? [23:42:46] contint1001/2001 f.e. are wikimedia.org [23:42:56] (03PS1) 10Chad: Drop www.*.org symlinks to wwwportal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334450 [23:42:57] ahh, then yeah we should likely convert to LE.
[23:44:20] ok I'm rolling back to wmf.8 [23:44:47] (03Abandoned) 10Chad: Swap wmfwiki docroot to wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/333974 (owner: 10Chad) [23:44:49] (03PS1) 1020after4: group2 wikis to 1.29.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334451 [23:44:51] (03CR) 1020after4: [C: 032] group2 wikis to 1.29.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334451 (owner: 1020after4) [23:45:43] (03PS3) 10Dzahn: jenkins: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334288 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [23:46:29] (03Merged) 10jenkins-bot: group2 wikis to 1.29.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334451 (owner: 1020after4) [23:47:20] (03CR) 10Dzahn: [C: 032] jenkins: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334288 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [23:47:55] (03CR) 10jenkins-bot: group2 wikis to 1.29.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334451 (owner: 1020after4) [23:49:06] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 wikis to 1.29.0-wmf.8 [23:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:11] (03PS3) 10Dzahn: monitoring: linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334297 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [23:49:18] (03PS1) 10Volans: Testreduce: allow to decide the state of the services [puppet] - 10https://gerrit.wikimedia.org/r/334452 (https://phabricator.wikimedia.org/T156177) [23:50:06] 06Operations, 10ops-codfw: Codfw: Missing mgmt dns for db2025-db2027 - https://phabricator.wikimedia.org/T156342#2974718 (10Volans) @Marostegui I don't have any history on those, from a quick search those were old DBs migrated from Tampa if I'm not mistaken. 
[23:51:30] 06Operations, 10ops-codfw: Codfw: Missing mgmt dns for db2025-db2027 - https://phabricator.wikimedia.org/T156342#2974725 (10Dzahn) This T84160 seems kind of related. @RobH do you know? [23:52:05] (03CR) 10Dzahn: [C: 032] monitoring: linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334297 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [23:53:27] (03PS3) 10Dzahn: docker: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334280 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [23:55:01] (03CR) 10Volans: "Puppet compiler: https://puppet-compiler.wmflabs.org/5252/ruthenium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/334452 (https://phabricator.wikimedia.org/T156177) (owner: 10Volans) [23:55:50] (03CR) 10Dzahn: [C: 032] docker: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334280 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [23:57:24] (03CR) 10Volans: "See inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/334280 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys)