[00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151215T0000). Please do the needful.
[00:00:05] bd808 aude Dereckson ebernhardson yurik: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:19] * aude waves :)
[00:00:35] o/
[00:00:57] * RoanKattouw takes it
[00:01:00] I didn't make a submodule bump for my backport... hopefully it isn't needed
[00:01:20] * bd808 is still confused about when gerrit does that magically and when it doesn't
[00:01:34] bd808: It does that magically if your repo is not named VisualEditor or CentralNotice
[00:01:37] bd808: should be automatic
[00:01:42] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/).
[00:02:43] oh, i guess i only pulled the beta-only config change to tin and not mira
[00:02:50] Don't worry
[00:02:52] I'm about to SWAT
[00:02:55] kk
[00:03:38] (03CR) 10Catrope: [C: 032] Update Wikidata property blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257397 (owner: 10Matěj Suchánek)
[00:03:56] (03CR) 10Catrope: [C: 032] Set formatterUrlProperty setting for Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258968 (https://phabricator.wikimedia.org/T121382) (owner: 10Aude)
[00:04:20] ebernhardson: sync master should do that for you anyway...
[00:04:33] Shouldn't have to manually do mira
[00:04:34] (03Merged) 10jenkins-bot: Update Wikidata property blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257397 (owner: 10Matěj Suchánek)
[00:04:59] (03Merged) 10jenkins-bot: Set formatterUrlProperty setting for Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258968 (https://phabricator.wikimedia.org/T121382) (owner: 10Aude)
[00:05:03] ostriches: sync-master is new to me, i will next time. what's it do? :)
[00:05:20] not mentioned anywhere in wikitech
[00:05:37] ebernhardson: it's part of scap
[00:05:42] well obviously :P
[00:05:43] it's run as part of the process
[00:05:47] i think nothing has changed
[00:05:50] you don't need to manually run it
[00:05:52] it's the first thing
[00:05:55] sync masters
[00:05:56] It keeps the other masters like mira in sync
[00:05:58] from perspective of a deployer
[00:05:59] sync rsync slaves
[00:06:03] sync to everything else
[00:06:07] !log catrope@tin Synchronized wmf-config/: SWAT: Wikidata config changes (duration: 00m 28s)
[00:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:06:13] aude: ---^^ please verify
[00:06:17] ok
[00:06:21] ahh, in the past i didn't bother syncing out changes to labs only files
[00:06:28] Asking for it to be a manual process... separate... It's asking for trouble :)
[00:06:31] yeah. I got burned by that last week too
[00:06:56] and I wrote the damn sync-master bits so if anyone should have known better...
[00:07:01] :)
[00:07:22] * bd808 puts on his "manager" t-shirt and walks away slowly
[00:07:39] I'm confused why are we talking about a manual process?
[00:07:39] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
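The sync order ostriches describes above (masters first, then rsync proxies, then everything else) can be sketched as follows. This is a minimal illustration of that ordering only, not scap's actual code; the host names and the `sync()` helper are hypothetical.

```python
# Sketch of the three-stage sync order described above:
# 1) co-masters (e.g. mira), 2) rsync proxies, 3) remaining app servers.
# Host lists and sync() are illustrative placeholders, not scap internals.

def sync(host):
    """Stand-in for pushing /srv/mediawiki-staging to one host."""
    return host

def deploy(masters, proxies, others):
    order = []
    for host in masters:   # "sync masters": keep co-masters in sync first
        order.append(sync(host))
    for host in proxies:   # "sync rsync slaves": the fan-out points
        order.append(sync(host))
    for host in others:    # "sync to everything else"
        order.append(sync(host))
    return order

order = deploy(["mira"], ["proxy1"], ["mw1001", "mw1002"])
```

Because the co-masters come first, a deployer never needs to update mira by hand; it is covered before any app server sees the change.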
[00:07:39] Because it's not lol
[00:07:39] RoanKattouw: looks ok (nothing obvious broken, etc)
[00:07:45] See ^^^^
[00:08:13] ostriches: some of us got used to being lazy and not syncing -labs config changes at all
[00:08:22] PROBLEM - Apache HTTP on mw1093 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:08:25] mira catches us when we do that
[00:08:42] bd808: it's always been like that for tin too :p
[00:09:04] I thought all you had to do on tin was fetch and rebase
[00:09:23] PROBLEM - HHVM rendering on mw1093 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:09:29] Eh I see
[00:09:36] Peeps lazy
[00:10:33] !log catrope@tin Synchronized php-1.27.0-wmf.8/extensions/MobileFrontend/: SWAT: fix page invalidation in mobile API (duration: 00m 36s)
[00:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:11:26] RoanKattouw: looking good from here.
[00:11:44] dr0ptp4kt_: can you test with the apps now?
[00:11:56] He's just turning on a phone
[00:12:00] standing next to my desk
[00:12:02] apisandbox looks better as does a curl on a random app server
[00:12:06] heh
[00:13:18] bd808: i tested ios, looks okay.
[00:13:27] bd808: that is it fixed the bad behavior
[00:13:34] thank you legoktm!
[00:13:39] yes, thank you legoktm
[00:13:44] :)
[00:14:01] Dereckson: You around for your SWAT changes?
[00:14:24] (03CR) 10Catrope: [C: 032] Use event-schemas repository for avro schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255135 (https://phabricator.wikimedia.org/T118570) (owner: 10EBernhardson)
[00:15:08] (03Merged) 10jenkins-bot: Use event-schemas repository for avro schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255135 (https://phabricator.wikimedia.org/T118570) (owner: 10EBernhardson)
[00:15:49] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10MediaWiki-extensions-CentralNotice, 10Traffic: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1879645 (10BBlack) The JSON payload size isn't what's important here, just the 2K URL limit....
[00:17:00] !log catrope@tin Synchronized wmf-config/: SWAT: Use event-schemas repository for avro schemas (duration: 00m 29s)
[00:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:17:22] RoanKattouw: will know at 30 after how it's working (when the analytics pipeline picks up the latest events from kafka)
[00:17:32] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0]
[00:18:31] YuviPanda: I've finally pinpointed which server is the one that always takes longer to do deployment syncs than any other server: it's silver
[00:18:48] RoanKattouw: might need to re-sync InitialiseSettings.php, a number of servers reported file not found, and they most likely cached that
[00:18:54] RoanKattouw: haha, of course
[00:18:58] do you know why?
[00:19:05] a sync should do a touch as well
[00:19:06] sync-file/sync-dir always get stuck at 99% (ok:467; fail:0; left:1) for like 10 seconds
[00:19:15] ebernhardson: Oh I guess I'll have to touch it
[00:19:17] Will touch and sync
[00:19:32] because of the config cache
[00:19:36] yea
[00:19:53] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 29s)
[00:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:20:15] YuviPanda: Not offhand. Is silver in a different rack than all the other app servers perhaps?
[00:20:22] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10MediaWiki-extensions-CentralNotice, 10Traffic: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1879655 (10Ottomata) A sane solution would be to accept events via POST instead of encoded q...
[00:20:27] RoanKattouw: possibly
[00:20:30] or does it have slower connectivity to the nearest deployment proxy for some other reason?
[00:20:50] zombocom
[00:21:07] ?
[00:21:23] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[00:21:32] > This is ZomboCom! And welcome to you, who have come to ZomboCom. Anything ... is possible ... at ZomboCom. You can do ... anything at ZomboCom. The infinite is possible at ZomboCom. The unattainable is unknown at ZomboCom. Welcome to ZomboCom. This ... is ZomboCom.
[00:21:43] RoanKattouw: basically, all the things are possible! :)
[00:22:54] RoanKattouw: file a bug! :D
[00:23:27] (03CR) 10Catrope: [C: 032] delete GraphImgServiceAlways, add GraphEnableGZip [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259189 (owner: 10Yurik)
[00:24:01] RoanKattouw: also https://en.wikipedia.org/wiki/Zombo.com for the reference :)
[00:24:06] * YuviPanda goes afk for a bit
[00:27:12] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
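The touch-and-sync step discussed at [00:19] works because a config cache keyed on the file's modification time only re-reads the file when its mtime changes. The sketch below shows that invalidation idea in miniature; the function and cache names are illustrative, not MediaWiki's actual implementation.

```python
import os

# Sketch of mtime-keyed config caching: a cached parse is reused only
# while the source file's mtime matches the one recorded at parse time.
# Touching the file bumps the mtime and forces a re-read on next load.
_cache = {}  # path -> (mtime, parsed_value)

def load_config(path, parse):
    mtime = os.stat(path).st_mtime
    cached = _cache.get(path)
    if cached and cached[0] == mtime:
        return cached[1]           # hit: file unchanged since last parse
    value = parse(path)            # miss or stale: re-parse and re-key
    _cache[path] = (mtime, value)
    return value
```

This is why syncing InitialiseSettings.php alone was not enough: with an unchanged mtime the stale cached result (including a cached "file not found") keeps winning until a `touch` invalidates it.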
[00:27:28] RoanKattouw: FYI, silver is in the same rack as snapshot1004
[00:28:22] PROBLEM - dhclient process on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:28:42] PROBLEM - salt-minion processes on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:32:28] !log catrope@tin Synchronized wmf-config/CommonSettings.php: SWAT: Graph config updates (duration: 00m 29s)
[00:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:33:07] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Graph config updates (duration: 00m 28s)
[00:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:34:12] !log puppet disabled on silver since last Puppet run was at Fri Dec 11 15:27:28 UTC 2015 due to 'testing osm debugging'
[00:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:35:57] OK, SWAT done
[00:36:05] I didn't do Dereckson's patches because they weren't here
[00:37:21] (The user, I mean, not the patches)
[00:44:57] thx
[01:00:05] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10MediaWiki-extensions-CentralNotice, 10Traffic: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1879772 (10Tgr) @BBlack, yes, which is why I am saying something should be done to improve c...
[01:12:22] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: puppet fail
[01:28:08] (03PS1) 10EBernhardson: Revert to previous version of cirrus avro schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259202
[01:28:24] (03CR) 10EBernhardson: [C: 032] Revert to previous version of cirrus avro schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259202 (owner: 10EBernhardson)
[01:28:40] (03CR) 10jenkins-bot: [V: 04-1] Revert to previous version of cirrus avro schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259202 (owner: 10EBernhardson)
[01:28:49] (03CR) 10jenkins-bot: [V: 04-1] Revert to previous version of cirrus avro schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259202 (owner: 10EBernhardson)
[01:28:53] gah, of course i wrote tests..
[01:29:46] ebernhardson: What's wrong?
[01:30:00] (03PS2) 10EBernhardson: Revert to previous version of cirrus avro schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259202
[01:30:10] RoanKattouw: camus isn't reading in the events. It turns out it only runs once an hour so i had to wait till now to find out
[01:30:20] RoanKattouw: camus is the kafka->hdfs gateway
[01:30:45] this leaves the submodule out, it just points back at the previous schema revision
[01:31:04] (03CR) 10EBernhardson: [C: 032] Revert to previous version of cirrus avro schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259202 (owner: 10EBernhardson)
[01:31:32] (03Merged) 10jenkins-bot: Revert to previous version of cirrus avro schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259202 (owner: 10EBernhardson)
[01:31:33] i don't have enough access to the analytics cluster to debug, will have to wait till tomorrow for nuria
[01:31:42] RECOVERY - puppet last run on ms-be1020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:32:48] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Revert cirrus avro schema to 101446746400 due to camus not picking up events from new schema (duration: 00m 30s)
[01:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:35:33] PROBLEM - puppet last run on mw2071 is CRITICAL: CRITICAL: puppet fail
[01:37:53] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[02:03:36] !log deployed I6ebffe559 to job runners
[02:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:04:53] RECOVERY - puppet last run on mw2071 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[02:12:26] ori: everything looks fine, including running a manual jobchron with --verbose
[02:12:44] nice work
[02:12:44] the periodic task cycle is MUCH more efficient now (and faster)
[02:15:26] only takes a few good seconds
[02:18:43] PROBLEM - puppet last run on mw2105 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:25:02] PROBLEM - Apache HTTP on mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:25:11] PROBLEM - HHVM rendering on mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:25:26] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.8) (duration: 11m 38s)
[02:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:26:52] RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.168 second response time
[02:26:54] RECOVERY - HHVM rendering on mw1130 is OK: HTTP OK: HTTP/1.1 200 OK - 65697 bytes in 0.447 second response time
[02:28:52] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [24.0]
[02:32:26] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Dec 15 02:32:25 UTC 2015 (duration 6m 59s)
[02:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:32:52] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[02:36:11] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: /{format}/ is CRITICAL: Could not fetch url http://10.64.32.153:10042/complete/: Timeout on connection while downloading http://10.64.32.153:10042/complete/
[02:39:53] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy
[02:42:21] (03PS2) 10Dereckson: Throttle rule for University of Haifa event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259011 (https://phabricator.wikimedia.org/T121321)
[02:46:02] RECOVERY - puppet last run on mw2105 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[03:01:32] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:03:41] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
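Camus's hourly cadence, described at [01:30] above, means a schema change can only be verified after the next scheduled run. A small helper to compute that next run boundary is sketched below; `next_run` is an illustrative function under an assumed fixed hourly schedule, not part of camus itself.

```python
# Camus (the Kafka->HDFS importer mentioned above) is assumed here to run
# on a fixed hourly schedule, so the earliest moment a change can be
# checked is the next hour boundary. next_run() is illustrative only.

def next_run(now_seconds, period=3600):
    """Return the next period boundary strictly after now_seconds."""
    return ((now_seconds // period) + 1) * period
```

So a change landed at, say, 28 minutes past the hour cannot be confirmed or ruled out until roughly 32 minutes later, which matches the "had to wait till now to find out" delay above.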
[03:04:12] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: /{format}/ is CRITICAL: Could not fetch url http://10.64.48.29:10042/mml/: Timeout on connection while downloading http://10.64.48.29:10042/mml/
[03:04:40] Dereckson, hey
[03:05:23] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy
[03:06:29] (03CR) 10Alex Monk: [C: 032] ":S" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259010 (owner: 10Dereckson)
[03:07:07] (03CR) 10Alex Monk: [C: 032] Throttle rule for University of Haifa event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259011 (https://phabricator.wikimedia.org/T121321) (owner: 10Dereckson)
[03:07:35] (03Merged) 10jenkins-bot: Throttle typo: ip → IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259010 (owner: 10Dereckson)
[03:07:56] (03Merged) 10jenkins-bot: Throttle rule for University of Haifa event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259011 (https://phabricator.wikimedia.org/T121321) (owner: 10Dereckson)
[03:08:03] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy
[03:09:02] !log krenair@tin Synchronized wmf-config/throttle.php: https://gerrit.wikimedia.org/r/#/c/259011/ and https://gerrit.wikimedia.org/r/#/c/259010/1 (duration: 00m 30s)
[03:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:09:17] Dereckson, done
[03:09:21] Thanks.
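The throttle.php change synced above adds an account-creation throttle exception for an event. A hedged sketch of how such a list of exception rules can be consulted follows: each new rule is appended, so earlier rules survive rather than being overwritten. The field names and the default value are illustrative assumptions, not mediawiki-config's actual schema.

```python
import ipaddress
from datetime import datetime

# Sketch of an appended throttle-exception list: add_rule() appends, so
# adding a new event rule never clobbers existing ones. Field names
# ("range", "from", "to", "value") are illustrative, not throttle.php's
# real structure.

exceptions = []

def add_rule(ip_range, start, end, value):
    exceptions.append({"range": ip_range, "from": start, "to": end, "value": value})

def throttle_for(ip, when, default=6):
    """Return the first matching rule's limit, else the default."""
    addr = ipaddress.ip_address(ip)
    for rule in exceptions:  # appended order: every rule gets a chance
        if (addr in ipaddress.ip_network(rule["range"])
                and rule["from"] <= when <= rule["to"]):
            return rule["value"]
    return default
```

Appending keeps the change additive, which is also why the later question about "overwriting the previous definition" turns out to be a non-issue.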
[03:09:36] np, sorry it didn't get done earlier
[03:09:38] * Krenair has had a busy day
[03:11:51] PROBLEM - graphoid endpoints health on sca1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200)
[03:12:12] PROBLEM - graphoid endpoints health on sca1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200)
[03:26:41] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:27:41] PROBLEM - puppet last run on mw2008 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:43:02] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0]
[03:46:52] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[03:53:01] RECOVERY - puppet last run on mw2008 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[03:56:08] Dereckson: doesn't https://gerrit.wikimedia.org/r/#/c/259011 overwrite the previous definition of the variable instead of adding to the array?
[03:56:38] nevermind, I guess PHP makes no sense to me
[03:56:43] s/to me// :P
[04:02:32] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [24.0]
[04:06:22] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[04:14:12] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0]
[04:17:33] (03PS1) 10Ori.livneh: Convert mw1026 - mw1035 from app servers to job runners [puppet] - 10https://gerrit.wikimedia.org/r/259207
[04:18:03] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[04:44:22] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [100000000.0]
[04:52:43] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200)
[04:52:43] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200)
[04:52:43] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200)
[04:52:43] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200)
[04:52:43] PROBLEM - Restbase endpoints health on cerium is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200)
[04:52:44] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200)
[04:52:44] PROBLEM - Restbase endpoints health on restbase1009 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200)
[04:52:45] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200)
[04:52:45] PROBLEM - Restbase endpoints health on xenon is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200)
[04:52:52] PROBLEM - Restbase endpoints health on restbase2003 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200)
[04:53:01] PROBLEM - Restbase endpoints health on restbase2004 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200)
[04:53:02] PROBLEM - Restbase endpoints health on restbase2005 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200)
[04:53:02] PROBLEM - Restbase endpoints health on restbase-test2002 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200)
[04:53:14] PROBLEM - Restbase endpoints health on restbase2002 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200)
[04:53:15] PROBLEM - Restbase endpoints health on restbase2006 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200)
[04:53:38] hmm, that looks like graphoid being unhappy
[04:53:42] PROBLEM - Restbase endpoints health on restbase1006 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200)
[04:53:42] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200)
[04:53:52] PROBLEM - Restbase endpoints health on restbase2001 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200)
[04:53:53] PROBLEM - Restbase endpoints health on restbase-test2001 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200)
[04:53:53] PROBLEM - Restbase endpoints health on restbase-test2003 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200)
[04:54:22] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200)
[04:55:13] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:03] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy
[04:58:32] sent a mail to yuri
[04:59:04] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0]
[05:01:02] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[05:02:07] http://graphoid.wikimedia.org/mediawiki.org/v1/png/Extension:Graph/0/be66c7016b9de3188ef6a585950f10dc83239837.png <- returns a 400 & "info/mwapi-error"
[05:02:11] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[05:21:04] (03PS24) 10KartikMistry: service-runner migration for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657)
[05:42:42] PROBLEM - dhclient process on pollux is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:43:13] PROBLEM - Check size of conntrack table on pollux is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:43:22] PROBLEM - salt-minion processes on pollux is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:43:22] PROBLEM - DPKG on pollux is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:43:22] PROBLEM - Disk space on pollux is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:43:42] PROBLEM - puppet last run on pollux is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:44:23] PROBLEM - RAID on pollux is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:44:34] PROBLEM - Corp OIT LDAP Mirror on pollux is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:44:34] PROBLEM - configured eth on pollux is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:06:23] is anyone around who can fix jenkins?
[06:20:47] http://i.stack.imgur.com/qgQSe.png
[06:20:49] lol
[06:21:32] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[06:24:52] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: puppet fail
[06:30:32] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:32] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:43] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:02] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:02] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:12] PROBLEM - puppet last run on db1045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:32] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:33] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:52] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:22] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:22] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:32] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:02] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:33:04] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:32] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:39:41] PROBLEM - NTP on pollux is CRITICAL: NTP CRITICAL: No response from NTP server
[06:44:32] PROBLEM - SSH on pollux is CRITICAL: Server answer
[06:48:22] RECOVERY - SSH on pollux is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[06:52:21] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:53:52] PROBLEM - puppet last run on db2061 is CRITICAL: CRITICAL: puppet fail
[06:55:52] RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:56:02] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:56:21] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:56:22] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:56:23] RECOVERY - puppet last run on db1045 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:56:44] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:56:51] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:56:52] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:57:03] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:57:33] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:41] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:42] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:52] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:12] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:13] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:13:51] PROBLEM - SSH on pollux is CRITICAL: Server answer [07:15:43] RECOVERY - SSH on pollux is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [07:18:40] <_joe_> !log logged into console on pollux, that made it responsive [07:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:19:23] RECOVERY - puppet last run on db2061 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [07:31:41] PROBLEM - SSH on pollux is CRITICAL: Server answer [07:33:33] RECOVERY - SSH on pollux is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [07:39:41] PROBLEM - SSH on pollux is CRITICAL: Connection timed out [07:44:11] PROBLEM - Host pollux is DOWN: PING CRITICAL - Packet loss = 100% [07:45:52] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [07:46:42] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 205, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR [07:48:49] <_joe_> !log rebooting pollux [07:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:49:02] RECOVERY - Check size of conntrack table on pollux is OK: OK: nf_conntrack is 0 % full [07:49:12] RECOVERY - Host pollux is UP: PING OK - Packet loss = 0%, RTA = 36.53 ms [07:49:22] RECOVERY - salt-minion processes on pollux is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:49:22] RECOVERY - Disk space on pollux is OK: DISK OK [07:49:22] RECOVERY - DPKG on pollux is OK: All packages OK [07:49:23] RECOVERY - SSH on pollux is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [07:49:43] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently 
enabled, last run 2 hours ago with 0 failures [07:50:36] RECOVERY - Corp OIT LDAP Mirror on pollux is OK: LDAP OK - 0.111 seconds response time [07:50:36] RECOVERY - RAID on pollux is OK: OK: no RAID installed [07:50:36] RECOVERY - configured eth on pollux is OK: OK - interfaces up [07:50:37] RECOVERY - dhclient process on pollux is OK: PROCS OK: 0 processes with command name dhclient [08:05:52] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 [08:11:51] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [08:15:32] Nemo_bis, T117854 is an example of a ticket where #Mediawiki-database is apropiate but #DBA is not [08:25:31] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 [08:26:21] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 207, down: 0, dormant: 0, excluded: 0, unused: 0 [08:29:09] !log stopping zuul-merger on gallium for maintenance [08:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:34:42] PROBLEM - zuul_merger_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [08:35:28] hashar: do you have access to icinga or shall I silence the check for you? [08:35:54] moritzm: I don't have privileges on icinga. 
I don't think non-ops can [08:36:19] k, i'll mark zuul as in maintenance [08:38:20] I have no idea how permissions are granted in Icinga, possibly being in the contact group of an alarm should give ACK permission [08:44:14] investigating whether to migrate to icinga2 or shinken is one of the TechOps quarterly goals, maybe that will be possible with the new setup [08:47:02] !log restarted zuul-merger on gallium [08:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:48:32] RECOVERY - zuul_merger_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [08:52:32] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [08:52:32] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [1000.0] [08:53:03] moritzm: or maybe it is because my contact group is 'amusso' but I log in as 'hashar' :-D [08:57:41] (03CR) 10Giuseppe Lavagetto: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/256480 (owner: 10Giuseppe Lavagetto) [08:58:41] <_joe_> the 5xx spike was an all-ulsfo woe which has already passed [08:58:54] (03CR) 10Hashar: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/256480 (owner: 10Giuseppe Lavagetto) [09:00:04] stat1002.eqiad.wmnet seems to be hosed up, even ps hangs [09:02:32] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:02:33] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:05:03] (03PS1) 10Aude: Enable Wikidata data access for meta-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259218 (https://phabricator.wikimedia.org/T117524) [09:06:57] dcausse: load is at 590 [09:07:04]
http://ganglia.wikimedia.org/latest/?r=custom&cs=12%2F14%2F2015+00%3A00&ce=&m=cpu_report&c=Analytics+cluster+eqiad&h=stat1002.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS [09:07:24] it has a nice CPU plateau since yesterday :( [09:11:32] PROBLEM - Apache HTTP on mw1128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:11:42] PROBLEM - HHVM rendering on mw1128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:12:36] moritzm: yes looks like a broken mounted filesystem (nfs or fuse_dfs) :/ [09:12:42] PROBLEM - SSH on mw1128 is CRITICAL: Server answer [09:13:11] PROBLEM - HHVM rendering on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:13:31] PROBLEM - Apache HTTP on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:13:31] PROBLEM - puppet last run on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:13:31] PROBLEM - DPKG on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:13:31] PROBLEM - nutcracker port on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:13:41] PROBLEM - dhclient process on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:13:52] PROBLEM - RAID on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:13:53] PROBLEM - configured eth on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:13:55] (03CR) 10Aude: [C: 04-2] "scheduled to deploy at 23:00 UTC and not before then" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259218 (https://phabricator.wikimedia.org/T117524) (owner: 10Aude) [09:14:01] PROBLEM - Check size of conntrack table on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:14:21] PROBLEM - Disk space on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:14:22] PROBLEM - nutcracker process on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:14:23] PROBLEM - salt-minion processes on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:14:23] PROBLEM - HHVM processes on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:14:53] PROBLEM - RAID on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:15:01] PROBLEM - puppet last run on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:15:02] PROBLEM - Check size of conntrack table on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:15:12] PROBLEM - Disk space on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:15:19] seems to be the hdfs mount, I can access the NFS mount point (/mnt/data) just fine [09:15:31] PROBLEM - nutcracker process on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:15:31] PROBLEM - configured eth on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:15:52] PROBLEM - dhclient process on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:16:02] PROBLEM - DPKG on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:16:12] PROBLEM - HHVM processes on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:16:22] PROBLEM - salt-minion processes on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:20:34] PROBLEM - SSH on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:21:01] (03PS2) 10Hashar: enable EventBus logging channel (currently only in beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259156 (https://phabricator.wikimedia.org/T116786) (owner: 10Eevans) [09:21:01] PROBLEM - nutcracker port on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:21:32] RECOVERY - dhclient process on mw1138 is OK: PROCS OK: 0 processes with command name dhclient [09:21:52] RECOVERY - DPKG on mw1138 is OK: All packages OK [09:21:52] (03CR) 10Hashar: [C: 032] enable EventBus logging channel (currently only in beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259156 (https://phabricator.wikimedia.org/T116786) (owner: 10Eevans) [09:21:53] RECOVERY - HHVM processes on mw1138 is OK: PROCS OK: 6 processes with command name hhvm [09:22:03] RECOVERY - salt-minion processes on mw1138 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:22:17] (03Merged) 10jenkins-bot: enable EventBus logging channel (currently only in beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259156 (https://phabricator.wikimedia.org/T116786) (owner: 10Eevans) [09:22:23] RECOVERY - SSH on mw1138 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [09:22:41] RECOVERY - RAID on mw1138 is OK: OK: no RAID installed [09:22:42] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 31 minutes ago with 0 failures [09:22:51] RECOVERY - Check size of conntrack table on mw1138 is OK: OK: nf_conntrack is 1 % full [09:22:51] RECOVERY - nutcracker port on mw1138 is OK: TCP OK - 0.000 second response time on port 11212 [09:23:01] RECOVERY - HHVM rendering on mw1138 is OK: HTTP OK: HTTP/1.1 200 OK - 65704 bytes in 7.472 second response time [09:23:01] RECOVERY - Disk space on mw1138 is OK: DISK OK [09:23:12] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.111 second response time [09:23:12] RECOVERY - nutcracker process on mw1138 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [09:23:12] RECOVERY - configured eth on mw1138 is OK: OK - interfaces up [09:25:35] 6operations: fuse-dfs problems on stat1002 - https://phabricator.wikimedia.org/T121492#1880201 (10MoritzMuehlenhoff) 3NEW [09:26:22] 
(03PS3) 10Giuseppe Lavagetto: Use system-wide etcd configurations for the etcd driver [software/conftool] - 10https://gerrit.wikimedia.org/r/256480 [09:26:24] (03PS2) 10Giuseppe Lavagetto: Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/258981 [09:26:26] (03PS5) 10Giuseppe Lavagetto: Add confctl the ability to find all instances of an entity [software/conftool] - 10https://gerrit.wikimedia.org/r/258428 [09:31:32] RECOVERY - Disk space on stat1002 is OK: DISK OK [09:31:52] RECOVERY - configured eth on mw1128 is OK: OK - interfaces up [09:31:52] RECOVERY - Check size of conntrack table on mw1128 is OK: OK: nf_conntrack is 0 % full [09:32:03] RECOVERY - Disk space on mw1128 is OK: DISK OK [09:32:03] !log hashar@tin Synchronized wmf-config/InitialiseSettings-labs.php: enable EventBus logging channel (currently only in beta) https://phabricator.wikimedia.org/T116786 (duration: 08m 57s) [09:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:32:11] RECOVERY - nutcracker process on mw1128 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [09:32:12] RECOVERY - salt-minion processes on mw1128 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:32:12] RECOVERY - HHVM processes on mw1128 is OK: PROCS OK: 6 processes with command name hhvm [09:32:32] RECOVERY - SSH on mw1128 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [09:32:34] !log umounted/remounted hdfs mount on stat1002 (got stuck due to kernel bug, see T121492) [09:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:33:12] RECOVERY - nutcracker port on mw1128 is OK: TCP OK - 0.000 second response time on port 11212 [09:33:12] RECOVERY - puppet last run on mw1128 is OK: OK: Puppet is currently enabled, last run 42 minutes ago with 0 failures [09:33:12] RECOVERY - DPKG on mw1128 is OK: All packages OK [09:33:13] RECOVERY - Apache HTTP on 
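The stuck fuse_dfs mount above is the classic failure mode where anything touching the mount point (even `ps` on stat1002) wedges in uninterruptible sleep until the mount is force-unmounted. A minimal, hedged sketch of a safe health probe (not the WMF tooling; all names illustrative): stat the mount point from a disposable child process so the monitoring process itself can never hang on it.

```python
import subprocess

def mount_responsive(path, timeout=5):
    """Probe a possibly hung (FUSE/NFS) mount point without wedging
    the caller: a stuck mount leaves processes in uninterruptible
    sleep, so do the stat(2) in a disposable child process instead."""
    proc = subprocess.Popen(
        ["stat", "--", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # The child may ignore even SIGKILL while stuck in D state;
        # abandon it rather than block on reaping it.
        proc.kill()
        return False
```

Note that a plain `subprocess.run(..., timeout=...)` is not safe here: after the timeout it kills the child and then waits for it, which blocks forever on a D-state process; `Popen` with an abandoned child avoids that.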
mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.703 second response time [09:33:22] RECOVERY - dhclient process on mw1128 is OK: PROCS OK: 0 processes with command name dhclient [09:33:29] dcausse: the hdfs mount is working again, but I suppose the load will remain high for a bit until the piled up processes have finished [09:33:32] RECOVERY - HHVM rendering on mw1128 is OK: HTTP OK: HTTP/1.1 200 OK - 65704 bytes in 7.836 second response time [09:33:42] RECOVERY - RAID on mw1128 is OK: OK: no RAID installed [09:33:49] moritzm: thanks! [09:34:43] couldn't find anything on the underlying fuse bug, but filed a Phab task so that we can find it if it reappears [09:41:54] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Νο, let's not go for this approach. bacula has support for bpipe plugins that allow streaming backups and restores as well as predump and " [puppet] - 10https://gerrit.wikimedia.org/r/259174 (https://phabricator.wikimedia.org/T120919) (owner: 10Dzahn) [09:53:51] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [09:53:52] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [09:53:52] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [09:53:52] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [09:54:52] <_joe_> uhm [09:55:00] <_joe_> looks like ulsfo is having issues [09:55:15] <_joe_> had [09:56:34] no, it was worse on esams [09:56:50] 50 vs 500/s [09:57:04] <_joe_> err, yes [09:57:06] but those are absolute numbers, not relative to the traffic [09:57:25] so "worse" is relative [09:57:43] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:57:43] RECOVERY - Esams HTTP 5xx reqs/min on
graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:58:52] can someone else have a look? unless it is an ongoing issue, I will go for a scheduled downtime now [09:59:18] <_joe_> it's over [09:59:43] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:59:51] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:02:53] !log stopping eventlogging mysql consumers [10:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:06:02] (03PS1) 10Jcrespo: Rolling in new mysql configuration for m4 servers [puppet] - 10https://gerrit.wikimedia.org/r/259222 [10:07:21] (03CR) 10Jcrespo: [C: 032] Rolling in new mysql configuration for m4 servers [puppet] - 10https://gerrit.wikimedia.org/r/259222 (owner: 10Jcrespo) [10:07:49] (03PS2) 10Jcrespo: Enable ferm on db1046 [puppet] - 10https://gerrit.wikimedia.org/r/240043 (owner: 10Muehlenhoff) [10:08:16] (03CR) 10Jcrespo: [C: 032] Enable ferm on db1046 [puppet] - 10https://gerrit.wikimedia.org/r/240043 (owner: 10Muehlenhoff) [10:09:56] !log enabling ferm, and restarting mysql at db1046 (m4-master, eventlogging) [10:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:15:53] jynus: nice, so I triaged it correctly :) [10:16:16] :-) [10:16:29] jynus: note however that precisely *because* that bug doesn't affect WMF, WMF is the reason the bug exists in the first place [10:16:41] is it better understood now?
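The percentages in those graphite-backed alerts ("11.11% of data above the critical threshold [1000.0]") are simply the share of recent datapoints over the threshold. A hedged sketch of that arithmetic only (the actual alerting logic lives in the check_graphite script in operations/puppet; the function name here is illustrative):

```python
def percent_over(datapoints, threshold):
    """Share of non-null datapoints above a threshold, as a percentage.

    Mirrors figures like '11.11% of data above the critical threshold
    [1000.0]', i.e. 1 of 9 recent samples over 1000 req/min."""
    samples = [p for p in datapoints if p is not None]
    if not samples:
        return 0.0
    over = sum(1 for p in samples if p > threshold)
    return 100.0 * over / len(samples)
```

This also explains the "worse is relative" exchange above: the check compares each datacenter's own sample count against a fixed threshold, so 50/s of errors in a small POP and 500/s in a large one can produce similar-looking percentages.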
I do not mean to not work on mediawiki-databases [10:16:51] I actually want to send a patch to that [10:17:00] but they are different things [10:17:03] The difference between #MediaWiki-Database and #Wikimedia-Database is (now) very clear [10:17:13] Just the process for schema changes is still unclear to me [10:17:13] I documented it here: [10:17:28] https://wikitech.wikimedia.org/wiki/Schema_changes [10:17:40] hope that helps [10:17:46] Thanks for working on a patch [10:17:52] please add comments of things that are not clear [10:17:58] and I will add them [10:18:10] basically I am trying to fix problems like that one [10:18:26] where schema changes go to the code, but are not applied, or vice versa [10:18:47] plus making sure I am notified; sometimes I do not even get notified of a schema change to apply to the wikis [10:18:58] I'll look at it again this evening [10:20:57] !log stopped eventlogging on dbstore1002 and db1047 [10:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:23:54] (03CR) 10Alexandros Kosiaris: "This commit message is deceptive. nothing about pollux ended up in the commit itself" [dns] - 10https://gerrit.wikimedia.org/r/258483 (https://phabricator.wikimedia.org/T120885) (owner: 10Papaul) [10:25:52] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [10:25:52] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [1000.0] [10:26:23] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [10:27:22] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[10:30:19] 6operations, 10ops-codfw, 5Patch-For-Review: return pollux to spares - https://phabricator.wikimedia.org/T117423#1880357 (10akosiaris) >>! In T117423#1878714, @Dzahn wrote: > what about site.pp and > > 2144 # LDAP servers relied on by OIT for mail > 2145 node /(dubnium|pollux)\.wikimedia\.org/ { > 2146... [10:30:22] ^those are mine, and I am blocked in the middle of a scheduled downtime by mforns [10:30:54] ping me if you want to deploy something on puppet so I can revert those [10:31:40] 6operations, 10ops-codfw, 5Patch-For-Review: return pollux to spares - https://phabricator.wikimedia.org/T117423#1880376 (10akosiaris) a:3Papaul [10:34:01] what's up with graphoid btw? mobrovac ? [10:35:32] (03CR) 10Alexandros Kosiaris: "that's kind of counter-intutive and a hack. I 'd rather we figured out some other way to have the cluster -> hosts mapping that we are mis" [puppet] - 10https://gerrit.wikimedia.org/r/258473 (https://phabricator.wikimedia.org/T119520) (owner: 10Filippo Giunchedi) [10:36:02] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:36:02] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:36:42] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [10:37:32] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [10:39:15] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1880399 (10akosiaris) >>! In T120281#1879090, @Ottomata wrote: > @akosiaris can you help with this? I am not even sure...
[10:40:03] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/server-side-events-log consumer/mysql-m4-master-03 consumer/mysql-m4-master-02 consumer/mysql-m4-master-01 consumer/mysql-m4-master-00 consumer/client-side-events-log consumer/all-events-log processor/server-side-0 processor/client-side-11 processor/client-side-10 processor/client-side-09 processor/client-side [10:40:47] (03PS1) 10Giuseppe Lavagetto: puppet: extract the common parts of operations/puppet git clone [puppet] - 10https://gerrit.wikimedia.org/r/259225 [10:41:11] PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [10:41:39] <_joe_> jynus: ^^ expected? [10:42:28] yes, part of the log [10:42:51] PROBLEM - haproxy failover on dbproxy1004 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [10:43:33] well, I should have failed it over manually, but it doesn't matter; I caught 5/6 places that were expected to fail [10:47:12] RECOVERY - haproxy failover on dbproxy1002 is OK: OK check_failover servers up 2 down 0 [10:47:34] (03PS10) 10MaxSem: OSM replication for maps [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) [10:47:36] akosiaris, I did all I could ^^^ :P [10:47:55] mmm, 1004 was expected, 1002 was not [10:48:46] (03CR) 10jenkins-bot: [V: 04-1] OSM replication for maps [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) (owner: 10MaxSem) [10:49:08] <3 jerkins:P [10:50:50] (03PS11) 10MaxSem: OSM replication for maps [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) [10:51:09] strange, dbproxy1002 is showing both up [10:51:41] I think it is badly configured [10:53:01] RECOVERY - haproxy failover on dbproxy1004 is OK: OK check_failover servers up 2 down 0 [10:54:18] (03PS1) 10Muehlenhoff: Stop opendj on the former labs LDAP servers [puppet] -
10https://gerrit.wikimedia.org/r/259226 [10:56:13] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are runnning. [10:57:26] (03PS1) 10DCausse: Specify latest schema for CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/259227 (https://phabricator.wikimedia.org/T121483) [10:58:26] (03PS5) 10Alexandros Kosiaris: monitoring: Use nagios_common for contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/259004 [10:58:41] MaxSem: ok, I'll review it later today [10:58:57] thanks! [10:59:22] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: puppet fail [10:59:27] (03PS2) 10DCausse: Specify latest schema for CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/259227 (https://phabricator.wikimedia.org/T121483) [11:01:15] (03PS6) 10Alexandros Kosiaris: monitoring: Use nagios_common for contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/259004 [11:01:19] !log reloading haproxy on dbproxy1002 [11:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:02:03] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "Puppet compiler declared it almost a noop (http://puppet-compiler.wmflabs.org/1489/neon.wikimedia.org/), merging" [puppet] - 10https://gerrit.wikimedia.org/r/259004 (owner: 10Alexandros Kosiaris) [11:02:03] _joe_, it had stale config that luckily didn't cause any issues (it was using the wrong failover host, db1046, instead of db2011) [11:02:22] <_joe_> jynus: how can that happen?
[11:02:36] puppet was changed, but haproxy was not refreshed [11:02:46] (I assume that) [11:03:10] so puppet has to set a refresh on haproxy [11:03:24] or if it was disabled deliberately, it has to be done manually [11:03:36] I do not see a reason to do it manually [11:04:24] will look at it later, I am still on downtime for eventlogging, checking everything else [11:07:29] 6operations, 7Mail: Mails from MediaWiki seem to get (partially) lost - https://phabricator.wikimedia.org/T121105#1880419 (10Lydia_Pintscher) 5Invalid>3Open I am still not getting emails for some edits. This one for example should have triggered an email to me but I did not receive it: https://www.wikidata... [11:10:23] (03PS2) 10Yuvipanda: redis: upstart should track PID after one fork [puppet] - 10https://gerrit.wikimedia.org/r/258972 (https://phabricator.wikimedia.org/T121396) (owner: 10Hashar) [11:10:34] (03CR) 10Yuvipanda: [C: 032 V: 032] "Was killing Quarry" [puppet] - 10https://gerrit.wikimedia.org/r/258972 (https://phabricator.wikimedia.org/T121396) (owner: 10Hashar) [11:14:52] _joe_ akosiaris I'd like to talk about https://gerrit.wikimedia.org/r/#/c/258473/ if you guys have some time, not necessarily now [11:15:28] 6operations, 10hardware-requests: EQIAD/CODFW: 2 hardware access request for monitoring - https://phabricator.wikimedia.org/T120842#1880432 (10akosiaris) a:5akosiaris>3mark Hello, The barely in warranty or expired boxes are not sufficient enough. So in CODFW, let's go for one of the new boxes. In EQIAD... [11:21:54] !log rebooting lvs3004 for kernel upgrade [11:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:25:40] !log rebooting lvs4003/lvs4004 for kernel upgrade [11:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:26:17] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [11:27:31] what's up with graphoid?
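The dbproxy1002 diagnosis above ("puppet was changed, but haproxy was not refreshed") can be spotted mechanically: if the config file on disk is newer than the running daemon started, and no reload has happened since, the process is serving stale config. A Linux-only sketch with illustrative names, not the WMF check; it approximates process start time from /proc and deliberately errs toward reporting "stale" (a SIGHUP-style reload keeps the old start time):

```python
import os
import time

CLK_TCK = os.sysconf("SC_CLK_TCK")  # kernel clock ticks per second

def process_start_time(pid):
    """Unix timestamp at which `pid` started (Linux /proc only)."""
    with open("/proc/uptime") as f:
        uptime = float(f.read().split()[0])
    boot_time = time.time() - uptime
    with open(f"/proc/{pid}/stat") as f:
        # Split after the parenthesised command name so spaces in it
        # don't shift the fields; field 22 overall (index 19 after the
        # comm field) is the start time in clock ticks since boot.
        fields = f.read().rsplit(")", 1)[1].split()
    return boot_time + float(fields[19]) / CLK_TCK

def runs_stale_config(pid, config_path):
    """Heuristic for 'config was rewritten but the daemon was never
    refreshed': the config file is newer than the running process."""
    return os.stat(config_path).st_mtime > process_start_time(pid)
```

In Puppet terms the fix discussed above is the usual one: have the config `file` resource notify the haproxy `service` so a reload is triggered automatically on change.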
[11:27:52] icinga is like a christmas tree [11:28:13] <_joe_> paravoid: I am going to look at it shortly, it seems like the spec is wrong [11:30:36] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Puppet last ran 13 days ago [11:32:38] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:33:36] !log force-rebooting stat1002, kernel borked because of fuse [11:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:34:48] PROBLEM - Host stat1002 is DOWN: PING CRITICAL - Packet loss = 100% [11:35:49] _joe_, as far as I can see haproxy is not installed or handled in any way by puppet, except its configuration [11:36:56] RECOVERY - dhclient process on stat1002 is OK: PROCS OK: 0 processes with command name dhclient [11:37:07] RECOVERY - Host stat1002 is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms [11:37:08] RECOVERY - salt-minion processes on stat1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:39:08] RECOVERY - HHVM rendering on mw1093 is OK: HTTP OK: HTTP/1.1 200 OK - 65428 bytes in 1.125 second response time [11:40:27] RECOVERY - Apache HTTP on mw1093 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.037 second response time [11:41:17] PROBLEM - puppet last run on mw1122 is CRITICAL: CRITICAL: Puppet last ran 3 days ago [11:41:28] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:41:33] <_joe_> godog: akosiaris I'd say now if you want [11:41:46] <_joe_> want/can [11:43:17] RECOVERY - puppet last run on mw1122 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [11:46:03] !log restarting and upgrading dbstore2002 [11:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:46:49] _joe_: works for me! 
[12:00:04] kart_ akosiaris mobrovac: Dear anthropoid, the time has come. Please deploy Content Translation server service-runner migration (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151215T1200). [12:00:04] kart_: A patch you scheduled for Content Translation server service-runner migration is about to be deployed. Please be available during the process. [12:01:14] 6operations, 10hardware-requests: EQIAD/CODFW: 2 hardware access request for monitoring - https://phabricator.wikimedia.org/T120842#1880481 (10mark) a:5mark>3None Approved. [12:05:16] (03CR) 10Alexandros Kosiaris: "mostly lgtm, one inline question" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/259225 (owner: 10Giuseppe Lavagetto) [12:20:07] 6operations, 10hardware-requests: EQIAD/CODFW: 2 hardware access request for monitoring - https://phabricator.wikimedia.org/T120842#1880545 (10akosiaris) a:3RobH thanks @mark! [12:23:32] (03PS1) 10Jcrespo: Enabling ssl on dbstores, disabling peformance_schema for now [puppet] - 10https://gerrit.wikimedia.org/r/259233 [12:24:25] (03CR) 10jenkins-bot: [V: 04-1] Enabling ssl on dbstores, disabling peformance_schema for now [puppet] - 10https://gerrit.wikimedia.org/r/259233 (owner: 10Jcrespo) [12:26:20] (03PS2) 10Jcrespo: Enabling ssl on dbstores, disabling peformance_schema for now [puppet] - 10https://gerrit.wikimedia.org/r/259233 [12:26:48] (03PS3) 10Jcrespo: Enabling ssl on dbstores, disabling performance_schema for now [puppet] - 10https://gerrit.wikimedia.org/r/259233 [12:32:08] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy [12:32:58] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy [12:33:57] RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy [12:35:36] RECOVERY - Restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [12:36:12] ok, graphoid problem fixed in staging ^^^ [12:36:16] proceeding to prod ... 
[12:36:47] RECOVERY - Restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [12:37:06] RECOVERY - Restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [12:39:21] !log restbase deploy start of 844a41d [12:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:40:08] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy [12:40:23] * mobrovac 1 : 0 failures :) [12:41:46] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [12:42:07] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [12:42:07] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [12:42:08] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [12:43:47] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [12:44:17] RECOVERY - Restbase endpoints health on restbase1006 is OK: All endpoints are healthy [12:44:26] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy [12:44:57] RECOVERY - Restbase endpoints health on restbase1009 is OK: All endpoints are healthy [12:45:47] RECOVERY - Restbase endpoints health on restbase2001 is OK: All endpoints are healthy [12:46:37] RECOVERY - Restbase endpoints health on restbase2002 is OK: All endpoints are healthy [12:46:37] RECOVERY - Restbase endpoints health on restbase2005 is OK: All endpoints are healthy [12:46:38] RECOVERY - Restbase endpoints health on restbase2004 is OK: All endpoints are healthy [12:46:38] RECOVERY - Restbase endpoints health on restbase2003 is OK: All endpoints are healthy [12:48:00] (03CR) 10Faidon Liambotis: "I don't particularly get the broker/config split. The variable lookups for config variables from the broker class look particularly ugly." 
[puppet] - 10https://gerrit.wikimedia.org/r/258220 (https://phabricator.wikimedia.org/T120957) (owner: 10Ottomata) [12:48:46] RECOVERY - Restbase endpoints health on restbase2006 is OK: All endpoints are healthy [12:48:50] !log restbase deploy end of 844a41d [12:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:54:19] (03CR) 10Alexandros Kosiaris: [C: 04-1] "overall looks good, comments inline" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/257898 (https://phabricator.wikimedia.org/T118401) (owner: 10Mobrovac) [12:57:02] (03PS5) 10Alexandros Kosiaris: Add shinken module/roles [puppet] - 10https://gerrit.wikimedia.org/r/259008 [12:58:35] (03PS1) 10Faidon Liambotis: nagios_common: kill all the love [puppet] - 10https://gerrit.wikimedia.org/r/259237 [13:01:42] (03CR) 10Alexandros Kosiaris: [C: 031] "Yes please!!!" [puppet] - 10https://gerrit.wikimedia.org/r/259237 (owner: 10Faidon Liambotis) [13:03:19] (03CR) 10Mobrovac: RESTBase: Switch to service::node (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/257898 (https://phabricator.wikimedia.org/T118401) (owner: 10Mobrovac) [13:10:19] (03CR) 10Jcrespo: [C: 031] nagios_common: kill all the love [puppet] - 10https://gerrit.wikimedia.org/r/259237 (owner: 10Faidon Liambotis) [13:12:54] (03CR) 10Jcrespo: [C: 032] Enabling ssl on dbstores, disabling performance_schema for now [puppet] - 10https://gerrit.wikimedia.org/r/259233 (owner: 10Jcrespo) [13:28:30] (03CR) 10Filippo Giunchedi: [C: 031] nagios_common: kill all the love [puppet] - 10https://gerrit.wikimedia.org/r/259237 (owner: 10Faidon Liambotis) [13:29:04] (03PS2) 10Faidon Liambotis: nagios_common: kill all the love [puppet] - 10https://gerrit.wikimedia.org/r/259237 [13:29:10] this team is really no fun at all is it :P [13:29:18] (03CR) 10Faidon Liambotis: [C: 032] nagios_common: kill all the love [puppet] - 10https://gerrit.wikimedia.org/r/259237 (owner: 10Faidon Liambotis) [13:30:44] ffff [13:30:46] why 
:( [13:30:51] (03CR) 10Filippo Giunchedi: "LGTM overall, some comments" (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/259008 (owner: 10Alexandros Kosiaris) [13:30:59] can't comment 🍩 in Gerrit Caused by: com.mysql.jdbc.MysqlDataTruncation: Data truncation: Incorrect string value: '\xF0\x9F\x8D\xA9' for column 'message' at row 1 [13:31:46] paravoid: slightly snarky to get pages with love, OTOH also relieving [13:32:14] (03PS1) 10Jcrespo: Repool db1018, depool db1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259239 [13:35:20] (03PS1) 10BBlack: apply ipsec associations to kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/259240 (https://phabricator.wikimedia.org/T92602) [13:36:27] (03CR) 10jenkins-bot: [V: 04-1] apply ipsec associations to kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/259240 (https://phabricator.wikimedia.org/T92602) (owner: 10BBlack) [13:36:56] (03CR) 10Andrew Bogott: "Could we implement this flag lower down, in the icinga classes? 
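The `MysqlDataTruncation` above is the standard 4-byte-character failure: MySQL's legacy `utf8` charset (utf8mb3) stores at most three bytes per character, while the doughnut emoji needs four. A quick demonstration of the exact byte sequence from the stack trace (illustration only; Gerrit's schema is not shown here):

```python
# U+1F369 DOUGHNUT, the character from the Gerrit stack trace above.
donut = "\U0001F369"
encoded = donut.encode("utf-8")

# MySQL's legacy 'utf8' (utf8mb3) stores at most 3 bytes per character,
# so this 4-byte sequence is exactly the rejected
# "Incorrect string value: '\xF0\x9F\x8D\xA9'".
print(encoded)       # b'\xf0\x9f\x8d\xa9'
print(len(encoded))  # 4
```

On the MySQL side the usual remedy is converting the affected column or table to the `utf8mb4` character set, which accepts the full 4-byte range.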
So that setting that flag is_test_machine in hiera works to override the p" [puppet] - 10https://gerrit.wikimedia.org/r/259073 (https://phabricator.wikimedia.org/T120047) (owner: 10Dzahn) [13:37:12] !log bumping composer on CI to 1.0.0-alpha11 https://gerrit.wikimedia.org/r/#/c/258933/ [13:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:37:32] (03PS2) 10BBlack: apply ipsec associations to kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/259240 (https://phabricator.wikimedia.org/T92602) [13:37:38] (03PS2) 10Jcrespo: Repool db1018, depool db1027 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259239 [13:38:30] (03CR) 10jenkins-bot: [V: 04-1] apply ipsec associations to kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/259240 (https://phabricator.wikimedia.org/T92602) (owner: 10BBlack) [13:38:36] (03CR) 10Andrew Bogott: [C: 031] "Let's put the roll-out on the deployment calendar so no one freaks out if there are hiccups" [puppet] - 10https://gerrit.wikimedia.org/r/259055 (owner: 10RobH) [13:39:00] (03CR) 10Jcrespo: [C: 032] Repool db1018, depool db1027 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259239 (owner: 10Jcrespo) [13:39:44] (03CR) 10Hashar: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259239 (owner: 10Jcrespo) [13:40:40] (03PS2) 10Andrew Bogott: base::labs: rename class with dash character [puppet] - 10https://gerrit.wikimedia.org/r/258055 (owner: 10Dzahn) [13:41:07] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1018 after maintenance; depool db1027 for maintenance (duration: 00m 29s) [13:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:43:17] PROBLEM - HHVM rendering on mw1072 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:43:27] (03PS3) 10BBlack: apply ipsec associations to kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/259240 (https://phabricator.wikimedia.org/T92602) 
[13:45:07] RECOVERY - HHVM rendering on mw1072 is OK: HTTP OK: HTTP/1.1 200 OK - 65823 bytes in 0.129 second response time [13:45:18] (03CR) 10Andrew Bogott: [C: 032] base::labs: rename class with dash character [puppet] - 10https://gerrit.wikimedia.org/r/258055 (owner: 10Dzahn) [13:45:45] !log reverted composer upgrade on CI with https://gerrit.wikimedia.org/r/#/c/259241/ [13:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:46:27] PROBLEM - puppet last run on mw1146 is CRITICAL: CRITICAL: Puppet has 45 failures [13:49:05] (03CR) 10Alexandros Kosiaris: RESTBase: Switch to service::node (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/257898 (https://phabricator.wikimedia.org/T118401) (owner: 10Mobrovac) [13:53:48] !log disabling puppet on neon to avoid race-condition ipsec alert spam [13:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:54:11] (03PS4) 10BBlack: apply ipsec associations to kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/259240 (https://phabricator.wikimedia.org/T92602) [13:54:35] (03CR) 10BBlack: [C: 032 V: 032] apply ipsec associations to kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/259240 (https://phabricator.wikimedia.org/T92602) (owner: 10BBlack) [13:55:54] ugh [13:56:19] ottomata: is there some reason we didn't put the ipv6 DNS for kafka10xx in? [13:57:12] (03CR) 10Mobrovac: RESTBase: Switch to service::node (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/257898 (https://phabricator.wikimedia.org/T118401) (owner: 10Mobrovac) [14:02:15] 6operations, 10Salt: salt minions need 'wake up' test.ping after idle period before they respond properly to commands - https://phabricator.wikimedia.org/T120831#1880676 (10mark) >>! In T120831#1862403, @ArielGlenn wrote: > I'll be applying a patch for that of 3 whole lines, and then doing more testing/checks... 
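For readers not fluent in regex, the DNS change title above, `add ipv6 DNS for kafka10(1[2348]|20)`, uses a pattern that expands to exactly five broker hostnames. A quick expansion check (illustrative Python; the candidate host list is made up purely to exercise the pattern):

```python
import re

# The host pattern from the DNS commit message above.
pattern = re.compile(r"^kafka10(1[2348]|20)$")

# Candidate names kafka1010 .. kafka1024; only five should match.
hosts = [f"kafka10{n}" for n in range(10, 25)]
matched = [h for h in hosts if pattern.match(h)]
print(matched)  # ['kafka1012', 'kafka1013', 'kafka1014', 'kafka1018', 'kafka1020']
```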
[14:02:37] (03CR) 10Faidon Liambotis: [C: 04-1] "First pass, see inline." (0332 comments) [puppet] - 10https://gerrit.wikimedia.org/r/259008 (owner: 10Alexandros Kosiaris) [14:03:01] (03CR) 10Alexandros Kosiaris: Add shinken module/roles (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/259008 (owner: 10Alexandros Kosiaris) [14:05:19] akosiaris: for a moment there I thought you responded to 1/3rd of my comments in a single minute [14:05:58] (03PS1) 10BBlack: add ipv6 DNS for kafka10(1[2348]|20) [dns] - 10https://gerrit.wikimedia.org/r/259245 [14:07:59] (03CR) 10BBlack: [C: 032] add ipv6 DNS for kafka10(1[2348]|20) [dns] - 10https://gerrit.wikimedia.org/r/259245 (owner: 10BBlack) [14:26:58] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 8 connecting: kafka1012_v4,kafka1012_v6,kafka1013_v4,kafka1013_v6,kafka1014_v4,kafka1014_v6,kafka1018_v4,kafka1018_v6,kafka1020_v4,kafka1020_v6,kafka1022_v4,kafka1022_v6 [14:28:59] ^ that's me [14:31:00] (03PS1) 10ArielGlenn: add yubi neo ssh key for ariel [puppet] - 10https://gerrit.wikimedia.org/r/259249 [14:32:25] (03CR) 10ArielGlenn: [C: 032] add yubi neo ssh key for ariel [puppet] - 10https://gerrit.wikimedia.org/r/259249 (owner: 10ArielGlenn) [14:33:03] (03PS3) 10Filippo Giunchedi: ganglia: add ganglia::cluster exported resource [puppet] - 10https://gerrit.wikimedia.org/r/258473 (https://phabricator.wikimedia.org/T119520) [14:34:32] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "spoke with Joe, Alex and Faidon, even though we've agreed it is an hack it gets things going with experiment having ganglia clusters in gr" [puppet] - 10https://gerrit.wikimedia.org/r/258473 (https://phabricator.wikimedia.org/T119520) (owner: 10Filippo Giunchedi) [14:41:17] !log restarting mysql on db1027 to apply new configuration [14:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:47:52] (03CR) 10Ottomata: "The broker config needs to be looked up from any location, so it can't be 
scoped to only a certain site or role or node. E.g., a node in " [puppet] - 10https://gerrit.wikimedia.org/r/258220 (https://phabricator.wikimedia.org/T120957) (owner: 10Ottomata) [14:50:21] (03CR) 10Ottomata: "The ::config class allows clients to include a puppet class and have access to variables that otherwise would have to be generated over an" [puppet] - 10https://gerrit.wikimedia.org/r/258220 (https://phabricator.wikimedia.org/T120957) (owner: 10Ottomata) [14:52:03] (03CR) 10Ottomata: "Makes sense. Although, perhaps we should make looking up the latest schema from the deployed schema repository a function our file based " [puppet] - 10https://gerrit.wikimedia.org/r/259227 (https://phabricator.wikimedia.org/T121483) (owner: 10DCausse) [14:56:26] (03CR) 10DCausse: "Yes, this is still a workaround for our "in-jar" schema resolver, it's not straightforward to scan a jar at runtime so not sure what's the" [puppet] - 10https://gerrit.wikimedia.org/r/259227 (https://phabricator.wikimedia.org/T121483) (owner: 10DCausse) [14:57:44] (03CR) 10Alexandros Kosiaris: Add shinken module/roles (0324 comments) [puppet] - 10https://gerrit.wikimedia.org/r/259008 (owner: 10Alexandros Kosiaris) [14:58:30] (03PS6) 10Alexandros Kosiaris: Add shinken module/roles [puppet] - 10https://gerrit.wikimedia.org/r/259008 [14:58:39] paravoid: now I 've responded to your comments [14:58:48] definitely way more than a minute ;-) [14:59:02] (03PS3) 10Ottomata: Specify latest schema for CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/259227 (https://phabricator.wikimedia.org/T121483) (owner: 10DCausse) [14:59:19] (03CR) 10Ottomata: [C: 032 V: 032] Specify latest schema for CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/259227 (https://phabricator.wikimedia.org/T121483) (owner: 10DCausse) [15:00:28] akosiaris: not all of them? 
:) [15:00:43] (03CR) 10jenkins-bot: [V: 04-1] Add shinken module/roles [puppet] - 10https://gerrit.wikimedia.org/r/259008 (owner: 10Alexandros Kosiaris) [15:01:09] (03CR) 10Alexandros Kosiaris: "For what is worth, I already tried to deduplicate via the shinken::daemon and shinken::arbiter::daemon mechanism. That is probably why the" [puppet] - 10https://gerrit.wikimedia.org/r/259008 (owner: 10Alexandros Kosiaris) [15:01:15] paravoid: I think so [15:01:54] btw, for such complex patchsets I prefer it when new versions are not rebased [15:02:03] that way, I can just do patch set 5->6 diff via the gerrit web interface [15:03:27] paravoid: same here, but unfortunately I had to to rebase on top of the contactgroups monitoring (related) change and 2 damn bugs in rubocop [15:03:37] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10MediaWiki-extensions-CentralNotice, 10Traffic: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1880786 (10Ottomata) Yeah, it would have to be a special endpoint, and it'd likely would onl... [15:03:44] otherwise rubocop would vote -1 [15:05:18] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [15:05:37] 6operations: fuse-dfs problems on stat1002 - https://phabricator.wikimedia.org/T121492#1880791 (10Ottomata) Thanks @MoritzMuehlenhoff, did you do anything to fix? It seems to be ok now. 
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration#Fixing_HDFS_mount_at_.2Fmnt.2Fhdfs [15:09:35] (03PS1) 10Muehlenhoff: Set idle_timelimit for nslcd [puppet] - 10https://gerrit.wikimedia.org/r/259256 [15:10:16] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [15:10:26] (03CR) 10Faidon Liambotis: Add shinken module/roles (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/259008 (owner: 10Alexandros Kosiaris) [15:13:35] (03PS1) 10Jcrespo: Update configuration for s3 slaves on codfw (+style changes) [puppet] - 10https://gerrit.wikimedia.org/r/259257 [15:14:31] 6operations: fuse-dfs problems on stat1002 - https://phabricator.wikimedia.org/T121492#1880802 (10MoritzMuehlenhoff) I fixed the mount manually, see SAL: 09:32 moritzm: umounted/remounted hdfs mount on stat1002 (got stuck due to kernel bug, see T121492) But the load was still excessive later on and Faidon ev... [15:15:16] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 168 seconds ago with 0 failures [15:15:58] (03CR) 10Jcrespo: [C: 032] Update configuration for s3 slaves on codfw (+style changes) [puppet] - 10https://gerrit.wikimedia.org/r/259257 (owner: 10Jcrespo) [15:16:07] (03CR) 10Alexandros Kosiaris: Add shinken module/roles (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/259008 (owner: 10Alexandros Kosiaris) [15:18:00] (03PS7) 10Alexandros Kosiaris: Add shinken module/roles [puppet] - 10https://gerrit.wikimedia.org/r/259008 [15:18:43] !log restarting db2018 for upgrade and configuration change [15:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:19:54] (03CR) 10Andrew Bogott: [C: 04-1] Set idle_timelimit for nslcd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/259256 (owner: 10Muehlenhoff) [15:20:16] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [15:20:49] (03CR) 10Alexandros Kosiaris: [C: 031] RESTBase:
Switch to service::node (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/257898 (https://phabricator.wikimedia.org/T118401) (owner: 10Mobrovac) [15:20:55] 6operations: fuse-dfs problems on stat1002 - https://phabricator.wikimedia.org/T121492#1880810 (10Ottomata) p:5Triage>3Normal a:3Ottomata [15:22:46] PROBLEM - Varnishkafka Delivery Errors per minute on cp4011 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [20000.0] [15:23:04] (03CR) 10Alexandros Kosiaris: [C: 031] Set idle_timelimit for nslcd [puppet] - 10https://gerrit.wikimedia.org/r/259256 (owner: 10Muehlenhoff) [15:23:18] akosiaris: \o/ :P [15:24:24] PROBLEM - MariaDB Slave IO: s3 on db2057 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2018.codfw.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on db2018.codfw.wmnet (111 Connection refused) [15:25:04] <_joe_> mobrovac: I'll get to reviewing your change shortly :) [15:25:07] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 220 seconds ago with 0 failures [15:25:28] 6operations, 10hardware-requests: Decommission and remove from racks out of warranty spares - https://phabricator.wikimedia.org/T121007#1880817 (10mark) a:5mark>3Cmjohnson Yes, go ahead. [15:26:30] grazie _joe_ ! 
[15:26:36] (03PS1) 10Bartosz Dziewoński: Enable cross-wiki upload A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259258 (https://phabricator.wikimedia.org/T120867) [15:28:55] (03PS1) 10Ottomata: Parameterize $log_max_backup_index in kafka log4j.properties [puppet/kafka] - 10https://gerrit.wikimedia.org/r/259259 [15:29:18] (03CR) 10Giuseppe Lavagetto: puppet: extract the common parts of operations/puppet git clone (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/259225 (owner: 10Giuseppe Lavagetto) [15:29:23] (03CR) 10jenkins-bot: [V: 04-1] Parameterize $log_max_backup_index in kafka log4j.properties [puppet/kafka] - 10https://gerrit.wikimedia.org/r/259259 (owner: 10Ottomata) [15:29:54] (03PS2) 10Giuseppe Lavagetto: puppet: extract the common parts of operations/puppet git clone [puppet] - 10https://gerrit.wikimedia.org/r/259225 [15:29:58] (03PS2) 10Ottomata: Parameterize $log_max_backup_index in kafka log4j.properties [puppet/kafka] - 10https://gerrit.wikimedia.org/r/259259 [15:31:28] (03PS3) 10Giuseppe Lavagetto: puppet: extract the common parts of operations/puppet git clone [puppet] - 10https://gerrit.wikimedia.org/r/259225 [15:31:41] <_joe_> akosiaris: ^^ should be ok now [15:32:03] (03CR) 10Ottomata: [C: 032] Parameterize $log_max_backup_index in kafka log4j.properties [puppet/kafka] - 10https://gerrit.wikimedia.org/r/259259 (owner: 10Ottomata) [15:32:56] RECOVERY - Varnishkafka Delivery Errors per minute on cp4011 is OK: OK: Less than 80.00% above the threshold [0.0] [15:33:39] (03PS1) 10Ottomata: Update kafka module with $log_max_backup_index change [puppet] - 10https://gerrit.wikimedia.org/r/259260 [15:34:34] RECOVERY - MariaDB Slave IO: s3 on db2057 is OK: OK slave_io_state Slave_IO_Running: Yes [15:35:07] (03CR) 10Ottomata: [C: 032] Update kafka module with $log_max_backup_index change [puppet] - 10https://gerrit.wikimedia.org/r/259260 (owner: 10Ottomata) [15:38:42] (03CR) 10Muehlenhoff: Set idle_timelimit for nslcd (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/259256 (owner: 10Muehlenhoff) [15:41:22] (03PS1) 10BBlack: ipsec-for-kafka: limit to kafka1012 + cp4011 for testing [puppet] - 10https://gerrit.wikimedia.org/r/259263 (https://phabricator.wikimedia.org/T92602) [15:42:06] (03PS2) 10BBlack: ipsec-for-kafka: limit to kafka1012 + cp4011 for testing [puppet] - 10https://gerrit.wikimedia.org/r/259263 (https://phabricator.wikimedia.org/T92602) [15:42:58] (03CR) 10jenkins-bot: [V: 04-1] ipsec-for-kafka: limit to kafka1012 + cp4011 for testing [puppet] - 10https://gerrit.wikimedia.org/r/259263 (https://phabricator.wikimedia.org/T92602) (owner: 10BBlack) [15:43:08] ugh fuck submodules [15:44:51] dawww [15:46:17] they're such a pain in the ass when submodule updates randomly appear in your pulls while rebasing/merging/etc [15:46:24] and then get sucked into commits [15:46:43] (03PS3) 10BBlack: ipsec-for-kafka: limit to kafka1012 + cp4011 for testing [puppet] - 10https://gerrit.wikimedia.org/r/259263 (https://phabricator.wikimedia.org/T92602) [15:46:53] (03PS1) 10Jcrespo: Depool db1042 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259266 [15:47:56] (03CR) 10BBlack: [C: 032] ipsec-for-kafka: limit to kafka1012 + cp4011 for testing [puppet] - 10https://gerrit.wikimedia.org/r/259263 (https://phabricator.wikimedia.org/T92602) (owner: 10BBlack) [15:48:44] (03PS2) 10Bartosz Dziewoński: Enable cross-wiki upload A/B test on English-language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259258 (https://phabricator.wikimedia.org/T120867) [15:49:12] (03CR) 10Jcrespo: [C: 032] Depool db1042 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259266 (owner: 10Jcrespo) [15:50:15] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1042 for maintenance (duration: 00m 29s) [15:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:50:41] (03PS2) 10Muehlenhoff: Set idle_timelimit for nslcd 
[puppet] - 10https://gerrit.wikimedia.org/r/259256 [15:51:15] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1880860 (10EBernhardson) Data would move in both directions. The two linked tickets are about shipping a page populari... [15:51:26] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 8 ESP OK [15:51:33] (03CR) 10Andrew Bogott: [C: 031] "thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/259256 (owner: 10Muehlenhoff) [15:52:10] (03PS3) 10Bartosz Dziewoński: Enable cross-wiki upload A/B test on English-language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259258 (https://phabricator.wikimedia.org/T120867) [15:52:41] (03PS1) 10BBlack: post-merge fixup for 8b9dfe360 [puppet] - 10https://gerrit.wikimedia.org/r/259268 [15:52:51] (03CR) 10Giuseppe Lavagetto: "seems ok in general, some minor nits basically." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/257898 (https://phabricator.wikimedia.org/T118401) (owner: 10Mobrovac) [15:52:54] (03CR) 10BBlack: [C: 032 V: 032] post-merge fixup for 8b9dfe360 [puppet] - 10https://gerrit.wikimedia.org/r/259268 (owner: 10BBlack) [15:58:43] (03PS1) 10Jcrespo: Reconfigure db1042 and all s4 codfw mysqls [puppet] - 10https://gerrit.wikimedia.org/r/259269 [15:58:45] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 9 connecting: (unnamed),kafka1012_v4 [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151215T1600). Please do the needful. [16:00:04] MatmaRex James_F jgirault jan_drewniak: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:14] * James_F waves. [16:00:34] I can SWAT. MatmaRex ping. 
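The submodule grumbling above (pointer bumps "randomly appearing" in pulls and getting sucked into commits) has a partial mitigation in plain git configuration. This is a sketch of generic git options that surface, or quietly ignore, submodule pointer changes; it is not a Wikimedia-specific recipe:

```shell
# Show a summary of submodule pointer changes in `git status`,
# so an accidental bump is visible before it lands in a commit.
git config status.submoduleSummary true

# Ignore submodules that merely have untracked/modified files in their
# work tree; changes to the recorded submodule commit are still shown.
git config diff.ignoreSubmodules dirty

# Refuse to push a superproject commit whose submodule pointer
# references a commit that has not been pushed to the submodule remote.
git config push.recurseSubmodules check
```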
[16:01:28] yeah, i'm here [16:02:35] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259258 (https://phabricator.wikimedia.org/T120867) (owner: 10Bartosz Dziewoński) [16:03:06] (03Merged) 10jenkins-bot: Enable cross-wiki upload A/B test on English-language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259258 (https://phabricator.wikimedia.org/T120867) (owner: 10Bartosz Dziewoński) [16:03:50] ACKNOWLEDGEMENT - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 9 connecting: (unnamed),kafka1012_v4 Brandon Black still testing, known broken... [16:03:50] ACKNOWLEDGEMENT - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 1 connecting: cp4011_v4 Brandon Black still testing, known broken... [16:03:55] I'm here in case MatmaRex disappears again. ;) [16:04:04] heh [16:04:50] jynus: looks like I pulled down a change of yours, too: Depool db1042 for maintenance [16:05:23] ? [16:05:46] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 3 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [16:05:49] it was not +1/+2ed [16:06:19] or was it? [16:06:20] 15:49 < grrrit-wm> (CR) Jcrespo: [C: 2] Depool db1042 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/259266 (owner: Jcrespo) [16:06:22] that's weird. It came down as part of git-fetch in mw-config [16:06:40] it seems it was [16:07:14] heh, do you want to sync it before I continue? Or revert? 
[16:07:14] sync, no problem [16:07:44] I cannot seem to remember what I am deploying [16:07:52] jynus: kk, I can do that :) [16:07:58] :D [16:08:20] I swear I do not remember +2ing it [16:08:56] !log thcipriani@tin Synchronized wmf-config/db-eqiad.php: SWATish: Depool db1042 for maintenance (duration: 00m 29s) [16:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:09:02] ^ jynus done [16:09:36] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 10 ESP OK [16:09:43] do not worry about monitoring it, even if I am not deploying I have https://logstash.wikimedia.org/#/dashboard/elasticsearch/wfLogDBError always open [16:09:46] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [16:09:50] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable cross-wiki upload A/B test on English-language wikis [[gerrit:259258]] (duration: 00m 29s) [16:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:10:00] ^ MatmaRex check if possible please [16:10:30] I know what it was, that came with a puppet patch, and I got both confused [16:10:41] thcipriani: it should be a no-op right now, the code that uses this is in wmf.9 [16:10:55] thcipriani: i'll verify it when someone deploys the train later today [16:11:03] MatmaRex: kk, sounds good, thanks. [16:11:22] (03PS1) 10Ottomata: Increase size of programname field in remote syslog template [puppet] - 10https://gerrit.wikimedia.org/r/259271 (https://phabricator.wikimedia.org/T120874) [16:11:27] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258409 (owner: 10Jforrester) [16:12:15] (03Merged) 10jenkins-bot: BetaFeatures: Update language and dates of 'retirement' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258409 (owner: 10Jforrester) [16:12:16] Yay for comment-only 'config' changes. 
;-) [16:12:25] heh [16:12:26] fyi bd808: https://gerrit.wikimedia.org/r/#/c/259271/ [16:12:29] (03CR) 10jenkins-bot: [V: 04-1] Increase size of programname field in remote syslog template [puppet] - 10https://gerrit.wikimedia.org/r/259271 (https://phabricator.wikimedia.org/T120874) (owner: 10Ottomata) [16:12:35] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258403 (owner: 10Jforrester) [16:13:58] (03PS2) 10Ottomata: Increase size of programname field in remote syslog template [puppet] - 10https://gerrit.wikimedia.org/r/259271 (https://phabricator.wikimedia.org/T120874) [16:14:00] (03Merged) 10jenkins-bot: In VisualEditor on single edit tab wikis, set the default editor appropriately [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258403 (owner: 10Jforrester) [16:14:12] !log switch db2018's master from s3-master to db1027 [16:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:14:18] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: BetaFeatures: Update language and dates of "retirement" [[gerrit:258409]] (duration: 00m 29s) [16:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:14:56] James_F> Not sure if sarcastic or actually being serious [16:15:41] jynus: Semi-sarcastic. It's important to make sure our code and config documentation is up-to-date, but it's a bit of a chore to SWAT it. [16:16:35] James_F, I agree, that is why I will try to allow you to get rid of my commits [16:16:53] jynus: Your commits? [16:17:06] if you see my history on mediawiki-config, you will understand :-) [16:17:21] Oh, with the db mastery changes? Yeah. 
:-) [16:17:53] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: In VisualEditor on single edit tab wikis, set the default editor appropriately [[gerrit:258403]] (duration: 00m 28s) [16:17:55] ^ James_F check please [16:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:17:58] Woo. [16:18:29] thcipriani: Looks good. [16:18:35] James_F: cool, thanks. [16:18:57] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259039 (https://phabricator.wikimedia.org/T121421) (owner: 10Jforrester) [16:19:44] (03Merged) 10jenkins-bot: VisualEditor: Enable single edit tab mode on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259039 (https://phabricator.wikimedia.org/T121421) (owner: 10Jforrester) [16:20:01] thcipriani: This one won't take effect 'til wmf.9 rolls our. [16:20:06] Err. Rolls out. [16:20:08] James_F: kk [16:20:27] 259041 is the same. [16:21:25] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259041 (https://phabricator.wikimedia.org/T92661) (owner: 10Jforrester) [16:21:25] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: VisualEditor: Enable single edit tab mode on test2wiki [[gerrit:259039]] (duration: 00m 29s) [16:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:21:36] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 9 connecting: (unnamed),kafka1012_v4 [16:21:36] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 1 connecting: cp4011_v4 [16:22:12] (03Merged) 10jenkins-bot: VisualEditor: Centralise feedback from test2wiki to MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259041 (https://phabricator.wikimedia.org/T92661) (owner: 10Jforrester) [16:23:55] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: VisualEditor: Centralise feedback from test2wiki to 
MediaWiki.org [[gerrit:259041]] (duration: 00m 30s) [16:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:24:05] ^ James_F check if possible [16:24:43] * James_F checks. [16:24:50] jgirault: and/or jan_drewniak ping for SWAT [16:25:55] thcipriani: LGTM. [16:26:02] James_F: cool, thanks! [16:26:43] !log decommission restbase1004 [16:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:26:48] urandom gwicke mobrovac ^ [16:28:37] thcipriani: ahoy [16:28:46] jan_drewniak: hiya [16:29:07] ready for portals bump? Anything special needed here aside from a sync? [16:29:33] nothing special required for this deploy [16:29:33] (03CR) 10Jcrespo: [C: 032] Reconfigure db1042 and all s4 codfw mysqls [puppet] - 10https://gerrit.wikimedia.org/r/259269 (owner: 10Jcrespo) [16:29:41] (03PS1) 10Cmjohnson: Removing install references to cobalt and nickel bug: task# T121007 [puppet] - 10https://gerrit.wikimedia.org/r/259275 [16:29:49] godog: thanks! 
[16:29:50] kk [16:30:47] !log restarting and reconfiguring mysql on db1042 [16:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:31:21] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259176 (owner: 10JGirault) [16:32:11] (03Merged) 10jenkins-bot: Bump portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259176 (owner: 10JGirault) [16:32:22] (03PS2) 10Cmjohnson: Removing install references to cobalt and nickel bug: task# T121007 [puppet] - 10https://gerrit.wikimedia.org/r/259275 [16:32:24] nice [16:34:31] !log thcipriani@tin Synchronized portals: SWAT: Bump portals to master (duration: 00m 29s) [16:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:34:36] ^ jgirault jan_drewniak check please [16:34:54] (03PS1) 10Cmjohnson: Removing remaining dns entries for cobalt and nickel [dns] - 10https://gerrit.wikimedia.org/r/259278 [16:35:07] (03CR) 10Cmjohnson: [C: 032] Removing install references to cobalt and nickel bug: task# T121007 [puppet] - 10https://gerrit.wikimedia.org/r/259275 (owner: 10Cmjohnson) [16:35:49] thcipriani: looks all good :) [16:36:00] jgirault: nice, thanks for checking! 
[16:36:58] (03PS2) 10Cmjohnson: Removing remaining dns entries for cobalt and nickel [dns] - 10https://gerrit.wikimedia.org/r/259278 [16:37:48] (03CR) 10Cmjohnson: [C: 032] Removing remaining dns entries for cobalt and nickel [dns] - 10https://gerrit.wikimedia.org/r/259278 (owner: 10Cmjohnson) [16:39:56] PROBLEM - puppet last run on rdb2001 is CRITICAL: CRITICAL: puppet fail [16:41:56] PROBLEM - Apache HTTP on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:42:24] (03CR) 10Filippo Giunchedi: [C: 031] Add shinken module/roles [puppet] - 10https://gerrit.wikimedia.org/r/259008 (owner: 10Alexandros Kosiaris) [16:42:36] PROBLEM - HHVM rendering on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:44:06] PROBLEM - Disk space on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:44:07] PROBLEM - DPKG on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:44:15] PROBLEM - Check size of conntrack table on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:44:16] PROBLEM - dhclient process on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:44:26] PROBLEM - nutcracker port on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:44:55] PROBLEM - salt-minion processes on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:44:55] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [16:44:56] PROBLEM - HHVM processes on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:45:17] PROBLEM - nutcracker process on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:45:26] PROBLEM - SSH on mw1146 is CRITICAL: Server answer [16:45:26] PROBLEM - RAID on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:45:47] PROBLEM - configured eth on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:46:06] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [16:46:26] again goddammit [16:47:00] (03PS3) 10Ottomata: Increase size of programname field in remote syslog template [puppet] - 10https://gerrit.wikimedia.org/r/259271 (https://phabricator.wikimedia.org/T120874) [16:47:36] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 2 ESP OK [16:47:37] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 10 ESP OK [16:48:06] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [16:52:16] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:55:13] (03CR) 10BryanDavis: [C: 031] "This looks like it should work. Tested an untruncated log line given in T120874 vs the "%{SYSLOGLINE}" Logstash grok pattern using https:/" [puppet] - 10https://gerrit.wikimedia.org/r/259271 (https://phabricator.wikimedia.org/T120874) (owner: 10Ottomata) [16:57:32] (03CR) 10Ottomata: "Yeah, but I think giving it a shorter name is not practical. 
'eventlogging' is the main name of the codebase that is running 'service' is" [puppet] - 10https://gerrit.wikimedia.org/r/259271 (https://phabricator.wikimedia.org/T120874) (owner: 10Ottomata) [16:58:16] RECOVERY - Disk space on mw1146 is OK: DISK OK [16:58:43] (03PS1) 10BBlack: Revert "ipsec-for-kafka: limit to kafka1012 + cp4011 for testing" + Revert "post-merge fixup for 8b9dfe360" [puppet] - 10https://gerrit.wikimedia.org/r/259280 (https://phabricator.wikimedia.org/T92602) [16:58:59] (03PS2) 10BBlack: Revert "ipsec-for-kafka: limit to kafka1012 + cp4011 for testing" [puppet] - 10https://gerrit.wikimedia.org/r/259280 (https://phabricator.wikimedia.org/T92602) [16:59:19] (03CR) 10BBlack: [C: 032 V: 032] Revert "ipsec-for-kafka: limit to kafka1012 + cp4011 for testing" [puppet] - 10https://gerrit.wikimedia.org/r/259280 (https://phabricator.wikimedia.org/T92602) (owner: 10BBlack) [16:59:47] (03PS4) 10Ottomata: Increase size of programname field in remote syslog template [puppet] - 10https://gerrit.wikimedia.org/r/259271 (https://phabricator.wikimedia.org/T120874) [17:00:03] !log restarting and reconfiguring mysql at db2018 [17:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:00:26] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:00:26] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:00:34] (03PS1) 10EBernhardson: Update to latest version of cirrus avro schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259281 [17:00:49] anyone mind if i sneak one last patch into swat? its a simple config change [17:00:56] thcipriani: ^ ? [17:01:00] i can deploy it [17:01:22] would have put it in earlier but i just stepped into office [17:01:42] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:01:42] ebernhardson: go for it. 
There aren't any deploys between now and Train. [17:01:52] (03CR) 10EBernhardson: [C: 032] Update to latest version of cirrus avro schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259281 (owner: 10EBernhardson) [17:02:33] (03Merged) 10jenkins-bot: Update to latest version of cirrus avro schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259281 (owner: 10EBernhardson) [17:02:40] RECOVERY - HHVM processes on mw1146 is OK: PROCS OK: 6 processes with command name hhvm [17:03:01] RECOVERY - Check size of conntrack table on mw1146 is OK: OK: nf_conntrack is 0 % full [17:03:02] RECOVERY - DPKG on mw1146 is OK: All packages OK [17:03:11] RECOVERY - dhclient process on mw1146 is OK: PROCS OK: 0 processes with command name dhclient [17:03:22] RECOVERY - nutcracker port on mw1146 is OK: TCP OK - 0.000 second response time on port 11212 [17:03:30] RECOVERY - RAID on mw1146 is OK: OK: no RAID installed [17:03:42] RECOVERY - SSH on mw1146 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [17:03:59] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Update cirrus avro schema to 111448028943 (duration: 00m 29s) [17:04:01] RECOVERY - configured eth on mw1146 is OK: OK - interfaces up [17:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:04:20] RECOVERY - nutcracker process on mw1146 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [17:04:51] RECOVERY - salt-minion processes on mw1146 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:05:51] RECOVERY - puppet last run on mw1146 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:06:20] RECOVERY - puppet last run on rdb2001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [17:07:12] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp2002_v4, cp2002_v6, cp2003_v4, cp2003_v6, 
cp2005_v4, cp2005_v6, cp2006_v4, cp2006_v6, cp2007_v4, cp2007_v6, cp2009_v4, cp2009_v6, cp2010_v4, cp2010_v6, cp2011_v4, cp2011_v6, cp2012_v4, cp2012_v6, cp2013_v4, cp2013_v6, cp2015_v4, cp2015_v6, cp2016_v4, cp2016_v6, cp2017_v4, cp2017_v6, cp2018_v4, cp2018_v6, cp2019_v4, cp2019_v6, cp2020_v4, cp2020_v6, cp202 [17:09:56] 6operations, 10RESTBase, 10RESTBase-Cassandra: Perform cleanups to reclaim space from recent topology changes - https://phabricator.wikimedia.org/T121535#1881029 (10Eevans) 3NEW a:3Eevans [17:11:12] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 166 ESP OK [17:11:16] 7Puppet, 6Analytics-Kanban, 5Patch-For-Review: Puppet support for multiple Dashiki instances running on one server [8 pts] - https://phabricator.wikimedia.org/T120891#1881039 (10Milimetric) [17:12:38] the last log should have been db2019, not 18 [17:20:41] PROBLEM - Disk space on restbase1004 is CRITICAL: DISK CRITICAL - free space: /var 104681 MB (3% inode=99%) [17:22:08] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, 5iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1881062 (10Milimetric) @Dzahn: I wasn't aware that it had changed from a VM to real hardware, where wa... 
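The Strongswan icinga output quoted above follows a simple shape: count the tunnels that are up, list the ones stuck "connecting"/"not-conn". A minimal sketch of that classification logic — this mimics the message format only and is not the actual plugin Wikimedia runs:

```python
# Illustrative sketch of the Strongswan check format seen above
# ("Strongswan CRITICAL - ok: 26 not-conn: cp2002_v4, ...").
# Not the real icinga plugin; tunnel states here are invented.
def classify_tunnels(states):
    """states maps tunnel name (e.g. 'cp2002_v4') to 'ok' or 'not-conn'."""
    ok = sorted(t for t, s in states.items() if s == "ok")
    down = sorted(t for t, s in states.items() if s != "ok")
    if down:
        return "Strongswan CRITICAL - ok: %d not-conn: %s" % (
            len(ok), ", ".join(down))
    return "Strongswan OK - %d ESP OK" % len(ok)

print(classify_tunnels({"cp2002_v4": "not-conn", "kafka1012_v4": "ok"}))
# -> Strongswan CRITICAL - ok: 1 not-conn: cp2002_v4
```

Any single tunnel that fails to establish flips the whole check to CRITICAL, which is why one kafka broker's ferm change made dozens of cp hosts alert at once.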
[17:25:33] 6operations, 10RESTBase, 10RESTBase-Cassandra: Perform cleanups to reclaim space from recent topology changes - https://phabricator.wikimedia.org/T121535#1881072 (10Eevans) [17:26:05] (03CR) 10DCausse: [WIP] Cron job to rebuild completion indices (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/258068 (https://phabricator.wikimedia.org/T120843) (owner: 10EBernhardson) [17:27:33] (03PS1) 10Jcrespo: Repool db1027, Depool db1049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259285 [17:29:45] !log beginning `nodetool cleanup' on restbase1002.eqiad (https://phabricator.wikimedia.org/T121535) [17:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:30:15] 6operations, 10hardware-requests: Hardware access request for yubico auth servers - https://phabricator.wikimedia.org/T118983#1881092 (10RobH) I've sent some out of band notices trying to push this system allocation approval along (spare in eqiad). Updates to follow. [17:30:21] !log aborted large compaction on restbase1004 with `nodetool stop -- COMPACTION` to free disk space [17:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:30:39] 6operations, 10OCG-General-or-Unknown, 6Scrum-of-Scrums, 6Services: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079#1881094 (10cscott) I'll take a look at this. My recollection is that I found sort of fundamental problems with the way servers are depooled (T1... [17:30:41] RECOVERY - Disk space on restbase1004 is OK: DISK OK [17:31:40] 6operations, 10ops-codfw: db2034 host crashed; mgmt interface unavailable (needs reset and hw check) - https://phabricator.wikimedia.org/T117858#1881103 (10Papaul) I found out that the same ssh problem on db2034 are on 8 others boxes . (db2035 to db2042) discussed this with Jynus on IRC he mentioned that he no... 
[17:32:03] 6operations, 10RESTBase, 10RESTBase-Cassandra: Perform cleanups to reclaim space from recent topology changes - https://phabricator.wikimedia.org/T121535#1881105 (10Eevans) [17:33:24] !log beginning `nodetool cleanup' on restbase1005.eqiad (https://phabricator.wikimedia.org/T121535) [17:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:34:01] (03PS5) 10Ottomata: Increase size of programname field in remote syslog template [puppet] - 10https://gerrit.wikimedia.org/r/259271 (https://phabricator.wikimedia.org/T120874) [17:34:26] 6operations, 10RESTBase, 10RESTBase-Cassandra: Perform cleanups to reclaim space from recent topology changes - https://phabricator.wikimedia.org/T121535#1881111 (10Eevans) [17:35:34] 6operations, 10ops-codfw: db2034 host crashed; mgmt interface unavailable (needs reset and hw check) - https://phabricator.wikimedia.org/T117858#1881129 (10jcrespo) Thank you, @Papaul! [17:36:23] I've been in your situation and debuging those kind of problems are not easy [17:36:35] jynus: yw [17:37:20] (03CR) 10Jcrespo: [C: 032] Repool db1027, Depool db1049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259285 (owner: 10Jcrespo) [17:37:31] PROBLEM - Varnishkafka Delivery Errors per minute on cp2008 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [20000.0] [17:39:14] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1027, Depool db1049 (duration: 00m 30s) [17:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:39:31] RECOVERY - Varnishkafka Delivery Errors per minute on cp2008 is OK: OK: Less than 80.00% above the threshold [0.0] [17:41:25] 6operations, 10hardware-requests: Hardware access request for yubico auth servers - https://phabricator.wikimedia.org/T118983#1881180 (10mark) I was pretty sure I had given approval before... Anyway: approved. 
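The Varnishkafka alert above reads "88.89% of data above the critical threshold [20000.0]": the check looks at a window of recent graphite datapoints and computes what fraction exceed the threshold. A minimal sketch of that percent-above computation, with invented sample values (the real check reads the graphite series):

```python
def percent_above(datapoints, threshold):
    # Fraction of recent datapoints exceeding the threshold, as a percentage.
    above = sum(1 for v in datapoints if v > threshold)
    return 100.0 * above / len(datapoints)

# Nine invented delivery-error samples, eight of them above 20000:
series = [45000, 30000, 25000, 21000, 50000, 33000, 41000, 60000, 15000]
print("%.2f%% of data above the critical threshold [20000.0]"
      % percent_above(series, 20000.0))
# -> 88.89% of data above the critical threshold [20000.0]
```

This explains why the alert flaps: a short burst of delivery errors keeps the window above the critical fraction for a couple of check cycles, then recovers once enough clean datapoints roll in.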
[17:41:46] (03PS1) 10BBlack: kafka::server: allow ipsec traffic [puppet] - 10https://gerrit.wikimedia.org/r/259290 [17:44:03] (03PS2) 10BBlack: kafka::server: allow ipsec traffic [puppet] - 10https://gerrit.wikimedia.org/r/259290 (https://phabricator.wikimedia.org/T92602) [17:44:42] (03CR) 10BBlack: [C: 032 V: 032] kafka::server: allow ipsec traffic [puppet] - 10https://gerrit.wikimedia.org/r/259290 (https://phabricator.wikimedia.org/T92602) (owner: 10BBlack) [17:45:29] (03CR) 10Andrew Bogott: "cleanup-pam-config leaves .orig backup files behind in the pam.d directory. This is likely to cause us problems. See bug https://phabric" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/255555 (https://phabricator.wikimedia.org/T85910) (owner: 10coren) [17:45:36] 6operations, 10RESTBase, 10RESTBase-Cassandra: Perform cleanups to reclaim space from recent topology changes - https://phabricator.wikimedia.org/T121535#1881223 (10Eevans) [17:45:40] (03PS4) 10Giuseppe Lavagetto: puppet: extract the common parts of operations/puppet git clone [puppet] - 10https://gerrit.wikimedia.org/r/259225 [17:47:11] PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: puppet fail [17:48:19] !log restart and reconfigure mysql at db1049 [17:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:48:29] !log beginning `nodetool cleanup' on restbase1003.eqiad (https://phabricator.wikimedia.org/T121535) [17:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:49:12] 6operations, 10RESTBase, 10RESTBase-Cassandra: Perform cleanups to reclaim space from recent topology changes - https://phabricator.wikimedia.org/T121535#1881286 (10Eevans) [17:49:32] PROBLEM - Varnishkafka Delivery Errors per minute on cp2008 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [20000.0] [17:50:13] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: kafka1012_v6 [17:50:41] PROBLEM 
- IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 37 connecting: kafka1022_v4 [17:50:51] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: kafka1014_v6,kafka1022_v6 [17:51:20] PROBLEM - IPsec on cp3009 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: kafka1022_v4 [17:51:40] RECOVERY - Varnishkafka Delivery Errors per minute on cp2008 is OK: OK: Less than 80.00% above the threshold [0.0] [17:51:51] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 37 connecting: kafka1013_v4 [17:52:04] 6operations, 10ops-codfw: db2034 host crashed; mgmt interface unavailable (needs reset and hw check) - https://phabricator.wikimedia.org/T117858#1881321 (10Dzahn) @Papaul hmm.. i guess then we should report it to Dell as broken drac [17:52:10] PROBLEM - IPsec on cp3015 is CRITICAL: Strongswan CRITICAL - ok: 19 connecting: kafka1013_v4 [17:52:10] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: kafka1013_v4,kafka1022_v6 [17:52:20] RECOVERY - IPsec on cp4018 is OK: Strongswan OK - 28 ESP OK [17:52:41] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 38 ESP OK [17:52:59] Dzahn: those are HP [17:53:02] PROBLEM - IPsec on cp3013 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: kafka1014_v6 [17:53:21] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: kafka1013_v4 [17:53:22] ignore the ipsec crap [17:53:51] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 38 ESP OK [17:54:11] RECOVERY - IPsec on cp3015 is OK: Strongswan OK - 20 ESP OK [17:54:11] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 28 ESP OK [17:54:21] 6operations, 10ops-codfw: db2034 host crashed; mgmt interface unavailable (needs reset and hw check) - https://phabricator.wikimedia.org/T117858#1881370 (10Dzahn) @Papaul cool! nice work. confirmed SSH works again. disregard my former comment, i had the tab open from earlier and not seen the latest updates. 
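The `nodetool cleanup` SAL entries above are issued one node at a time rather than cluster-wide. A toy sketch of generating those log lines in sequence — hosts and task URL are copied from the log, and nothing is actually executed remotely:

```python
# Toy sequencing of the per-node cleanup seen in the SAL entries above.
hosts = ["restbase1002.eqiad", "restbase1005.eqiad", "restbase1003.eqiad"]
task = "https://phabricator.wikimedia.org/T121535"

def cleanup_log_lines(hosts, task):
    # Serial, not parallel: cleanup rewrites SSTables, so running it on one
    # node at a time bounds the extra disk and IO load on the cluster.
    for host in hosts:
        yield "!log beginning `nodetool cleanup' on %s (%s)" % (host, task)

for line in cleanup_log_lines(hosts, task):
    print(line)
```

Running cleanup serially matters here because restbase1004 was already at 3% free disk; a cluster-wide cleanup would multiply the temporary space pressure.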
[17:54:52] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 28 ESP OK [17:55:02] RECOVERY - IPsec on cp3013 is OK: Strongswan OK - 28 ESP OK [17:55:13] 6operations, 7Mail: Mails from MediaWiki seem to get (partially) lost - https://phabricator.wikimedia.org/T121105#1881393 (10Lydia_Pintscher) Another one: https://www.wikidata.org/w/index.php?title=Wikidata:Project_chat&diff=283743387&oldid=283679751 [17:55:21] RECOVERY - IPsec on cp3009 is OK: Strongswan OK - 28 ESP OK [17:55:21] RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 28 ESP OK [17:55:34] !log manually enabled ipsec rules in iptables on kafka10xx - puppet disabled for now until I can fix the puppetization of it... [17:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:56:16] 7Puppet, 6Analytics-Backlog, 10Analytics-Wikimetrics: Cleanup Wikimetrics puppet module so it can run puppet continuously without own puppetmaster {dove} - https://phabricator.wikimedia.org/T101763#1881402 (10Milimetric) >>! In T101763#1864407, @yuvipanda wrote: > Hello! After every time we change any fundam... [17:57:20] (03PS1) 10Jcrespo: Reconfiguring mysql for db1049 and all s5 codfw databases [puppet] - 10https://gerrit.wikimedia.org/r/259292 [18:01:21] (03CR) 10Jcrespo: [C: 032] Reconfiguring mysql for db1049 and all s5 codfw databases [puppet] - 10https://gerrit.wikimedia.org/r/259292 (owner: 10Jcrespo) [18:01:57] 6operations, 10Dumps-Generation, 10hardware-requests: determine hardware needs for dumps in eqiad (boxes out of warranty, capacity planning) - https://phabricator.wikimedia.org/T118154#1881432 (10ArielGlenn) (After chat with mark on IRC) Currently we have 4 snapshot hosts, one of which is dedicated for en wp... 
[18:02:31] (03PS1) 10BBlack: kafka::server - fix ESP ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/259293 [18:03:02] (03CR) 10BBlack: [C: 032 V: 032] kafka::server - fix ESP ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/259293 (owner: 10BBlack) [18:03:55] 6operations, 10ops-codfw, 5Patch-For-Review: return pollux to spares - https://phabricator.wikimedia.org/T117423#1881438 (10Dzahn) @akosiaris gotcha. thanks for explaining! the re-using of names can make it a bit tricky. hmm..it made me think maybe it's better to change the name when VMizing but that's not a... [18:04:02] (03PS1) 10Cmjohnson: Adding mgmt with wmf only dns for 8 misc servers bug: task# T121532 [dns] - 10https://gerrit.wikimedia.org/r/259294 [18:05:25] (03PS2) 10Cmjohnson: Adding mgmt with wmf only dns for 8 misc servers bug: task# T121532 [dns] - 10https://gerrit.wikimedia.org/r/259294 [18:05:41] RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:05:56] (03PS1) 10Andrew Bogott: clean-pam-config: move backupfiles do a different dir [puppet] - 10https://gerrit.wikimedia.org/r/259296 (https://phabricator.wikimedia.org/T121533) [18:06:48] (03CR) 10Cmjohnson: [C: 032] Adding mgmt with wmf only dns for 8 misc servers bug: task# T121532 [dns] - 10https://gerrit.wikimedia.org/r/259294 (owner: 10Cmjohnson) [18:10:07] 6operations, 10RESTBase-Cassandra: track/alert cassandra certs expiration - https://phabricator.wikimedia.org/T120662#1881469 (10Dzahn) [18:10:09] 6operations, 7HTTPS, 7Icinga, 7Monitoring: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1881468 (10Dzahn) [18:11:57] 6operations, 6Performance-Team: Provision additional jobrunners - https://phabricator.wikimedia.org/T121549#1881473 (10ori) 3NEW [18:11:59] hey bblack, we can probably do something so that one of the kafka brokers is not a leader for any partition [18:12:05] this would mean by default 
caches wouldn't produce to it [18:12:09] would that make testing safer? [18:12:35] it's working now, so not really an issue anymore [18:12:38] ok [18:13:28] there's a data dropout from about 17:05 -> 17:50 [18:13:31] http://ganglia.wikimedia.org/latest/graph.php?r=2hr&z=xlarge&c=Analytics+Kafka+cluster+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [18:14:04] I'm not sure why that didn't trigger a mass of kafka alerts, but basically only eqiad was sending kafka data to kafka10xx during that window, not the other 3x DCs' caches. [18:14:36] will it have spooled that up and replayed it in the long run, or just lost? [18:14:59] (03CR) 10EBernhardson: [C: 04-1] "needs a cron run per elasticsearch cluster" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/258068 (https://phabricator.wikimedia.org/T120843) (owner: 10EBernhardson) [18:15:01] i doubt it, depends on the caches. varnishkafka's buffers aren't that big [18:15:43] yeah, lots of delivery errors [18:15:43] (03PS1) 10Ori.livneh: Convert mw1161 to a job runner [puppet] - 10https://gerrit.wikimedia.org/r/259298 (https://phabricator.wikimedia.org/T121549) [18:15:43] hm [18:15:55] (03PS2) 10Ori.livneh: Convert mw1161 to a job runner [puppet] - 10https://gerrit.wikimedia.org/r/259298 (https://phabricator.wikimedia.org/T121549) [18:15:59] bblack there were varnishkafka alerts [18:16:04] (03CR) 10Ori.livneh: [C: 032 V: 032] Convert mw1161 to a job runner [puppet] - 10https://gerrit.wikimedia.org/r/259298 (https://phabricator.wikimedia.org/T121549) (owner: 10Ori.livneh) [18:16:25] (03CR) 10Rush: clean-pam-config: move backupfiles do a different dir (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/259296 (https://phabricator.wikimedia.org/T121533) (owner: 10Andrew Bogott) [18:16:52] ottomata: I saw a few, but we had 83x cache hosts unable to connect to kafka brokers at all for 45 minutes. I definitely didn't see all that [18:17:50] e.g. 
17:49 < icinga-wm> PROBLEM - Varnishkafka Delivery Errors per minute on cp2008 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [20000.0] [18:18:13] 6operations, 10RESTBase-Cassandra: track/alert cassandra certs expiration - https://phabricator.wikimedia.org/T120662#1881511 (10Dzahn) We can use the same method as in T116332 here with `modules/nagios_common/files/check_commands/check_ssl_certfile` that we can install where the cert is ([[ https://gerrit.wik... [18:19:02] <_joe_> ori: you're using depooled machines? [18:19:16] <_joe_> seems sensible [18:19:25] only mw1161 is depooled [18:19:33] not sure for what reason [18:19:46] <_joe_> uhm, look at the git history on palladium [18:19:48] !log changing db2019 master to be db1042 instead of m4-master [18:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:19:52] <_joe_> that /might/ help [18:20:15] (03PS5) 10Giuseppe Lavagetto: puppet: extract the common parts of operations/puppet git clone [puppet] - 10https://gerrit.wikimedia.org/r/259225 [18:20:31] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppet: extract the common parts of operations/puppet git clone [puppet] - 10https://gerrit.wikimedia.org/r/259225 (owner: 10Giuseppe Lavagetto) [18:20:45] _joe_: > Added mw1041, depooled mw1161 (retroactive commit by joe) [18:20:46] (03PS1) 10Chad: ci: remove elasticsearch from browsertest slaves [puppet] - 10https://gerrit.wikimedia.org/r/259301 (https://phabricator.wikimedia.org/T89083) [18:21:36] 6operations, 6Performance-Team, 5Patch-For-Review: Provision additional jobrunners - https://phabricator.wikimedia.org/T121549#1881540 (10ori) Re-provisioning mw1161 went very smoothly. See attached Puppet log. 
{F3103807} [18:22:04] <_joe_> ori: heh, god damn [18:22:19] it was a busy time :) [18:22:22] <_joe_> I found those uncommitted changes, from $random_opsen [18:23:10] (03CR) 10Andrew Bogott: clean-pam-config: move backupfiles do a different dir (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/259296 (https://phabricator.wikimedia.org/T121533) (owner: 10Andrew Bogott) [18:24:03] ottomata: 2 new kafka servers...analytics vlan? [18:24:16] <_joe_> cmjohnson1: I guess not [18:25:37] (03PS1) 10Giuseppe Lavagetto: conftool: fixup for e029eda [puppet] - 10https://gerrit.wikimedia.org/r/259302 [18:25:41] (03CR) 10Rush: clean-pam-config: move backupfiles do a different dir (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/259296 (https://phabricator.wikimedia.org/T121533) (owner: 10Andrew Bogott) [18:26:13] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/259302 (owner: 10Giuseppe Lavagetto) [18:28:05] (03PS2) 10Ori.livneh: Convert mw1162-1169 to job runners [puppet] - 10https://gerrit.wikimedia.org/r/259207 (https://phabricator.wikimedia.org/T121549) [18:28:43] (03PS2) 10Andrew Bogott: clean-pam-config: move backupfiles to a different dir [puppet] - 10https://gerrit.wikimedia.org/r/259296 (https://phabricator.wikimedia.org/T121533) [18:29:42] cmjohnson1: nope [18:30:02] ottomata: should we give them the kafka100x naming? [18:30:13] i think so. 
unless you think that will be confusing [18:30:18] i'd call them kafka1001 and kafka1002 [18:30:58] okay...that works for me [18:32:07] alternatively, eventbus1001 [18:32:32] avoids the issue of multiple kafka clusters, with only one called 'kafka' [18:32:34] the eventbus is coming, and everybody's jumping, new york to san francisco, an intercity disco [18:32:47] whichever you prefer.....eventbus will probably work out better [18:32:58] (03PS3) 10Ori.livneh: Convert mw1162-1169 to job runners [puppet] - 10https://gerrit.wikimedia.org/r/259207 (https://phabricator.wikimedia.org/T121549) [18:33:42] I think my preference is slightly for eventbus, but no strong feelings either way [18:33:44] <_joe_> ori: ahahahahahhaahha [18:33:44] (03CR) 10Ottomata: "These are role classes. I'd be fine with setting values of kafka::server (from the kafka module) using automatic hiera values in a specif" [puppet] - 10https://gerrit.wikimedia.org/r/258220 (https://phabricator.wikimedia.org/T120957) (owner: 10Ottomata) [18:33:59] ottomata: are you cool with eventbus? [18:34:05] (03CR) 10Ori.livneh: [C: 032 V: 032] Convert mw1162-1169 to job runners [puppet] - 10https://gerrit.wikimedia.org/r/259207 (https://phabricator.wikimedia.org/T121549) (owner: 10Ori.livneh) [18:34:05] hmmm [18:34:06] no [18:34:07] * robh has the naming infrastructure page open [18:34:21] we will colocate the eventbus service there for now, but probably not forever [18:34:34] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, 5iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1881604 (10Dzahn) >>! In T116312#1881062, @Milimetric wrote: > @Dzahn: I wasn't aware that it had chan... [18:34:42] this is the main kafka cluster, and isn't limited to only eventbus stuff. e.g. Ori may produce to it directly for cache purges, no? [18:34:51] (btw, also in a meeting right now...)
[18:34:55] kafka is the event bus [18:35:12] isn't the name for the rest service still open? [18:35:26] yes, it's for purges as well, and that is not going to go through eventbus [18:35:40] kafka it is [18:35:42] so yea, kafka it is [18:36:06] I guess I'm calling this kafka instance the event bus [18:36:15] so, question about the kafka hostnames [18:36:23] why are there 1012+ [18:36:36] different kafka instances, for different purposes [18:36:42] 1012, 1013, 1014, 1018, 1020, 1022. [18:36:49] its odd they arent in sequence? [18:36:51] this kafka instance is the main kafka cluster [18:36:51] well [18:36:53] main-eqiad [18:36:57] this kafka cluster [18:37:09] the one in codfw will be main-codfw (unless we can think of a better name than main) [18:37:19] I proposed eventbus.. [18:37:20] the current analytics one is analytics-eqiad [18:37:36] !log Depooled and drained mw1161-1169 app servers, now re-purposing as job runners, per T121549 [18:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:37:42] ottomata: Do you know why the current kafka hostnames are not in sequence? It poses an issue on what to name the new ones if there are ongoing hostname changes I'm not aware of. [18:37:45] yeah, but it isnt' just for eventbus, eventbus == schemed events + validation to specific topics [18:37:55] robh yes [18:37:57] historical reason [18:37:58] s [18:38:02] to me, that's the rest proxy [18:38:11] just renaming analyitics to kafka? if so, that was not a good way to do it imo. [18:38:13] the node names were originally analytics1012, analytics1022, etc. [18:38:15] while event bus is basically kafka with JSON events [18:38:24] why didnt you guys redo the hostname to the lower number? [18:38:34] (and follow a easily repeatable convention?) 
the brokers have static 'ids' assigned to them [18:38:43] that aren't really changeable [18:38:47] and we upgraded this cluster, we didn't replace it [18:38:52] so I didn't want the mapping of hostname to id to change [18:38:54] e.g. [18:38:57] kafka1001 => 22 [18:39:03] better is the way it is now [18:39:05] kafka1022 => 22 [18:39:06] that is not very scalable. [18:39:22] i have a feeling there was a reason I didn't do kafka1022 => 1022 [18:39:25] but i can't remember [18:39:32] but ok, i cannot fix it now, so is there any reason i cannot call these kafka1001, kafka1002? [18:39:34] i want the new ones to be kafka1001 => 1001...unless i remember why [18:40:06] so the non-sequential stuff will be confusing down the road when stuff ages, just fyi. [18:40:15] in the future, let's try to be a bit more sane in our hostname sequencing [18:40:18] +1 [18:40:23] robh would you prefer to number them differently? [18:40:30] i mean, nodes will change, so will hostnames, no? [18:40:34] not sure how to keep them sequential [18:40:39] it's good to shoot for it [18:40:43] ottomata: I would have preferred it if when the hostnames changed from analytics1022 you would have put it to kafka1001 at that time [18:40:48] and make the backend changes needed then [18:41:05] robh, then the mappings would have been very confusing in topic/partition listings [18:41:08] That would have followed a more repeatable and scalable sequencing. [18:41:16] when showing what replicas are in sync, we see the broker ids, not the hostnames [18:41:20] so one listing versus the entirety of how we sequence hostnames [18:41:33] so decided your change trumped the infrastructure policy [18:41:34] but ok. [18:41:37] haha [18:41:37] ;p [18:41:40] I personally think that naming nodes by the function they provide is usually the most helpful; for example, it's better to call something an "api" server than "php" server [18:41:43] i think we worked together on these names, no?!
:p [18:41:49] no [18:41:59] if i had been involved i would have recalled this no? [18:42:17] someone else helped me pick these...at least we argued about calling these kafka10xx vs something else [18:42:26] or maybe it was cmjohnson1 don't remember [18:42:29] aHHHh i am in a meeting [18:42:30] SHHHHH [18:42:40] You understand I'm not arguing the use if kafka [18:42:44] (ya) [18:42:51] gwicke: 20:32 < grrrit-wm> (PS3) Ori.livneh: Convert mw1162-1169 to job runners [puppet] - https://gerrit.wikimedia.org/r/259207 (https://phabricator.wikimedia.org/T121549) [18:42:55] gwicke: 10 minutes ago [18:42:57] but the fact that the hostname sequences were arbitrarily set to make it so you guys didnt have to do a config change [18:43:05] robh: if we had started new, i would have started at 1001 [18:43:06] gwicke: this is why mw* servers are generically made [18:43:10] but we weren't starging with new nodes [18:43:13] er, named [18:43:18] kafka was already running on them [18:43:27] so hostnames are typically a reinstall [18:43:33] but you did something odd and non standard is my point of issue [18:43:36] thats all [18:43:42] gwicke: if these were named jobNNNN, ori would have to modify racktables, tell chris to grab a label maker and relabel them, etc. [18:43:42] yes, but we weren't reinstalling the whole cluster all at once [18:43:56] and there were configs that we couldn't change, namely the broker ids [18:44:04] ottomata: so why havent they been reinstalled since and had their hostnames sequnce lowered? 
[18:44:10] if a config cnanot change [18:44:12] its a broken config [18:44:19] a point of a config is to modify it =P [18:44:20] paravoid: my point is that they are called "mw" rather than "php" or "HHVM" [18:44:22] gwicke: so that's where the fairly "generic" names come -- and why renames are generally frowned upon, unless it's absolutely necessary [18:44:30] robh, it is a unique broker id, its not just a config, its iding a broker in a distributed cluster [18:44:46] ottomata: but again, im not blocking this install now. im not sure if you dont see my point or merely disagree [18:44:48] mediawiki is not their function, it's their software [18:44:56] robh, i see your point :) [18:44:57] if you disagree that is fine, folks dont have to agree with me ;] [18:45:02] cool [18:45:10] in general i agree with you 100%, i just think in this case there was good reason not to renumber [18:45:10] appserver, api server, imagescaler, videoscaler and job runner are their functions, so far [18:45:12] buuuuuut ja [18:45:18] but anyway [18:45:22] ottomata: yep, but everyone thinks their reason is good ;] [18:45:24] i will argue about names for new hosts in 20 mins [18:45:25] ... [18:45:30] MEETING [18:45:31] SHHHH [18:45:34] hehe, the new hosts are kafka1001-1002 [18:45:41] i wasnt arguign that, its cool [18:45:51] (again, i wasnt mad either, tone in irc is hard!) 
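The naming argument above hinges on one constraint: a Kafka broker id is fixed when the broker first registers, so renaming the analytics10xx hosts kept the hostname number equal to the broker id (kafka1022 => 22, as quoted in the log). A small sketch of why resequencing the hostnames would have hurt readability — ids other than 22 here are assumed to follow the same pattern:

```python
# Sketch of the constraint argued above: broker ids are immutable, hostnames
# are not. Only the kafka1022 => 22 mapping is stated in the log; the other
# id values are assumed for illustration.
broker_ids = {
    "kafka1012": 12, "kafka1013": 13, "kafka1014": 14,
    "kafka1018": 18, "kafka1020": 20, "kafka1022": 22,
}

def resequenced(broker_ids):
    # If the same brokers had instead been renamed kafka1001..kafka1006,
    # the immutable ids would no longer line up with the hostnames, making
    # replica listings (which show broker ids, not hostnames) harder to read.
    return {"kafka10%02d" % (i + 1): bid
            for i, bid in enumerate(sorted(broker_ids.values()))}

print(resequenced(broker_ids))
# e.g. kafka1001 would map to broker id 12, kafka1006 to 22
```

This is the trade-off robh and ottomata are debating: sequential hostnames are the infrastructure convention, but hostname == broker id keeps `kafka-topics --describe` output legible, and only brand-new brokers (the future kafka1001/kafka1002) can have both.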
[18:45:53] paravoid: sure, there's different levels of granularity that can be applied [18:47:49] (03CR) 10Andrew Bogott: [C: 04-1] "So once we remove /etc/security/access.conf from base images, that means that anyone at all can ssh in until after the first puppet run, r" [puppet] - 10https://gerrit.wikimedia.org/r/257411 (https://phabricator.wikimedia.org/T120710) (owner: 10coren) [18:49:14] (03PS2) 10Chad: new_wmf_service.py: fix pep8 warnings [puppet] - 10https://gerrit.wikimedia.org/r/256311 [18:49:26] (03PS1) 10Ori.livneh: Follow-up for Ie5a79a8c17d: delist mw1162-1169 in conftool manifest, too [puppet] - 10https://gerrit.wikimedia.org/r/259306 (https://phabricator.wikimedia.org/T121549) [18:49:33] 6operations, 10EventBus, 10MediaWiki-Cache, 6Performance-Team, 7Availability: setup kafka1001 & kafka1002 - https://phabricator.wikimedia.org/T121553#1881640 (10RobH) 3NEW a:3Cmjohnson [18:49:43] 6operations, 10ops-eqiad, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka1001 & kafka1002 - https://phabricator.wikimedia.org/T121553#1881640 (10RobH) [18:49:43] (03CR) 10jenkins-bot: [V: 04-1] Follow-up for Ie5a79a8c17d: delist mw1162-1169 in conftool manifest, too [puppet] - 10https://gerrit.wikimedia.org/r/259306 (https://phabricator.wikimedia.org/T121549) (owner: 10Ori.livneh) [18:49:56] (03PS2) 10Ori.livneh: Follow-up for Ie5a79a8c17d: delist mw1162-1169 in conftool manifest, too [puppet] - 10https://gerrit.wikimedia.org/r/259306 (https://phabricator.wikimedia.org/T121549) [18:50:07] (03CR) 10jenkins-bot: [V: 04-1] new_wmf_service.py: fix pep8 warnings [puppet] - 10https://gerrit.wikimedia.org/r/256311 (owner: 10Chad) [18:50:40] 6operations, 10ops-eqiad, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka1001 & kafka1002 - https://phabricator.wikimedia.org/T121553#1881653 (10Ottomata) You can hand this off to me in that last step. 
[18:51:48] (03Abandoned) 10Chad: new_wmf_service.py: fix pep8 warnings [puppet] - 10https://gerrit.wikimedia.org/r/256311 (owner: 10Chad) [18:51:55] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1881671 (10EBernhardson) I should add all connections are opened by the hadoop workers for communication to the elastics... [18:53:16] there has just been a spike of reads on commons [18:54:46] 6operations, 10ops-codfw, 5Patch-For-Review: rack 8 new misc systems - https://phabricator.wikimedia.org/T120885#1881688 (10RobH) [18:55:03] 6operations, 10ops-codfw, 5Patch-For-Review: codfw: rack 8 new misc systems - https://phabricator.wikimedia.org/T120885#1881689 (10RobH) [18:55:39] lots of Revision::fetchFromConds 127.0.0.1 [18:57:23] 6operations, 10ops-codfw, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1881702 (10RobH) 3NEW a:3RobH [18:57:27] !log repooling cp1053 (eqiad text cache) [18:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:00:04] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151215T1900). 
[19:00:13] !log starting branch cut for 1.27.0-wmf.9 [19:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:00:54] 6operations, 10Analytics, 10Traffic: Upgrade kafka for native TLS and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#1881737 (10BBlack) 3NEW [19:01:09] 6operations, 10Analytics-Cluster, 10Traffic, 5Patch-For-Review: Secure inter-datacenter web request log (Kafka) traffic - https://phabricator.wikimedia.org/T92602#1881747 (10BBlack) 5Open>3Resolved a:3BBlack [19:01:11] 6operations: Encrypt all the things - https://phabricator.wikimedia.org/T111653#1881749 (10BBlack) [19:01:16] (03PS1) 10Jcrespo: Repool db1042 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259308 [19:02:01] 6operations, 10Analytics, 10Traffic: Upgrade kafka for native TLS and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#1881750 (10Ottomata) a:3Ottomata [19:02:04] 6operations, 10Analytics, 10Analytics-Cluster, 10Traffic: Upgrade kafka for native TLS and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#1881752 (10BBlack) [19:02:43] 6operations, 10Analytics, 6Analytics-Backlog, 10Analytics-Cluster, 10Traffic: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#1881753 (10Ottomata) 3NEW a:3Ottomata [19:03:08] 6operations, 10Analytics, 10Analytics-Cluster, 10Traffic: Enable Kafka native TLS in 0.9 and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#1881760 (10Ottomata) [19:03:25] (03CR) 10Ori.livneh: [C: 032] Follow-up for Ie5a79a8c17d: delist mw1162-1169 in conftool manifest, too [puppet] - 10https://gerrit.wikimedia.org/r/259306 (https://phabricator.wikimedia.org/T121549) (owner: 10Ori.livneh) [19:05:43] (03PS6) 10Ottomata: Increase size of programname field in remote syslog template [puppet] - 10https://gerrit.wikimedia.org/r/259271 
(https://phabricator.wikimedia.org/T120874) [19:06:29] (03PS1) 10GWicke: Reduce the number of restbase runners to limit parallelism [puppet] - 10https://gerrit.wikimedia.org/r/259309 [19:07:14] (03PS2) 10Ori.livneh: Reduce the number of restbase runners to limit parallelism [puppet] - 10https://gerrit.wikimedia.org/r/259309 (owner: 10GWicke) [19:07:39] gwicke: LGTM. I see you added mobrovac as a reviewer -- would you like me to hold off on merging this until he has a chance to look, or is it good to go? [19:08:02] (03CR) 10Ottomata: [C: 032] Increase size of programname field in remote syslog template [puppet] - 10https://gerrit.wikimedia.org/r/259271 (https://phabricator.wikimedia.org/T120874) (owner: 10Ottomata) [19:08:18] ori: it's good to go, but it's good to have mobrovac be aware of it as well [19:08:24] !log merged change to allow longer programnames in remote rsyslog config. [19:08:27] nod [19:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:08:37] (03PS3) 10Ori.livneh: Reduce the number of restbase runners to limit parallelism [puppet] - 10https://gerrit.wikimedia.org/r/259309 (owner: 10GWicke) [19:08:43] (03CR) 10Ori.livneh: [C: 032 V: 032] Reduce the number of restbase runners to limit parallelism [puppet] - 10https://gerrit.wikimedia.org/r/259309 (owner: 10GWicke) [19:09:35] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, and 2 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1881764 (10Dzahn) [19:10:16] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, and 2 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1746196 (10Dzahn) I re-added vm-requests. But if we actually use stat1001 for it, then it doesn't need either kin... 
[19:14:00] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, and 2 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1881789 (10Joe) I think piwik will need to have its own database hosted on the very same machine, am I correct?... [19:16:53] (03PS1) 10Jcrespo: Repool db1049 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259313 [19:18:24] 6operations, 10Traffic, 7HTTPS: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580#1881801 (10BBlack) [19:18:25] 6operations: Encrypt all the things - https://phabricator.wikimedia.org/T111653#1881800 (10BBlack) [19:19:16] ottomata: so https://phabricator.wikimedia.org/T117727 -> https://phabricator.wikimedia.org/T97294 is all done right? [19:19:44] PROBLEM - Apache HTTP on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:54] PROBLEM - Swift HTTP backend on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:54] PROBLEM - HHVM rendering on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:27] bblack yes! [19:20:37] resolving.. [19:20:45] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 5Patch-For-Review: Turn off webrequest udp2log instances. - https://phabricator.wikimedia.org/T97294#1881811 (10Ottomata) [19:20:48] 6operations, 6Analytics-Backlog, 6Analytics-Kanban, 7Monitoring, 5Patch-For-Review: Turn off sqstat udp2log instance - https://phabricator.wikimedia.org/T117727#1881810 (10Ottomata) 5Open>3Resolved [19:20:54] 6operations: Encrypt all the things - https://phabricator.wikimedia.org/T111653#1881813 (10Ottomata) [19:20:56] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 5Patch-For-Review: Turn off webrequest udp2log instances. 
- https://phabricator.wikimedia.org/T97294#1881812 (10Ottomata) 5Open>3Resolved [19:21:00] 6operations, 10Analytics-Cluster, 5Patch-For-Review: Set up ops kafkatee instance as part of udp2log transition - https://phabricator.wikimedia.org/T96616#1881814 (10Ottomata) [19:21:14] PROBLEM - Swift HTTP backend on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:21:54] PROBLEM - dhclient process on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:22:03] PROBLEM - salt-minion processes on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:22:14] PROBLEM - nutcracker port on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:22:23] PROBLEM - Disk space on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:22:33] PROBLEM - Check size of conntrack table on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:22:35] PROBLEM - RAID on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:22:35] PROBLEM - SSH on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:53] PROBLEM - configured eth on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:23:04] PROBLEM - nutcracker process on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:23:14] PROBLEM - HHVM processes on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:23:15] !log powercycling mw1129 [19:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:23:33] PROBLEM - puppet last run on mw1129 is CRITICAL: Timeout while attempting connection [19:23:33] PROBLEM - DPKG on mw1129 is CRITICAL: Timeout while attempting connection [19:24:23] PROBLEM - HHVM rendering on mw1154 is CRITICAL: Connection timed out [19:24:24] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:25] PROBLEM - HHVM rendering on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:33] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out [19:24:33] PROBLEM - HHVM rendering on mw1158 is CRITICAL: Connection timed out [19:24:53] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out [19:24:54] PROBLEM - HHVM rendering on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:04] PROBLEM - HHVM rendering on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:14] PROBLEM - HHVM rendering on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:14] PROBLEM - Swift HTTP backend on ms-fe1003 is CRITICAL: Connection timed out [19:25:25] PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - rendering_80 - Could not depool server mw1153.eqiad.wmnet because of too many down!: swift_80 - Could not depool server ms-fe1004.eqiad.wmnet because of too many down! [19:25:33] PROBLEM - HHVM rendering on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:34] PROBLEM - HHVM rendering on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:43] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - rendering_80 - Could not depool server mw1153.eqiad.wmnet because of too many down!: swift_80 - Could not depool server ms-fe1004.eqiad.wmnet because of too many down! 
[19:25:43] RECOVERY - dhclient process on mw1129 is OK: PROCS OK: 0 processes with command name dhclient [19:25:45] RECOVERY - salt-minion processes on mw1129 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:25:45] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:45] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:45] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:53] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:53] hmm [19:25:54] RECOVERY - nutcracker port on mw1129 is OK: TCP OK - 0.000 second response time on port 11212 [19:26:04] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:26:05] RECOVERY - Disk space on mw1129 is OK: DISK OK [19:26:14] RECOVERY - Check size of conntrack table on mw1129 is OK: OK: nf_conntrack is 0 % full [19:26:24] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - rendering_80 - Could not depool server mw1153.eqiad.wmnet because of too many down!: swift_80 - Could not depool server ms-fe1004.eqiad.wmnet because of too many down! 
[19:26:24] RECOVERY - RAID on mw1129 is OK: OK: no RAID installed [19:26:24] RECOVERY - SSH on mw1129 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [19:26:32] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:26:44] RECOVERY - configured eth on mw1129 is OK: OK - interfaces up [19:27:03] RECOVERY - nutcracker process on mw1129 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [19:27:13] RECOVERY - HHVM processes on mw1129 is OK: PROCS OK: 6 processes with command name hhvm [19:27:23] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:27:34] RECOVERY - puppet last run on mw1129 is OK: OK: Puppet is currently enabled, last run 39 minutes ago with 0 failures [19:27:35] RECOVERY - DPKG on mw1129 is OK: All packages OK [19:27:45] RECOVERY - Apache HTTP on mw1129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.259 second response time [19:27:55] RECOVERY - HHVM rendering on mw1129 is OK: HTTP OK: HTTP/1.1 200 OK - 65807 bytes in 1.309 second response time [19:27:56] what's going on? [19:28:24] swift, rendering paging, lots of hhvm dead above? [19:28:31] anyone already have a clue? [19:28:37] not yet [19:28:39] mw1129 first just looked like a normal crash, and i powercycled it [19:28:43] but that's all so far [19:28:57] the swift box it talks about is up and running swift-proxy [19:29:03] PROBLEM - Swift HTTP frontend on ms-fe1003 is CRITICAL: Connection timed out [19:29:29] Dec 15 19:19:57 mw1158 kernel: [2397215.003116] Task in /mediawiki/job/25283 killed as a result of limit of /mediawiki/job/25283 [19:29:35] swift-proxy start/running [19:29:40] looks like a job got oomkilled on mw1158? 
[19:30:05] and there was a puppet run just before doing rsyslog change [19:30:12] hah [19:30:15] yes [19:30:16] that was it [19:30:25] rsyslog changes on swift cause swift outages [19:30:42] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 15119 bytes in 0.082 second response time [19:30:42] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.011 second response time [19:30:53] RECOVERY - HHVM rendering on mw1153 is OK: HTTP OK: HTTP/1.1 200 OK - 65806 bytes in 0.114 second response time [19:30:54] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.042 second response time [19:30:54] RECOVERY - HHVM rendering on mw1158 is OK: HTTP OK: HTTP/1.1 200 OK - 65807 bytes in 0.119 second response time [19:30:55] PROBLEM - Swift HTTP backend on ms-fe1004 is CRITICAL: Connection timed out [19:30:56] i'm trying to dig up the details [19:30:57] :) so the puppet run fixed it or a person? [19:31:04] PROBLEM - Swift HTTP frontend on ms-fe1004 is CRITICAL: Connection timed out [19:31:15] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.044 second response time [19:31:23] RECOVERY - HHVM rendering on mw1157 is OK: HTTP OK: HTTP/1.1 200 OK - 65806 bytes in 0.117 second response time [19:31:25] RECOVERY - HHVM rendering on mw1155 is OK: HTTP OK: HTTP/1.1 200 OK - 65806 bytes in 0.119 second response time [19:31:29] * apergos peeks in [19:31:30] looks at ms-fe1004 hrmm [19:31:35] RECOVERY - HHVM rendering on mw1159 is OK: HTTP OK: HTTP/1.1 200 OK - 65807 bytes in 0.120 second response time [19:31:44] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:31:47] image uploads on wikipedia appear to be returning a 503 often for me [19:31:50] Indeed, did someone take an actual corrective action or did it just flap back? 
[19:31:50] rsyslog change is: https://gerrit.wikimedia.org/r/#/c/259271/ [19:31:54] RECOVERY - HHVM rendering on mw1160 is OK: HTTP OK: HTTP/1.1 200 OK - 65806 bytes in 0.127 second response time [19:31:54] RECOVERY - HHVM rendering on mw1156 is OK: HTTP OK: HTTP/1.1 200 OK - 65806 bytes in 0.114 second response time [19:32:04] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.039 second response time [19:32:04] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.042 second response time [19:32:04] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.042 second response time [19:32:05] 09:56 ori: paravoid: umm, shit -- what was that awful thing that would happen when rsyslog was restarted? [19:32:05] 09:57 paravoid: swift dies. [19:32:09] I didn't take any actions, was only looking [19:32:11] for example, https://upload.wikimedia.org/wikipedia/commons/thumb/8/82/Parral_Samay_Huasi.jpg/1280px-Parral_Samay_Huasi.jpg prints Error: 503, Service Unavailable at Tue, 15 Dec 2015 19:30:56 GMT [19:32:13] runs puppet on ms-fe1004 to look for rsyslog change [19:32:14] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.040 second response time [19:32:24] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.046 second response time [19:32:26] niedzielski: known [19:32:39] ori: thank you [19:32:42] doesnt see any change or error.. service is running just like it was on ms-fe1002... 
[19:32:43] RECOVERY - HHVM rendering on mw1154 is OK: HTTP OK: HTTP/1.1 200 OK - 65806 bytes in 0.123 second response time [19:32:47] only looking but it seems it's recovering on its own as I guess [19:32:57] i forget what needs to be done to fix it, if anything [19:33:24] ori: thx for the (albeit depressing) paste share ;] [19:33:57] swift-proxy start/running [19:33:58] https://wikitech.wikimedia.org/wiki/Incident_documentation/20140910-swift-syslog "rolling restart of swift frontends" but maybe not needed here [19:34:07] ms-fe1004:~# service swift-proxy status [19:34:19] icinga-wm no recovery though for that? [19:34:35] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [1000.0] [19:34:43] (03CR) 10Ottomata: [C: 032 V: 032] Upstream release of 2.1.0 [debs/python-pykafka] (debian) - 10https://gerrit.wikimedia.org/r/257974 (owner: 10Ottomata) [19:34:50] is it actually up? [19:34:57] probably [19:35:25] yea, i mean, i see swift-proxy-service processes in top and they are changing too [19:35:29] <_joe_> are you guys on it? [19:35:38] <_joe_> I have people here for dinner [19:35:39] no, still has issues [19:35:43] <_joe_> call me if I'm needed [19:35:45] _joe_: I think we're ok-ish [19:35:48] <_joe_> (on the phone I mean) [19:35:49] haha [19:35:50] https://wikitech.wikimedia.org/w/index.php?title=Swift/TODO&diff=115824&oldid=115809 [19:35:58] '* Figure out a way to fix the [[Incident_documentation/20131205-Swift|"restarting syslog kills Swift"]] bug' [19:36:15] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [19:36:17] two years and ten days ago [19:36:25] ouch [19:36:38] has anyone restarted any of the swift proxies yet manually? [19:36:48] no, and i think it is needed [19:36:48] i'm doing that right now on ms-fe1004 [19:36:53] or... 
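[Editor's note: the "rolling restart of swift frontends" from the 20140910 incident doc, as actually carried out below on ms-fe1001 thru ms-fe1004, amounts to restarting swift-proxy on each frontend in turn. A minimal sketch; the host list and service name come from this log, while the ssh/sudo invocation and the pause between hosts are assumptions. DRYRUN=echo just prints the commands, so the loop is safe to run anywhere.]

```shell
# Rolling restart of the swift proxy frontends, one host at a time.
# Set DRYRUN= (empty) to actually execute instead of printing.
DRYRUN=echo
for host in ms-fe1001 ms-fe1002 ms-fe1003 ms-fe1004; do
  $DRYRUN ssh "$host.eqiad.wmnet" 'sudo service swift-proxy restart'
  $DRYRUN sleep 30   # give the proxy time to rejoin the LVS pool
done
```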
[19:36:59] mutante: go ahead [19:37:07] mutante: and the rest, plz [19:37:14] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [19:37:38] !log ms-fe1004, swift-proxy-server stop/start [19:37:42] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.061 second response time [19:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:37:51] that was instant [19:38:20] yeah we'll need to do all 4 [19:38:36] i assume daniel is doing them all and jumping in will confuse it... [19:38:41] I [19:38:45] I'm assuming that too [19:38:48] yes [19:39:04] RECOVERY - Swift HTTP backend on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.015 second response time [19:39:05] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [19:39:14] RECOVERY - Swift HTTP frontend on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.006 second response time [19:39:24] RECOVERY - Swift HTTP frontend on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.004 second response time [19:39:32] !log ms-fe1001 thru ms-fe1003: swift-proxy-server stop/start [19:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:39:54] RECOVERY - Swift HTTP backend on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.014 second response time [19:39:54] RECOVERY - Swift HTTP backend on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.020 second response time [19:40:11] nice to see the new pybal checks work too [19:40:14] RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy [19:40:23] I didn't even see that those had gone, buried in all the messages above [19:40:28] now i wanna see the pybal... 
there we go [19:40:34] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [19:40:34] RECOVERY - Swift HTTP backend on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.016 second response time [19:40:59] should be another pybal recover from 1009 coming as well [19:41:06] one more rogue app server mw1146 from 3h ago [19:41:32] on the plus side, the incident documentation can be copy-pasted from https://wikitech.wikimedia.org/wiki/Incident_documentation/20131205-Swift [19:41:39] yeah [19:42:29] !log mw1146: hhvm restart [19:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:42:53] has anyone sent an e-mail to outage-notification@wikimedia.org ? [19:42:53] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.443 second response time [19:42:56] sca1001/sca1002 - graphoid endpoint health ? [19:43:07] oh lvs1009 is normal, it's got other issues that are irrelevant [19:43:33] probably not necessary at this point (re email) [19:44:12] unless there was secondary fallout (don't think so?) it was just the upload.wm.o service, 5xx-ing on cache misses from ~19:24 -> ~19:38 [19:44:55] RECOVERY - HHVM rendering on mw1146 is OK: HTTP OK: HTTP/1.1 200 OK - 65807 bytes in 1.165 second response time [19:44:58] (and those miss->503 rate capped out around 1300/sec during that, out of ~70K/sec total reqs) [19:45:19] that is right, 15 minutes https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?from=1450187094440&to=1450208694441&var-site=All&var-cache_type=All&var-status_type=5 [19:45:51] just the 5xx checks aren't recovered yet [19:46:03] yeah that's the dashboard I'm looking at too. 
you can select via cache_type there and see it's only in upload, too [19:46:15] waits for "data to be below the critical thresholds" [19:46:19] (and the comparative 5xx/total) [19:46:30] mutante, it has a lag :-) [19:46:46] eqiad says 70%, all others just 40% [19:47:34] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:48:03] mutante: it's possible that's because of cache hitrate differentials [19:48:44] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:49:01] alright, that looks good [19:49:48] this is a long shot, but could any of this have to do with commons increased load, even if indirectly ("people reloading more")? [19:50:14] or maybe both have a common cause [19:50:14] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:51:19] (03PS1) 10Dzahn: icinga: add logic to avoid paging for test machines [puppet] - 10https://gerrit.wikimedia.org/r/259319 [19:52:21] (03CR) 10jenkins-bot: [V: 04-1] icinga: add logic to avoid paging for test machines [puppet] - 10https://gerrit.wikimedia.org/r/259319 (owner: 10Dzahn) [19:53:38] 6operations, 10RESTBase, 10RESTBase-Cassandra: Perform cleanups to reclaim space from recent topology changes - https://phabricator.wikimedia.org/T121535#1881942 (10Eevans) [19:53:49] (03PS2) 10Dzahn: icinga: add logic to avoid paging for test machines [puppet] - 10https://gerrit.wikimedia.org/r/259319 [19:54:05] (03CR) 10Dzahn: "@Andrew Bogott: trying that here: https://gerrit.wikimedia.org/r/#/c/259319/1" [puppet] - 10https://gerrit.wikimedia.org/r/259073 (https://phabricator.wikimedia.org/T120047) (owner: 10Dzahn) [19:54:43] (03CR) 10Jcrespo: [C: 032] Repool db1042 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259308 (owner: 10Jcrespo) [19:55:01] I'll go back an incident doc [19:55:01] There are several stuck parsoid processes because of 
https://phabricator.wikimedia.org/T104523 .. I am going to try restarting parsoid on wtp1019 ... if the stuck processes aren't killed, i'll need a root to kill those for me. [19:55:25] (03CR) 10jenkins-bot: [V: 04-1] icinga: add logic to avoid paging for test machines [puppet] - 10https://gerrit.wikimedia.org/r/259319 (owner: 10Dzahn) [19:55:25] is someone around to help with it? [19:55:27] (03Abandoned) 10Yuvipanda: Revert "mediawiki_singlenode: rename defined type" [puppet] - 10https://gerrit.wikimedia.org/r/259169 (owner: 10Yuvipanda) [19:55:32] subbu: yes [19:55:35] ok. [19:55:39] (03CR) 10Jcrespo: [C: 032] Repool db1049 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259313 (owner: 10Jcrespo) [19:56:05] s/back/write/ heh [19:56:23] looks like the restart went through cleanly. [19:56:44] ok, cool [19:57:04] !log restarted parsoid on wtp1019 to kill stuck processes [19:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:57:33] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1049 and db1042 (duration: 00m 29s) [19:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:57:50] i will restart on all nodes to kill similar processes elsewhere .. that was just a test float to see if i needed root assistance. [19:59:15] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:59:34] 6operations, 10RESTBase, 10RESTBase-Cassandra: Perform cleanups to reclaim space from recent topology changes - https://phabricator.wikimedia.org/T121535#1881979 (10Eevans) Space is getting quite tight here; For example, with 1003 cleaned, there is 1.3T of free space, but after the current stream from 1004 c... 
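[Editor's note: before asking a root to kill the stuck parsoid workers, one way to enumerate them is to list processes matching a pattern whose elapsed time predates the service restart. A hypothetical sketch; PATTERN and MAX_AGE are assumptions, and on the wtp hosts the pattern would be the actual parsoid worker command line.]

```shell
# List candidate stuck workers: processes matching PATTERN that have been
# running longer than MAX_AGE seconds (i.e. since before the restart).
PATTERN='parsoid'
MAX_AGE=3600   # seconds; roughly "older than the last restart"
ps -eo pid,etimes,args |
  awk -v pat="$PATTERN" -v max="$MAX_AGE" '$0 ~ pat && $2 > max {print}'
```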
[19:59:35] (03PS3) 10Dzahn: icinga: add logic to avoid paging for test machines [puppet] - 10https://gerrit.wikimedia.org/r/259319 [20:00:09] !log restarted parsoid on all nodes to kill stuck processes (thanks to T104523) [20:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:00:56] (03CR) 10jenkins-bot: [V: 04-1] icinga: add logic to avoid paging for test machines [puppet] - 10https://gerrit.wikimedia.org/r/259319 (owner: 10Dzahn) [20:01:04] 6operations, 10ops-codfw, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1881985 (10Papaul) For the task I will be using wmf6377 in row A and wmf6379 in row bB [20:03:14] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10MediaWiki-extensions-CentralNotice, 10Traffic: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1882001 (10Nuria) @tgr: Sorry but I certainly disagree that we want to allow events of *any*... [20:03:58] 6operations, 10RESTBase, 10RESTBase-Cassandra: Perform cleanups to reclaim space from recent topology changes - https://phabricator.wikimedia.org/T121535#1882012 (10Eevans) [20:09:08] (03PS4) 10Dzahn: icinga: add logic to avoid paging for test machines [puppet] - 10https://gerrit.wikimedia.org/r/259319 [20:10:46] (03CR) 10Andrew Bogott: [C: 031] "This is just what I wanted :)" [puppet] - 10https://gerrit.wikimedia.org/r/259319 (owner: 10Dzahn) [20:17:46] (03Abandoned) 10Dzahn: openldap: add backup with ldif files [puppet] - 10https://gerrit.wikimedia.org/r/259174 (https://phabricator.wikimedia.org/T120919) (owner: 10Dzahn) [20:21:23] !log 1.27.0-wmf.9 branching complete 1h 18m 2s, checking out to tin [20:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:22:18] branching took 1h 18m? [20:22:23] thcipriani: Did you do it at home? 
[20:22:25] (03PS1) 10Yuvipanda: k8s: Have flannel do ip-masq [puppet] - 10https://gerrit.wikimedia.org/r/259325 (https://phabricator.wikimedia.org/T120561) [20:22:34] Reedy: I did. [20:22:45] I'd highly recommend against doing that [20:22:47] For that reason :) [20:22:57] I learned this very early on [20:23:16] gotta do it on a boat [20:23:23] It was 10-20 minutes from the cluster [20:23:27] (03CR) 10Yuvipanda: [C: 032] k8s: Have flannel do ip-masq [puppet] - 10https://gerrit.wikimedia.org/r/259325 (https://phabricator.wikimedia.org/T120561) (owner: 10Yuvipanda) [20:23:28] ShiveringPanda: Boat deploy or gtfo [20:23:37] can't use my key on tin, gotta forward 2 keys from labs [20:23:47] but, yes, I should in future do that. [20:23:52] I used to do it on bast1001 [20:24:00] but I think we killed git from it [20:25:20] also I think git submodule add --reference would probably be an easy win [20:25:32] Cloning core is a bitch [20:25:37] We should have a better solution for this though [20:25:47] ah, nope [20:25:51] thcipriani: bast1001 has git [20:26:10] but key forwarding... [20:26:30] is disabled for prod instances [20:26:35] do it in labs :D [20:26:45] I wonder where twentyafterfour does it [20:26:47] * Reedy opens a task [20:26:52] yeah, that's what will be done in future. [20:32:16] https://wikitech.wikimedia.org/wiki/Incident_documentation/20151215-swift-syslog [20:33:02] (03PS1) 10Papaul: Add hostname mgmt DNS for kafka200[1-2] Add production DNS for kafka200[1-2] Bug:T121558 [dns] - 10https://gerrit.wikimedia.org/r/259327 (https://phabricator.wikimedia.org/T121558) [20:33:17] where I do what? [20:33:31] branch mediawiki for deployments [20:33:42] thcipriani did it from home, and it took nearly 80 minutes [20:33:48] I do it from my laptop [20:33:55] You got a decent connection? 
:P [20:34:08] and it takes forever, but yeah I have a very fast connection [20:34:12] I used to do it from the cluster, and it was pretty quick, because I can't get a decent connection [20:34:18] hell it takes forever even doing it from production [20:34:43] It used to take < 20 minutes doing it from bast1001 [20:34:47] tin has an outdated version of git which sucks [20:34:58] use mira ;) [20:35:07] tin isn't far off being upgraded [20:35:55] mira might work [20:37:16] 6operations, 10ops-codfw, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1882383 (10Papaul) [20:37:58] !log ori@tin Synchronized php-1.27.0-wmf.8/includes/api/ApiStashEdit.php: ad2e2aeedc: Make edit stashing use named DB locks (duration: 01m 46s) [20:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:39:09] 6operations, 10ops-codfw, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1881702 (10Papaul) @RobH for the install-module update section what type of RAID level will i be using? [20:42:12] 6operations, 10ops-codfw, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1882411 (10RobH) These systems are 4 * 3TB disks, as such need to make use of that disk space and use GPT. raid10-gpt.cfg should be used for these, as it sets up t... [20:42:33] papaul: wait [20:42:40] why did you snag that? it was assigned to me ;] its ok [20:42:44] andre__: I got a little problem at phab: We have so many projects with security, that the "real" security is not shown at the dropdown, so we can't add it to a task / project [20:42:53] i planned to do this afternoon but if you are doing, keep it up! 
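[Editor's note: the slow step in the branch cut discussed above is re-cloning mediawiki/core and its submodules over a slow link. The `--reference` idea mentioned at [20:25:20] works by pointing a new clone at a local mirror, so objects are borrowed via `.git/objects/info/alternates` instead of being refetched. A self-contained toy sketch; all paths are made up, with a throwaway repo standing in for mediawiki/core.]

```shell
set -e
tmp=$(mktemp -d)
# Toy upstream repo standing in for mediawiki/core.
git init -q "$tmp/core"
git -C "$tmp/core" -c user.email=x@x -c user.name=x commit -q --allow-empty -m init
# One-time local mirror; kept fresh later with a plain `git fetch`.
git clone -q --mirror "$tmp/core" "$tmp/mirror.git"
# The branch-cut clone borrows objects from the mirror instead of refetching.
git clone -q --reference "$tmp/mirror.git" "$tmp/core" "$tmp/wmf-branch"
cat "$tmp/wmf-branch/.git/objects/info/alternates"   # points into mirror.git
```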
[20:43:06] robh: sorry [20:43:07] Luke081515: Type "Sec", and it will show up [20:43:26] papaul: i had to add a few more things is all, but its totally cool [20:43:34] csteipp: Thanks :) [20:43:38] Agree, it's painful... [20:43:38] I updated for your question and then i realized what i was answering, heh [20:43:48] robh: saw that [20:43:51] papaul: i'm going to assign to you and you can take it back over until service/key signs! [20:44:01] 6operations, 10ops-codfw, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1882426 (10RobH) a:5RobH>3Papaul [20:44:01] robh: ok [20:44:09] sorry for confusion, i got sidetracked mid-task [20:44:09] 6operations, 10ops-codfw, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1882427 (10Ottomata) I think that will be fine, but be warned that I may need to delete the /srv partition and recreate JBOD for Kafka. TBD. / ext RAID10 across... [20:44:31] (03PS1) 10Dzahn: torrus: fix http(s) monitoring [puppet] - 10https://gerrit.wikimedia.org/r/259332 [20:44:33] I'll take back over the racking task and setup the ports [20:44:40] robh: ok [20:45:38] (03PS2) 10Dzahn: torrus: fix http(s) monitoring [puppet] - 10https://gerrit.wikimedia.org/r/259332 (https://phabricator.wikimedia.org/T119582) [20:46:08] 6operations, 10ops-codfw: db2034 host crashed; mgmt interface unavailable (needs reset and hw check) - https://phabricator.wikimedia.org/T117858#1882446 (10Papaul) 5Open>3Resolved I update the iLO Firmware from 1.5 to 2.30 on db20[3-4][0-9] . ssh is working now for those systems . I am closing this task.... 
[20:46:09] papaul: when you list off the ports, also list off the rack/row please for clarity reference: https://phabricator.wikimedia.org/T120885 [20:46:17] i have to look up each asset tag now and see what row it's in to confirm [20:46:27] (03PS3) 10Dzahn: torrus: fix http(s) monitoring [puppet] - 10https://gerrit.wikimedia.org/r/259332 (https://phabricator.wikimedia.org/T119582) [20:46:31] (not a big deal) [20:46:46] robh: ok [20:46:47] I imagine we'll have you doing this step soon enough ;] [20:47:00] (03CR) 10Dzahn: "root@neon:/usr/lib/nagios/plugins# ./check_http -H torrus.wikimedia.org -I misc-web-lb.wikimedia.org -S -u "/torrus" -s "Wiki"" [puppet] - 10https://gerrit.wikimedia.org/r/259332 (https://phabricator.wikimedia.org/T119582) (owner: 10Dzahn) [20:47:44] (03CR) 10Dzahn: [C: 032] "this adds a generic check command that can be used by others to check for strings on https URLs when behind misc-web" [puppet] - 10https://gerrit.wikimedia.org/r/259332 (https://phabricator.wikimedia.org/T119582) (owner: 10Dzahn) [20:49:48] 6operations, 10ops-codfw, 5Patch-For-Review: codfw: rack 8 new misc systems - https://phabricator.wikimedia.org/T120885#1882479 (10RobH) [20:49:50] 6operations, 10EventBus, 10MediaWiki-Cache, 6Performance-Team, and 2 others: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1882478 (10RobH) [20:50:10] 6operations, 10ops-codfw, 5Patch-For-Review: codfw: rack 8 new misc systems - https://phabricator.wikimedia.org/T120885#1864130 (10RobH) [20:50:12] 6operations, 10ops-codfw, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1882480 (10RobH) [20:50:54] kafka2001 port is done, working on 2002 now [20:51:12] 6operations, 10RESTBase: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#1882492 (10GWicke) 3NEW [20:53:53] 6operations, 10ops-codfw, 10EventBus, 10MediaWiki-Cache, 
and 2 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1882529 (10RobH) kafka2001 & kakfa2002 port descriptions and vlans updated. [20:54:02] 6operations, 10ops-codfw, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1882532 (10RobH) [20:54:11] robh: can you please double check i think those systems have 4TB and not 3TB [20:54:16] 6operations, 10RESTBase: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#1882535 (10GWicke) [20:54:21] papaul: hold up on the partitioning [20:54:28] we have an issue on mount points [20:54:40] ottomata: we prefer that we mount all our data on /srv, these dont do that? [20:54:48] 6operations, 10RESTBase: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#1882492 (10GWicke) [20:54:50] cmjohnson1: ^ this affects your install of kafka1001-1002 [20:54:58] robh, thus far, kafka and hadoop use jbod for data dirs [20:55:02] so it might not be raid [20:55:04] ahh, similar to swift then [20:55:11] well, should we just use two of the 4 and leave the other two? [20:55:25] k [20:55:28] i was just making assumptions about shit i didnt understand ;] [20:55:31] well, i have seen some recommendations to do raid though, and i think for these lower volume prod servers it might be good to [20:55:45] well, we like to raid the OS so a disk dying doesnt offline the system [20:55:46] and, raid10 on 4 drives might be nice :) [20:55:51] especially since there are only 2 brokers in this cluster [20:56:01] having some disk tolerance will be helpful i think [20:56:04] yes [20:56:06] def raid the os [20:56:21] Luke081515, https://phabricator.wikimedia.org/T109968 [20:56:22] well, i mean if you plan to set two of the 4 disks to JBOD for kafka [20:56:28] then a raid1 of the os on the other 2 seems more sane. 
[20:56:43] and changing this stuff later typically means reinstall so trying to save a wasted cycle [20:56:49] knowing that it has to be done and online this week or its halted =[ [20:56:58] i would say do / raid1(0) on 50G of 2 or 4, and then we'd figure out how to use the rest of the space on each of those 4 [20:57:03] i'd like to use those 4 [20:57:12] robh, for the kafka brokers, i was not able to get a sane partman recipe to work [20:57:14] ok, lemme take a look at potential raid recipes [20:57:27] robh [20:57:31] i'd say, for now [20:57:32] ottomata: oh, should we leave the install step to you and manually setup? [20:57:36] let's go with what you have [20:57:42] maybe with ext4 instead of xfs though [20:57:45] since that's what we use on the other brokers [20:57:55] if it turns out we have to change, i can do it manually, ja [20:58:29] hrmm, that means a new recipe [20:58:40] oh, for ext4? lemme look [20:58:44] its easy [20:58:54] im doing now =] [20:59:00] yeah, lets stick with it [20:59:04] ext4 [20:59:05] ok [21:00:49] (03PS1) 10Ottomata: Release 2.1.1-1 with support for librdkafka and python3 [debs/python-pykafka] (debian) - 10https://gerrit.wikimedia.org/r/259335 [21:00:57] 6operations, 10RESTBase-Cassandra: Update to Cassandra 2.1.12 - https://phabricator.wikimedia.org/T120803#1882606 (10GWicke) @JAllemandou: It will likely reduce compaction costs, but the effect should be smaller for you. Upgrading is a matter of apt-get install cassandra, leaving the config untouched. Keyspace... [21:01:05] ottomata: are they ext4 on all partitions on other brokers? [21:01:13] i'll change the ext3 os to ext4 if so [21:01:23] root is ext3 there, but i don't care [21:01:24] ext4 is fine [21:01:31] lets try ext4, see if we run into issues [21:01:32] ?
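In Debian preseed terms, the raid10-across-4-disks layout agreed above would look roughly like the sketch below. This is only an illustration of the partman-auto-raid recipe format, not the actual raid10-gpt-srv-ext4.cfg that RobH later pushed to operations/puppet; the partition sizes and member devices are assumptions.

```
# Hypothetical sketch only -- fields in each recipe line are:
# <raid type> <device count> <spare count> <filesystem> <mount point> <members> .
d-i partman-auto/method string raid
d-i partman-auto-raid/recipe string \
    10 4 0 ext4 /    /dev/sda1#/dev/sdb1#/dev/sdc1#/dev/sdd1 . \
    10 4 0 ext4 /srv /dev/sda2#/dev/sdb2#/dev/sdc2#/dev/sdd2 .
```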
[21:01:39] k, also, for ext4 robh [21:01:43] defaults,noatime,data=writeback,nobh,delalloc [21:01:55] for the /srv [21:02:05] in fstab [21:02:34] looking for where thats set i dont see it in other recipes [21:02:50] (03PS1) 10Thcipriani: Add 1.27.0-wmf.9 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259336 [21:02:52] (03PS1) 10Thcipriani: group0 to php-1.27.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259337 [21:04:28] (03PS1) 10RobH: adding a new gpt recipie for new kafka systems [puppet] - 10https://gerrit.wikimedia.org/r/259339 [21:04:29] There is a save timing regression, but since the train has already gone out to all wikis, probably no need to block the train. [21:04:30] papaul: so now its different, raid10-gpt-srv-ext4.cfg [21:04:37] you'll need to add a line for it in the netboot.cfg file [21:04:40] robh: ok [21:04:41] maybe the train will fix it, or make it worse, will monitor [21:04:45] im merging the new one live now [21:05:27] cool, thanks yalls [21:05:33] thx for the quick replies [21:05:34] (03PS2) 10RobH: adding a new gpt recipie for new kafka systems [puppet] - 10https://gerrit.wikimedia.org/r/259339 [21:05:54] (03CR) 10RobH: [C: 032] adding a new gpt recipie for new kafka systems [puppet] - 10https://gerrit.wikimedia.org/r/259339 (owner: 10RobH) [21:07:14] ok https://gerrit.wikimedia.org/r/#/c/259339/ is live [21:07:20] papaul you can reference it in your file [21:07:23] (03CR) 10Thcipriani: [C: 032] "Train" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259336 (owner: 10Thcipriani) [21:07:27] robh: thanks [21:07:49] (03Merged) 10jenkins-bot: Add 1.27.0-wmf.9 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259336 (owner: 10Thcipriani) [21:08:43] (03CR) 10RobH: [C: 032] Add hostname mgmt DNS for kafka200[1-2] Add production DNS for kafka200[1-2] Bug:T121558 [dns] - 10https://gerrit.wikimedia.org/r/259327 (https://phabricator.wikimedia.org/T121558) (owner: 10Papaul) [21:09:08] and dns changes 
are live for kafka2001-2002 [21:09:20] ok, gotta run down the block and snag laundry, brb [21:09:49] !log thcipriani@tin Started scap: testwiki to php-1.27.0-wmf.9 and rebuild l10ncache [21:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:22:29] 6operations, 7HTTPS, 5Patch-For-Review: move torrus behind misc-web - https://phabricator.wikimedia.org/T119582#1882743 (10Dzahn) fixed: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=netmon1001&service=torrus.wikimedia.org+UI [21:22:46] 6operations, 7HTTPS: move torrus behind misc-web - https://phabricator.wikimedia.org/T119582#1882744 (10Dzahn) [21:24:19] 6operations, 10hardware-requests: eqiad: (2) servers request for ORES - https://phabricator.wikimedia.org/T119598#1882758 (10Cmjohnson) [21:24:46] 6operations, 10ops-eqiad, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka1001 & kafka1002 - https://phabricator.wikimedia.org/T121553#1882763 (10Cmjohnson) [21:25:21] (03PS1) 10Papaul: Add kafka200[1-2] partition entrie Bug:T121558 [puppet] - 10https://gerrit.wikimedia.org/r/259391 (https://phabricator.wikimedia.org/T121558) [21:29:27] 6operations, 10EventBus, 10MediaWiki-Cache, 6Performance-Team, and 2 others: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1882786 (10RobH) [21:29:30] 6operations, 10ops-codfw, 5Patch-For-Review: codfw: rack 8 new misc systems - https://phabricator.wikimedia.org/T120885#1882781 (10RobH) 5Open>3Resolved All of these network ports have been updated with the descriptions, resolving. 
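The mount options ottomata dictated earlier (defaults,noatime,data=writeback,nobh,delalloc for /srv) would land in /etc/fstab roughly as below. The md device name is a placeholder, since the real array layout comes from the new partman recipe.

```
# Illustrative fstab entry for the kafka data partition; /dev/md1 is an
# assumed name for whatever device the raid10 /srv array ends up on.
/dev/md1  /srv  ext4  defaults,noatime,data=writeback,nobh,delalloc  0  2
```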
[21:35:53] (03PS1) 10Papaul: Add MAC address entries for kafak200[1-2] Bug:T121558 [puppet] - 10https://gerrit.wikimedia.org/r/259398 (https://phabricator.wikimedia.org/T121558) [21:37:44] (03PS1) 10Ottomata: Disable etcd in labs eventlogging for now [puppet] - 10https://gerrit.wikimedia.org/r/259399 [21:39:04] (03PS1) 10Cmjohnson: Adding dns entries for kafka1001/2 and ores1001/2 [dns] - 10https://gerrit.wikimedia.org/r/259401 [21:39:36] (03CR) 10Ottomata: [C: 032] Disable etcd in labs eventlogging for now [puppet] - 10https://gerrit.wikimedia.org/r/259399 (owner: 10Ottomata) [21:39:47] !log thcipriani@tin Finished scap: testwiki to php-1.27.0-wmf.9 and rebuild l10ncache (duration: 29m 58s) [21:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:40:59] (03PS2) 10Papaul: Add MAC address entries for kafka200[1-2] Bug:T121558 [puppet] - 10https://gerrit.wikimedia.org/r/259398 (https://phabricator.wikimedia.org/T121558) [21:45:23] 6operations, 10ops-codfw, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1882829 (10Papaul) We discussed this on IRC that the following RAID level need to be used raid10-gpt-srv-ext4.cfg [21:46:04] 6operations, 10ops-codfw, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1882830 (10Papaul) [21:47:22] (03CR) 10Thcipriani: [C: 032] "Train" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259337 (owner: 10Thcipriani) [21:50:28] sigh, zuul not picking this up for some reason. [21:50:34] (03Merged) 10jenkins-bot: group0 to php-1.27.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259337 (owner: 10Thcipriani) [21:50:40] oh, nevermind. 
[21:53:11] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.27.0-wmf.9 [21:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:53:33] 1.27.0-wmf.9 group0 Train complete. [21:55:09] whoa, ton of these in the logs: Notice: Undefined variable: wgExtractsRemoveClasses in /srv/mediawiki/wmf-config/CommonSettings.php on line 2153 [21:56:34] Ugh. [21:56:55] extensions conversion? [21:56:58] That'll be https://gerrit.wikimedia.org/r/#/c/258065/ [21:56:58] Yes [21:57:00] legoktm, ^ [21:57:08] (03PS2) 10Cmjohnson: Adding dns entries for kafka1001/2 and ores1001/2 [dns] - 10https://gerrit.wikimedia.org/r/259401 [21:57:34] lol [21:59:34] from -releng: [21:59:38] Dec 11 00:30:05 what [21:59:38] Dec 11 00:30:42 that warning comes up every time you run eval.php [21:59:38] Dec 11 00:31:19 $wgExtractsRemoveClasses = array_merge( $wgExtractsRemoveClasses, $wmgExtractsRemoveClasses ); [22:00:07] um, sorry, warning was: Dec 11 00:30:04 Warning: array_merge(): Argument #1 is not an array in /mnt/srv/mediawiki/wmf-config/CommonSettings.php on line 2137 deployment-db2 [22:00:14] (03CR) 10Cmjohnson: [C: 032] Adding dns entries for kafka1001/2 and ores1001/2 [dns] - 10https://gerrit.wikimedia.org/r/259401 (owner: 10Cmjohnson) [22:00:16] Dec 11 00:43:36 Krenair: ohhhh, that's me. I think I just broke that. [22:00:16] Dec 11 00:44:18 Did you just change it to extension registration? 
[22:00:16] Dec 11 00:45:11 Yep: https://gerrit.wikimedia.org/r/#/c/258065/ [22:00:16] Dec 11 00:45:12 gj [22:00:16] Dec 11 00:45:18 better fix it before the next branch cut [22:01:11] 6operations: setup/deploy einsteinium as monitoring host - https://phabricator.wikimedia.org/T121582#1882881 (10RobH) 3NEW a:3RobH [22:03:20] 6operations, 10hardware-requests: EQIAD/CODFW: 2 hardware access request for monitoring - https://phabricator.wikimedia.org/T120842#1882893 (10RobH) So of the new spares, I've allocated WMF6381 for this task. [22:04:22] (03PS3) 10Rush: Set idle_timelimit for nslcd [puppet] - 10https://gerrit.wikimedia.org/r/259256 (owner: 10Muehlenhoff) [22:05:05] 6operations: setup/deploy WMF6381 as monitoring host - https://phabricator.wikimedia.org/T121583#1882900 (10RobH) 3NEW a:3RobH [22:05:35] !log restbase: canary deploy of 3b7ae07e to restbase1001 [22:05:38] (03CR) 10Rush: [C: 031] "ran this on a VM for most of the day, no occurrences after hotfixing. seems legit." [puppet] - 10https://gerrit.wikimedia.org/r/259256 (owner: 10Muehlenhoff) [22:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:05:54] 6operations, 10hardware-requests: EQIAD/CODFW: 2 hardware access request for monitoring - https://phabricator.wikimedia.org/T120842#1882912 (10RobH) 5Open>3Resolved As both of these have been approved, tasks T121582 and T121583 have been created for installations. Resolving this request. [22:06:38] 6operations, 10hardware-requests: EQIAD/CODFW: 2 hardware access request for monitoring - https://phabricator.wikimedia.org/T120842#1882919 (10RobH) [22:06:43] /quit [22:07:39] !log restbase: starting full deploy of 3b7ae07e to restbase prod cluster [22:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:08:45] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). 
[22:12:19] hmm, seems sync-wikiversions didn't sync with mira. [22:12:36] did it error? [22:12:49] run it again with verbose? [22:13:07] it did not error. [22:13:33] (03PS1) 10RobH: setting einsteinium.wikimedia.org dns entry [dns] - 10https://gerrit.wikimedia.org/r/259407 [22:13:49] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.27.0-wmf.9 mira test [22:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:14:09] !log restbase: finished full deploy of 3b7ae07e to restbase prod cluster [22:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:14:32] (03CR) 10RobH: [C: 032] setting einsteinium.wikimedia.org dns entry [dns] - 10https://gerrit.wikimedia.org/r/259407 (owner: 10RobH) [22:15:16] Reedy: running with -v didn't give any more info, also mira is still not up-to-date :\ [22:15:44] mm [22:16:08] bd808: is it -v or --verbose? [22:16:26] seems to be stuck at the last scap I ran (to update wikiversions for testwiki only). [22:16:47] I think both? but --verbose for sure (for commands that actually support it) [22:17:03] https://github.com/wikimedia/mediawiki-tools-scap/blob/master/scap/cli.py#L133 [22:17:47] if mira is stuck [22:19:32] thcipriani: I see the problem. sync-wikiversions is "special" and overrides AbstractSync::main [22:19:45] so it doesn't sync-master? [22:19:47] so it doesn't sync with the co-masters [22:19:51] yeah, seems like it just needs to concat the master list [22:19:53] boooooo [22:20:03] ottomata: do you want trusty or jessie on kafka1001/2?
[22:22:37] hmmm, but it seems like it should still be using AbstractSync::_get_target_list which does include the master list :\ [22:22:41] * thcipriani makes task [22:22:58] (03PS1) 10RobH: setting einsteinium install params [puppet] - 10https://gerrit.wikimedia.org/r/259409 [22:24:01] (03CR) 10RobH: [C: 032] setting einsteinium install params [puppet] - 10https://gerrit.wikimedia.org/r/259409 (owner: 10RobH) [22:26:17] (03PS3) 10RobH: Add MAC address entries for kafka200[1-2] Bug:T121558 [puppet] - 10https://gerrit.wikimedia.org/r/259398 (https://phabricator.wikimedia.org/T121558) (owner: 10Papaul) [22:27:43] anyone willing to make copies of /var/log/upstart/jobchron.log* into my home directory on some mediawiki job runner? Apparently this is where the logs for the service that undelays jobs in the job queue go, but the files are root:root and 0640 [22:28:51] (03PS1) 10Yuvipanda: sentry: Stop using LDAP variables for hostname matching [puppet] - 10https://gerrit.wikimedia.org/r/259412 (https://phabricator.wikimedia.org/T101447) [22:28:59] tgr: ^ just a fyi [22:29:08] I'll babysit to make sure it's a noop [22:29:34] AaronSchulz: willing to make copies of /var/log/upstart/jobchron.log* into my home directory on some mediawiki job runner? Apparently this is where the logs for the service that undelays jobs in the job queue go, but the files are root:root and 0640. I'm trying to debug several thousand cirrussearch jobs currently marked as abandoned [22:29:46] ebernhardson: if nobody gets to you in like, 10 minutes I can do that [22:29:51] ShiveringPanda: thanks [22:32:07] (03CR) 10Yuvipanda: [C: 032] sentry: Stop using LDAP variables for hostname matching [puppet] - 10https://gerrit.wikimedia.org/r/259412 (https://phabricator.wikimedia.org/T101447) (owner: 10Yuvipanda) [22:33:02] Iterator page I/O error: An unknown error occurred in storage backend "global-swift-eqiad". [22:33:23] ebernhardson: can you give me a specific job runner?
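The failure mode bd808 describes (sync-wikiversions overriding AbstractSync::main and thereby skipping the co-master step) can be shown with a toy sketch. This is illustrative Python only, not scap's actual class layout or method names:

```python
# Toy illustration of the bug discussed above -- NOT scap's real code.
# AbstractSync.main() syncs co-masters (e.g. mira) before the app servers;
# a subclass that replaces main() wholesale silently drops that step.
class AbstractSync:
    def __init__(self):
        self.synced = []

    def main(self):
        self._sync_masters()   # keep mira in step with tin
        self._sync_apaches()   # then push to the app servers

    def _sync_masters(self):
        self.synced.append("mira")

    def _sync_apaches(self):
        self.synced.append("mw*")


class SyncWikiversions(AbstractSync):
    # "special": overrides main() and forgets the co-master step
    def main(self):
        self._sync_apaches()


if __name__ == "__main__":
    job = SyncWikiversions()
    job.main()
    print(job.synced)  # ['mw*'] -- mira never gets the new wikiversions
```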
[22:33:25] last hit at 2015-12-15T19:38:34.000Z okay [22:33:39] chasemp: mw1001? i don't know if that will have the files i need because i can't read the logs :) [22:33:51] * Krinkle changes mediawiki-errors dashboard in kibana back to defaulting to 15min ago [22:34:02] but its a start :) i'm working up a patch to put these logs in /var/log/mediawiki with the regular jobrunner logs [22:34:42] (03PS4) 10RobH: Add MAC address entries for kafka200[1-2] Bug:T121558 [puppet] - 10https://gerrit.wikimedia.org/r/259398 (https://phabricator.wikimedia.org/T121558) (owner: 10Papaul) [22:35:14] http://bots.wmflabs.org/dump/%23wikimedia-operations.htm [22:35:14] @info 10.64.16.157 [22:35:18] chasemp: thanks [22:35:44] chasemp: oh but can you chown them to me so i can read them :) [22:35:56] already done :) [22:35:57] (03PS1) 10Cmjohnson: Adding dhcpd entries for kafka1001/2 and ores1001/2 [puppet] - 10https://gerrit.wikimedia.org/r/259416 [22:36:01] you were spying too early mate [22:36:04] chasemp: perfect, thanks!
[22:36:42] (03PS5) 10RobH: adding install params for kafka200[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/259398 (https://phabricator.wikimedia.org/T121558) (owner: 10Papaul) [22:36:45] (03CR) 10Cmjohnson: [C: 032] Adding dhcpd entries for kafka1001/2 and ores1001/2 [puppet] - 10https://gerrit.wikimedia.org/r/259416 (owner: 10Cmjohnson) [22:37:49] (03CR) 10RobH: [C: 032] adding install params for kafka200[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/259398 (https://phabricator.wikimedia.org/T121558) (owner: 10Papaul) [22:38:10] (03Abandoned) 10RobH: Add kafka200[1-2] partition entrie Bug:T121558 [puppet] - 10https://gerrit.wikimedia.org/r/259391 (https://phabricator.wikimedia.org/T121558) (owner: 10Papaul) [22:38:24] 6operations, 10DBA, 7Wikimedia-log-errors: Job runners throwing "Can't connect to MySQL server" - mainly on db1035 - https://phabricator.wikimedia.org/T107072#1883047 (10Krinkle) [22:38:37] (03PS6) 10RobH: adding install params for kafka200[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/259398 (https://phabricator.wikimedia.org/T121558) (owner: 10Papaul) [22:38:56] (03CR) 10RobH: [C: 032] adding install params for kafka200[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/259398 (https://phabricator.wikimedia.org/T121558) (owner: 10Papaul) [22:40:24] 6operations, 10DBA, 7Wikimedia-log-errors: Job runners throwing "Can't connect to MySQL server" - mainly on db1035 - https://phabricator.wikimedia.org/T107072#1485888 (10Krinkle) Also happening about once every few minutes for parsercache servers Last 3 hours in requests to /rpc/RunJobs.php on mediawiki-er... [22:42:21] thcipriani: Is the train done? Need to backport a patch to unbreak VE for mobile users in wmf.8 [22:42:34] Krinkle: the train is complete. [22:42:39] Thanks [22:43:23] aude: I expect to be done before you start, but just FYI, I'm hotpatching CX to unbreak VE on mobile. 
[22:44:51] chasemp: any idea what could be wrong with https://gerrit.wikimedia.org/r/#/c/259412/ [22:45:02] it's causing the value to be written out verbatim instead of interpolated [22:45:29] yeah hang on a sec [22:45:40] 6operations: setup/deploy einsteinium as monitoring host - https://phabricator.wikimedia.org/T121582#1883056 (10RobH) [22:46:48] ShiveringPanda: in the spirit of puppet is insane [22:46:56] the var leadin is % not $ [22:46:57] for hiera [22:47:07] "%{::ipaddress_eth0}" [22:47:14] (03PS1) 10Yuvipanda: ssh: Disallow optional X forwarding [puppet] - 10https://gerrit.wikimedia.org/r/259420 (https://phabricator.wikimedia.org/T101447) [22:47:17] ofcourse [22:47:50] (03PS1) 10Yuvipanda: sentry: Use correct incantation to summon the devil [puppet] - 10https://gerrit.wikimedia.org/r/259421 [22:47:50] thanks chasemp [22:48:00] (03PS2) 10Yuvipanda: ssh: Disallow optional X forwarding [puppet] - 10https://gerrit.wikimedia.org/r/259420 (https://phabricator.wikimedia.org/T101447) [22:48:28] (03CR) 10Yuvipanda: [C: 032 V: 032] ssh: Disallow optional X forwarding [puppet] - 10https://gerrit.wikimedia.org/r/259420 (https://phabricator.wikimedia.org/T101447) (owner: 10Yuvipanda) [22:49:26] (03PS2) 10Yuvipanda: sentry: Use correct incantation to summon the devil [puppet] - 10https://gerrit.wikimedia.org/r/259421 [22:49:41] (03CR) 10Yuvipanda: [C: 032 V: 032] sentry: Use correct incantation to summon the devil [puppet] - 10https://gerrit.wikimedia.org/r/259421 (owner: 10Yuvipanda) [23:00:05] aude: Respected human, time to deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151215T2300). Please do the needful. [23:01:27] Krinkle: no hurry [23:01:41] CX tests are really slow because they include all of Scribunto for some reason. 
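The gotcha chasemp points out: hiera's interpolation token is %{...}, not Puppet's $, so a $-prefixed variable pasted into hieradata is written out verbatim instead of being resolved. A minimal hieradata sketch (the key names here are made up):

```yaml
# hieradata sketch -- key names are hypothetical
broken_host:  "$::ipaddress_eth0"    # copied through as a literal string
working_host: "%{::ipaddress_eth0}"  # hiera resolves the eth0 address fact
```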
[23:03:08] hey bblack https://en.m.wikipedia.org/wiki/Wikipedia:Featured_picture_candidates/Peacock_butterfly looks stale while https://en.m.wikipedia.org/wiki/Wikipedia:Featured_picture_candidates/Peacock_butterfly does not. any insight? [23:03:31] oops, i meant while desktop's https://en.wikipedia.org/wiki/Wikipedia:Featured_picture_candidates/Peacock_butterfly does not look stale [23:03:45] bblack: ie mobile webpage stale, desktop not stale [23:10:25] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [23:11:40] I synced to wmf.9 [23:11:45] waiting for wmf.8 merge to sync there [23:11:48] Not sure why it didn't log [23:12:14] 6operations: setup/deploy einsteinium as monitoring host - https://phabricator.wikimedia.org/T121582#1883183 (10RobH) [23:12:33] (03PS1) 10Cmjohnson: Adding new kafka1001/2 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/259423 [23:12:35] Krinkle: there are quite a lot of notices and warnings in the hhvm log [23:12:38] like Notice: Undefined variable: wgExtractsRemoveClasses in /srv/mediawiki/wmf-config/CommonSettings.php on line 2153 [23:12:51] is that related to what you are fixing? 
[23:13:15] !log Synchronized php-1.27.0-wmf.9/extensions/ContentTranslation/extension.json 'T121308 - unbreak VE' [23:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:13:20] aude: No, I'm not touching wmf-config [23:13:24] aude: It's a known issue [23:13:38] I don't know if legoktm was going to fix it [23:13:43] Reedy: ok [23:13:44] (fallout from converting an extension) [23:14:02] it's quite a lot of spam, but as long as it's known and not horribly breaking anything [23:14:38] (03CR) 10Cmjohnson: [C: 032] Adding new kafka1001/2 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/259423 (owner: 10Cmjohnson) [23:15:11] i don't see a task for this [23:16:14] there https://phabricator.wikimedia.org/T121592 [23:16:14] I'm sure there was one [23:16:20] couldnt find it [23:16:23] It was linked earlier [23:16:42] ok if someone finds it and marks this as duplicate [23:17:27] aude: Done :) [23:17:44] Krinkle: ok, thanks [23:19:18] (03CR) 10Aude: [C: 032] Enable Wikidata data access for meta-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259218 (https://phabricator.wikimedia.org/T117524) (owner: 10Aude) [23:19:36] 6operations, 10MobileFrontend: Stale copy of Wikipedia:Featured picture candidates/Peacock butterfly - https://phabricator.wikimedia.org/T121594#1883261 (10dr0ptp4kt) [23:22:11] (03Merged) 10jenkins-bot: Enable Wikidata data access for meta-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259218 (https://phabricator.wikimedia.org/T117524) (owner: 10Aude) [23:24:19] !log aude@tin Synchronized dblists/arbitraryaccess.dblist: Enabling Wikidata data access for meta-wiki (duration: 00m 30s) [23:24:22] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, and 2 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1883271 (10Milimetric) @Dzahn & @Joe: Yes, this instance will need a database. 
And I agree with @Joe that if any... [23:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:25:40] !log aude@tin Synchronized wmf-config/InitialiseSettings-labs.php: Enabling Wikidata data access for meta-wiki (duration: 00m 29s) [23:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:26:27] !log aude@tin Synchronized wmf-config/InitialiseSettings.php: Enabling Wikidata data access for meta-wiki (duration: 00m 30s) [23:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:26:38] 6operations: setup/deploy einsteinium as monitoring host - https://phabricator.wikimedia.org/T121582#1883273 (10RobH) [23:27:05] 6operations: setup/deploy einsteinium as monitoring host - https://phabricator.wikimedia.org/T121582#1883283 (10RobH) a:5RobH>3akosiaris Assigning to Alex for service implementation. [23:27:39] done :) [23:27:43] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, and 2 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1883286 (10yuvipanda) This should probably be isolated in its own vlan or some such as well, since I strongly sus... [23:32:03] (03PS1) 10Legoktm: Fix merging of $wgExtractsRemoveClasses post-extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259425 (https://phabricator.wikimedia.org/T121592) [23:43:26] (03CR) 10Gergő Tisza: "Thanks! That's certainly more convenient." [puppet] - 10https://gerrit.wikimedia.org/r/259412 (https://phabricator.wikimedia.org/T101447) (owner: 10Yuvipanda) [23:43:59] (03CR) 10Yuvipanda: "yw!" 
[puppet] - 10https://gerrit.wikimedia.org/r/259412 (https://phabricator.wikimedia.org/T101447) (owner: 10Yuvipanda) [23:55:15] (03CR) 10Krinkle: [C: 031] Fix merging of $wgExtractsRemoveClasses post-extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259425 (https://phabricator.wikimedia.org/T121592) (owner: 10Legoktm)