[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170201T0000). Please do the needful. [00:00:05] matt_flaschen: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:22] can do [00:00:38] matt_flaschen, yt? [00:00:45] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 630 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3220186 keys, up 92 days 15 hours - replication_delay is 630 [00:00:45] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3215623 keys, up 92 days 15 hours - replication_delay is 0 [00:01:27] MaxSem, present. [00:01:45] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3215742 keys, up 92 days 15 hours - replication_delay is 0 [00:02:24] (03PS1) 10Dzahn: move install1002 to lower free IP address [dns] - 10https://gerrit.wikimedia.org/r/335379 (https://phabricator.wikimedia.org/T132757) [00:02:53] matt_flaschen, I'm also here. [00:02:57] Commit [00:02:57] [00:03:04] sorry, bad paste [00:03:52] so the dependency is already live? [00:04:30] (03CR) 10Dzahn: "thanks! following up to use a lower free IP actually" [dns] - 10https://gerrit.wikimedia.org/r/335376 (https://phabricator.wikimedia.org/T156440) (owner: 10Dzahn) [00:04:39] Thanks, foks [00:04:48] * foks is Joe, fyi [00:04:54] MaxSem, it should be, double-checking. [00:06:50] MaxSem, yes, https://gerrit.wikimedia.org/r/#/c/335263/ is merged, MW.org is on 1.29.0-wmf.10, looks good. [00:07:44] (03CR) 10MaxSem: [C: 032] CentralNotice config: make mediawiki its own CN project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334025 (https://phabricator.wikimedia.org/T155997) (owner: 10AndyRussG) [00:07:57] (03PS4) 10MaxSem: CentralNotice config: make mediawiki its own CN project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334025 (https://phabricator.wikimedia.org/T155997) (owner: 10AndyRussG) [00:08:21] (03CR) 10MaxSem: CentralNotice config: make mediawiki its own CN project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334025 (https://phabricator.wikimedia.org/T155997) (owner: 10AndyRussG) [00:08:27] (03CR) 10MaxSem: [C: 032] CentralNotice config: make mediawiki its own CN project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334025 (https://phabricator.wikimedia.org/T155997) (owner: 10AndyRussG) [00:08:36] MaxSem: :) [00:09:30] (03Merged) 10jenkins-bot: CentralNotice config: make mediawiki its own CN project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334025 (https://phabricator.wikimedia.org/T155997) (owner: 10AndyRussG) [00:09:39] (03CR) 10jenkins-bot: CentralNotice config: make mediawiki its own CN project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334025 (https://phabricator.wikimedia.org/T155997) (owner: 10AndyRussG) [00:11:13] matt_flaschen, AndyRussG, foks - live on mwdebug1002, please test [00:12:12] matt_flaschen: I just saw that meta is in group 1, not group 0. So the corresponding change isn't available yet for targeting campaigns [00:12:32] i.e.
for one day, no campaigns will go out to mediawiki.org [00:12:39] not working here, indeed [00:12:57] Dunno why, I was thinking that meta was in group 0 [00:13:19] AndyRussG, oh, darn. I thought it was enough to have the target wiki be group 0. [00:13:25] I forgot about the whole Central part, since it's so subtle. :( [00:13:34] heh [00:13:46] matt_flaschen: MaxSem: until meta is updated to wmf10, there'll be no way to test [00:13:58] Oh, that might be why it's not working [00:13:59] deploy to meta? [00:14:24] MaxSem: well, what needs to go to meta is a different change, that went out on the train today [00:14:27] We could backport it [00:14:30] It's actually trivial [00:15:17] (03PS1) 10Dzahn: planet: use one line per include, full class names [puppet] - 10https://gerrit.wikimedia.org/r/335381 [00:15:18] MaxSem: all you'd have to do is move the pointer for the CN submodule to 7c0c4932dc410d9809cfc22725b3985639f961a2 [00:15:23] AndyRussG, it's up to you. I will be around to help monitor, if you're comfortable. Let's check if it requires a scap or something, though. [00:15:44] (i.e., the most recent commit on CN's wmf_deploy branch) [00:15:46] AndyRussG, I'd rather deploy an explicit gerrit commit [00:15:54] (03CR) 10Dzahn: [C: 032] move install1002 to lower free IP address [dns] - 10https://gerrit.wikimedia.org/r/335379 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [00:15:58] MaxSem: yes one sec [00:16:16] AndyRussG, when you do that, can you also check if there are i18n changes? [00:17:08] matt_flaschen: there are none [00:17:20] AndyRussG, great, then we can just do sync-dir. [00:17:45] Yeah... I mean, the only change is in extension.json, which I assume has nothing special about it [00:18:09] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2988550 (10Tgr) >>! In T66214#2982093, @Fjalapeno wrote: > One piece of feedback we have been discussing is having a higher level abstraction in additi... [00:18:47] (03CR) 10Dzahn: "install1001.wikimedia.org has address 208.80.154.83" [dns] - 10https://gerrit.wikimedia.org/r/335379 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [00:19:42] (03CR) 10Dzahn: [C: 032] planet: use one line per include, full class names [puppet] - 10https://gerrit.wikimedia.org/r/335381 (owner: 10Dzahn) [00:21:23] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2988556 (10GWicke) >>! In T66214#2988550, @Tgr wrote: >>>! In T66214#2982093, @Fjalapeno wrote: >> One piece of feedback we have been discussing is hav... [00:23:15] PROBLEM - puppet last run on planet1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:24:00] (03PS1) 10Dzahn: wikimania_scholarships, planet: one line per include, full class names [puppet] - 10https://gerrit.wikimedia.org/r/335382 [00:24:45] (03CR) 10Dzahn: [C: 032] wikimania_scholarships, planet: one line per include, full class names [puppet] - 10https://gerrit.wikimedia.org/r/335382 (owner: 10Dzahn) [00:24:58] MaxSem: having some silly issues w/ submodule update... This is the wmf_deploy commit in CN.
Unless you're in a rush, I'll hack at it a bit more still [00:25:02] https://gerrit.wikimedia.org/r/#/c/335263/ [00:27:01] Hmmm wmf.9 I think maybe already updated automatically [00:27:15] lemme check [00:27:26] MaxSem: what is the commit head for wmf.9 CN submodule where you are? [00:27:51] If it's 7c0c4932dc410d9809cfc22725b3985639f961a2 we're good to go [00:27:52] (03CR) 10Dzahn: "planet and wikimania_s done - rebased" [puppet] - 10https://gerrit.wikimedia.org/r/334322 (owner: 10Juniorsys) [00:28:00] (03PS4) 10Dzahn: Puppet style: Use one line per include/require [puppet] - 10https://gerrit.wikimedia.org/r/334322 (owner: 10Juniorsys) [00:28:01] (that is, that's what we'd need to deploy) [00:28:15] RECOVERY - puppet last run on planet1001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [00:28:24] "Add mediawiki to list of wikis" [00:28:36] yep. updated and pulled on mwdebug1002 again [00:28:43] woohoo! [00:29:15] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:30:11] MaxSem: K so far so good... CN change is visible there! [00:30:23] cool. sync? [00:30:25] (and works as advertised) [00:30:27] Yep pls [00:31:01] matt_flaschen: we'll need to update all campaigns that were previously targeting the "wikimedia" CN project to add the "mediawiki" project [00:32:07] !log maxsem@tin Synchronized php-1.29.0-wmf.9/extensions/CentralNotice/: https://gerrit.wikimedia.org/r/#/c/335263/ (duration: 00m 58s) [00:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:32:37] AndyRussG, I meant whether there were any unrelated i18n changes in the meantime, not in this commit. [00:33:12] MaxSem, AndyRussG, do you know what commit 1.29.0-wmf.9 was at before? [00:33:38] AndyRussG, is there a script to do that or at least a way to query it (the campaigns)? [00:33:39] matt_flaschen: it's cherry-picked... No, we did a CN update recently [00:33:45] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/334025/ (duration: 00m 40s) [00:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:48] (03CR) 10Dzahn: "@Legoktm also see hashar's comments on https://gerrit.wikimedia.org/r/#/c/334309/ :)" [puppet] - 10https://gerrit.wikimedia.org/r/334284 (owner: 10Juniorsys) [00:33:55] it was 24e8419 [00:34:07] matt_flaschen: we'll just look at https://meta.wikimedia.org/wiki/Special:CentralNotice [00:34:10] (03CR) 10Dzahn: "@hashar also see Legoktm's comments on https://gerrit.wikimedia.org/r/#/c/334309/ :)" [puppet] - 10https://gerrit.wikimedia.org/r/334309 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [00:34:10] please test ^^^ [00:34:44] MaxSem, seems to be working [00:34:49] (03CR) 10Dzahn: "wrong link. on https://gerrit.wikimedia.org/r/#/c/334284/6" [puppet] - 10https://gerrit.wikimedia.org/r/334309 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [00:34:51] in the list, anyway [00:34:52] wee [00:34:58] one second and I will see if it works properly [00:35:16] MaxSem: matt_flaschen: yeah there are no automatic updates to the wmf_deploy branch. The last commit on that branch was another merge from master to deploy that went on last week's train [00:35:18] AndyRussG, looks good (only extension.json was affected, thanks). [00:35:33] MaxSem: did the config change go out too? [00:35:43] yep^ [00:36:05] K...
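AndyRussG's "move the pointer" suggestion above amounts to checking out the target commit inside the submodule and committing the new pointer in core. A minimal sketch, assuming a local checkout of the core 1.29.0-wmf.9 branch (the commit message is illustrative, and the real change of course went through Gerrit rather than a direct push):

    # What the CN submodule currently points at (answers "what is the
    # commit head for wmf.9 CN submodule where you are?").
    git submodule status extensions/CentralNotice

    # Move the pointer to the head of CN's wmf_deploy branch.
    cd extensions/CentralNotice
    git fetch origin
    git checkout 7c0c4932dc410d9809cfc22725b3985639f961a2
    cd ../..

    # Record the new pointer as a commit on the core branch.
    git add extensions/CentralNotice
    git commit -m "Update CentralNotice to current wmf_deploy HEAD"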
[00:36:13] (03PS5) 10Dzahn: jupyterhub/keyholder: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334287 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [00:36:15] Lemme try updating at one campaign [00:36:21] matt_flaschen: there aren't that many active ones [00:36:27] (03PS6) 10Dzahn: jupyterhub/keyholder: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334287 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [00:36:33] (03CR) 10Dzahn: [C: 032] jupyterhub/keyholder: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334287 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [00:37:21] I am using the campaign on tech coc [00:37:29] AndyRussG, ^ [00:37:45] Oh, I missed a checkbox [00:37:51] foks: MaxSem matt_flaschen also note that changes to a campaign may take up to 10 minutes to go live because caching [00:37:52] that explains why it wasn't working... [00:38:01] Yeah, I was also really dumb [00:38:10] It should be working once cache catches up [00:40:27] MaxSem, works! [00:40:33] \m/ [00:40:43] (03CR) 10Dzahn: [C: 032] "no-op http://puppet-compiler.wmflabs.org/5298/" [puppet] - 10https://gerrit.wikimedia.org/r/334279 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [00:40:53] (03PS6) 10Dzahn: dnsrecursor: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334279 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [00:42:31] MaxSem: the config change is out everywhere, right? [00:42:43] yes [00:43:57] K [00:43:59] thx [00:45:18] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/5299/" [puppet] - 10https://gerrit.wikimedia.org/r/334278 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [00:45:27] (03PS6) 10Dzahn: deployment: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334278 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [00:46:33] foks: the coc campaign isn't enabled. Also, do you need 100% sample rate for recording impressions? Normally "Legacy hiding and impression counting support" should be off [00:46:54] (unless the banner contains special code to hide/show itself under unique conditions) [00:46:57] AndyRussG, I enabled it briefly to test. [00:47:09] and yes, I'm not sure who enabled that. [00:47:11] foks: ah OK... That's why I only saw it once! [00:47:20] I have impression diet on, which should be good enough [00:47:38] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2988603 (10jcrespo) 36 pending hosts: ``` db1030.eqiad.wmnet: NULL db1045.eqiad.wmnet: NULL db1020.eqiad.wmnet: NULL db1001.eqiad.wmnet: NULL db1039.eqiad.wmnet: NULL db1026.eqiad.wmnet: NU... [00:47:43] Though you are right, it probably doesn't need to be 100%. [00:47:45] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:48:06] foks: K! So I'll leave it disabled but will remove the full impression sampling. The default is what's normally used for community campaigns (1% sample rate) [00:48:53] 1%? [00:49:11] that's quite a bit lower than usually used on CN [00:49:21] anyway, this isn't really a discussion for here: ) [00:50:05] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2641.90 Read Requests/Sec=1774.90 Write Requests/Sec=3.30 KBytes Read/Sec=29320.80 KBytes_Written/Sec=66.80 [00:50:29] I'll follow up with you guys. 
Thanks AndyRussG, foks, MaxSem [00:50:39] foks: every sample recording involves an extra webrequest. You can get quite good data from 1% if u need it. Only Fundraising, which runs short A-B tests to verify the effectiveness of different banners, usually needs 100%. If you need more than 1% pls reach out, it's probably fine if justifiable... Seddon is the one to talk to :) [00:50:45] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 610 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3217611 keys, up 92 days 16 hours - replication_delay is 610 [00:50:45] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 610 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3217784 keys, up 92 days 16 hours - replication_delay is 610 [00:51:01] sure! [00:51:06] thx! [00:51:11] I've just never heard of using 1% before [00:51:20] anyway... thanks all! [00:51:37] matt_flaschen: foks: MaxSem all good then! I have to be away from the keyboard for about 20 min. When I come back, I'll fix other campaigns that were previously targeting mediawiki.org via the old way [00:51:45] AndyRussG, thanks! [00:52:03] matt_flaschen: yw, and thank u also 4 working on this :D [00:52:05] bassoon [00:52:05] AndyRussG, the 1% is just for analytics, right? Everyone will still see it, except for impression diet? [00:52:13] matt_flaschen: correct [00:52:23] The impression diet is what controls who sees it [00:52:34] AndyRussG, great, thank you again. [00:52:34] (03PS3) 10Dzahn: ganglia: display deprecation banner [puppet] - 10https://gerrit.wikimedia.org/r/331097 (https://phabricator.wikimedia.org/T145659) (owner: 10Filippo Giunchedi) [00:52:41] (03CR) 10Dzahn: [C: 032] ganglia: display deprecation banner [puppet] - 10https://gerrit.wikimedia.org/r/331097 (https://phabricator.wikimedia.org/T145659) (owner: 10Filippo Giunchedi) [00:53:15] The impression recording is an extra data point sent back for every user selected in a campaign (or a % of them, based on sample rate) to tell us for sure what happened with the banner. Doesn't affect users at all [00:53:22] cya! [00:53:45] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3215601 keys, up 92 days 16 hours - replication_delay is 0 [00:54:45] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3215504 keys, up 92 days 16 hours - replication_delay is 0 [00:56:17] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: Port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659#2637292 (10Dzahn) Ganglia now shows a deprecation banner. 
{F5436683} [00:57:48] !log Ganglia is now deprecated in favor of Grafana (https://phabricator.wikimedia.org/T145659#2925104) [00:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:15] RECOVERY - puppet last run on cp1064 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [01:07:05] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=131.50 Read Requests/Sec=163.90 Write Requests/Sec=0.30 KBytes Read/Sec=2946.80 KBytes_Written/Sec=12.00 [01:15:26] !log ganeti: install1001 - remove virtual disk 1 from instance | create instance install1002 instead (T132757) [01:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:15:31] T132757: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757 [01:15:45] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [01:19:04] !log ganeti: create instance install2002 with 80G disk, 2G RAM (T156440) [01:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:19:11] T156440: re-create install2001 as a VM - https://phabricator.wikimedia.org/T156440 [01:39:46] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:06:47] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:07:38] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 0.266 second response time [02:08:45] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [02:16:59] (03CR) 10Chad: [C: 032] ssh.job.progress changed: now takes ProgressReporter, not str [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335366 (owner: 1020after4) [02:17:44] (03PS1) 10Dzahn: DHCP: add install1001,install2001 [puppet] - 10https://gerrit.wikimedia.org/r/335386 (https://phabricator.wikimedia.org/T156440) [02:18:18] (03Merged) 10jenkins-bot: ssh.job.progress changed: now takes ProgressReporter, not str [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335366 (owner: 1020after4) [02:18:27] (03CR) 10jenkins-bot: ssh.job.progress changed: now takes ProgressReporter, not str [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335366 (owner: 1020after4) [02:20:42] !log demon@tin Synchronized scap/plugins/clean.py: no-op (duration: 00m 39s) [02:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:25:51] (03CR) 10Tim Landscheidt: [C: 031] toollabs: Install mktorrent [puppet] - 10https://gerrit.wikimedia.org/r/334962 (https://phabricator.wikimedia.org/T155470) (owner: 10Legoktm) [02:31:02] (03CR) 10Tim Landscheidt: [C: 04-1] labstore: Install package nethogs from jessie-backports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/334218 (owner: 10Madhuvishy) [02:31:30] (03PS2) 10Dzahn: DHCP: add install1001,install2001 [puppet] - 10https://gerrit.wikimedia.org/r/335386 (https://phabricator.wikimedia.org/T156440) [02:31:36] (03CR) 10Dzahn: [C: 032] DHCP: add install1001,install2001 [puppet] - 10https://gerrit.wikimedia.org/r/335386 (https://phabricator.wikimedia.org/T156440) (owner: 10Dzahn) [02:34:02] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.9) (duration: 11m 52s) [02:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:48:05] 
!log install1002, install2002 - install jessie, sign puppet certs, initial puppet run (T132757, T156440) [02:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:48:11] T132757: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757 [02:48:11] T156440: re-create install2001 as a VM - https://phabricator.wikimedia.org/T156440 [02:49:15] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [02:51:15] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [02:51:59] (03CR) 10Dzahn: [C: 031] toollabs: Install mktorrent [puppet] - 10https://gerrit.wikimedia.org/r/334962 (https://phabricator.wikimedia.org/T155470) (owner: 10Legoktm) [02:53:15] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [02:53:15] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.277 second response time [02:54:15] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [02:54:16] (03CR) 10Dzahn: [C: 031] labs modules linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334290 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [02:54:57] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.10) (duration: 03m 42s) [02:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:57:18] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [02:59:08] PROBLEM - puppet last run on install1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[nginx] [02:59:18] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:00:18] PROBLEM - Check systemd state on install2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:00:28] PROBLEM - Check systemd state on install1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:00:33] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Feb 1 03:00:32 UTC 2017 (duration 5m 35s) [03:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:01:08] PROBLEM - HTTP on install2002 is CRITICAL: connect to address 208.80.153.53 and port 80: Connection refused [03:01:18] PROBLEM - HTTP on install1002 is CRITICAL: connect to address 208.80.154.86 and port 80: Connection refused [03:01:41] (03PS1) 10Dzahn: installserver: add install1002/2002 to hiera [puppet] - 10https://gerrit.wikimedia.org/r/335388 (https://phabricator.wikimedia.org/T156440) [03:02:28] PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[nginx] [03:03:00] (03CR) 10Dzahn: [C: 032] installserver: add install1002/2002 to hiera [puppet] - 10https://gerrit.wikimedia.org/r/335388 (https://phabricator.wikimedia.org/T156440) (owner: 10Dzahn) [03:05:27] ACKNOWLEDGEMENT - Check systemd state on install2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn T156440 [03:05:27] ACKNOWLEDGEMENT - HTTP on install2002 is CRITICAL: connect to address 208.80.153.53 and port 80: Connection refused daniel_zahn T156440 [03:05:27] ACKNOWLEDGEMENT - NTP on install2002 is CRITICAL: NTP CRITICAL: Offset unknown daniel_zahn T156440 [03:05:27] ACKNOWLEDGEMENT - puppet last run on install2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 9 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[nginx] daniel_zahn T156440 [03:05:57] ACKNOWLEDGEMENT - Check systemd state on install1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn T132757 [03:05:57] ACKNOWLEDGEMENT - HTTP on install1002 is CRITICAL: connect to address 208.80.154.86 and port 80: Connection refused daniel_zahn T132757 [03:05:57] ACKNOWLEDGEMENT - NTP on install1002 is CRITICAL: NTP CRITICAL: Offset unknown daniel_zahn T132757 [03:05:57] ACKNOWLEDGEMENT - puppet last run on install1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 38 seconds ago with 1 failures. Failed resources (up to 3 shown): Service[nginx] daniel_zahn T132757 [03:13:18] 06Operations, 10MediaWiki-Vagrant, 06Release-Engineering-Team, 07Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#2988743 (10Krinkle) [03:14:32] (03PS1) 10Dzahn: installserver::nginx: use "do_acme" in Hiera instead of custom var [puppet] - 10https://gerrit.wikimedia.org/r/335389 [03:15:41] (03PS2) 10Dzahn: installserver::nginx: use "do_acme" in Hiera instead of custom var [puppet] - 10https://gerrit.wikimedia.org/r/335389 [03:16:50] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:18:26] (03PS3) 10Dzahn: installserver::nginx: use "do_acme" in Hiera instead of custom var [puppet] - 10https://gerrit.wikimedia.org/r/335389 [03:18:40] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 0.276 second response time [03:20:18] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.131 second response time [03:22:18] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 608.87 seconds [03:27:18] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 196.82 seconds [03:39:14] (03CR) 10Dzahn: [C: 032] installserver::nginx: use "do_acme" in Hiera instead of custom var [puppet] - 10https://gerrit.wikimedia.org/r/335389 (owner: 10Dzahn) [03:39:17] 06Operations, 10MediaWiki-Vagrant, 06Release-Engineering-Team, 07Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#2988770 (10bd808) [03:40:40] 06Operations, 10MediaWiki-Vagrant, 06Release-Engineering-Team, 07Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#2334744 (10bd808) [03:54:08] (03CR) 10Aaron Schulz: [C: 031] Remove aaron from logstash-roots [puppet] - 10https://gerrit.wikimedia.org/r/335011 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) 
[03:54:20] (03CR) 10Aaron Schulz: [C: 031] Remove aaron from contint-users [puppet] - 10https://gerrit.wikimedia.org/r/335014 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [04:08:08] RECOVERY - HTTP on install2002 is OK: HTTP OK: HTTP/1.1 200 OK - 244 bytes in 0.073 second response time [04:08:18] RECOVERY - Check systemd state on install2002 is OK: OK - running: The system is fully operational [04:08:48] RECOVERY - puppet last run on install2002 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [04:09:18] RECOVERY - HTTP on install1002 is OK: HTTP OK: HTTP/1.1 200 OK - 244 bytes in 0.002 second response time [04:09:28] RECOVERY - Check systemd state on install1002 is OK: OK - running: The system is fully operational [04:10:08] RECOVERY - puppet last run on install1002 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [04:15:00] (03PS2) 10Dzahn: Remove aaron from contint-users [puppet] - 10https://gerrit.wikimedia.org/r/335014 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [04:17:14] !log carbon - rsyncing entire /srv over to install2002 (T156440) [04:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:17:21] T156440: re-create install2001 as a VM - https://phabricator.wikimedia.org/T156440 [04:18:40] (03CR) 10Dzahn: [C: 032] Remove aaron from contint-users [puppet] - 10https://gerrit.wikimedia.org/r/335014 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [04:19:28] PROBLEM - puppet last run on restbase1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:21:41] (03PS2) 10Dzahn: Remove aaron from logstash-roots [puppet] - 10https://gerrit.wikimedia.org/r/335011 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [04:23:23] (03CR) 10Dzahn: [C: 032] Remove aaron from logstash-roots [puppet] - 10https://gerrit.wikimedia.org/r/335011 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [04:45:38] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1001.eqiad.wmnet because of too many down! [04:46:38] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [04:47:28] RECOVERY - puppet last run on restbase1014 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [04:52:50] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:53:40] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 0.272 second response time [04:56:50] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:57:38] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [05:00:38] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [05:04:00] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 0.372 second response time [05:06:38] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1001.eqiad.wmnet because of too many down! 
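The /srv copy logged at [04:17:14] boils down to a recursive rsync from the old install server to the new VM. A rough sketch, assuming root SSH between the hosts; the flags and the install2002 FQDN are assumptions, not the exact command used:

    # -a preserves permissions, ownership and symlinks; -z compresses on
    # the wire. Trailing slashes copy the contents of /srv, not /srv itself.
    rsync -avz /srv/ install2002.wikimedia.org:/srv/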
[05:07:09] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:07:38] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [05:07:38] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [05:08:38] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [05:12:00] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 6.973 second response time [05:19:09] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:20:38] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [05:20:38] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [05:21:38] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [05:23:00] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 0.002 second response time [05:23:38] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [05:27:09] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:29:00] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 0.002 second response time [05:34:10] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:34:59] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 3.647 second response time [05:41:38] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [05:42:38] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [05:48:38] RECOVERY - MariaDB Slave Lag: s6 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 89943.29 seconds [05:49:09] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:50:00] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 7.962 second response time [06:00:10] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:01:00] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 0.895 second response time [06:02:42] godog: if you're around ^ [06:04:38] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! 
[06:05:10] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:05:38] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [06:07:00] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 0.112 second response time [06:10:10] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:11:38] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [06:13:00] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 0.316 second response time [06:13:38] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [06:24:29] increased traffic by 200Mbps ... https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=thumbor1001&var-network=eth0&from=now-24h&to=now [06:26:10] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:28:00] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 0.057 second response time [06:34:10] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:35:38] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1001.eqiad.wmnet because of too many down! [06:36:38] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [06:39:00] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 0.122 second response time [06:44:10] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:46:00] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 0.002 second response time [06:47:24] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2989006 (10Marostegui) db2012 caught up nicely so I believe this ticket can be closed. We can disc... 
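The flapping "LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet" check above is a plain HTTP probe with a 10-second budget. A hand-run approximation from a production host, assuming the service answers on port 8800 (inferred from the thumbor_8800 pool name; port and path are assumptions):

    # -m 10 mirrors the check's 10s timeout; a hung backend surfaces as
    # "Operation timed out" instead of the usual HTTP/1.1 200 OK.
    curl -sv -m 10 -o /dev/null http://thumbor.svc.eqiad.wmnet:8800/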
[06:48:25] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2989008 (10Marostegui) 05Open>03Resolved a:03jcrespo [06:53:10] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:54:00] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 2.622 second response time [07:00:17] I can see a lot of "Killed process 44177 (thumbor) total-vm:3852212kB, anon-rss:968968kB, file-rss:17740kB" on thumbor1001 [07:00:26] the OOM is having a party [07:02:25] elukey: yeah but that's due to MemoryLimit=1G in the systemd unit file [07:02:37] if you look at the numbers, OOM is coming out at 1G [07:03:38] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.162 second response time [07:04:32] elukey: I am trying to raise the number of open files but I doubt that's an actual solution [07:05:16] I was trying to figure out if there is a specific req that triggers this mess [07:05:35] yeah me too.. for a long time.. haven't yet figured out something [07:06:32] wow I just saw your graph about network traffic :D [07:07:51] elukey: and there are many concurrent things happening in the logs [07:08:01] segfault for rsvg-convert [07:08:05] oom coming out [07:08:12] to enforce the 1G limit [07:08:20] and too many open files logged by thumbor [07:08:40] 06Operations, 10Analytics, 10DBA: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#2989026 (10Marostegui) >>! In T124307#2987587, @Ottomata wrote: > @Marostegui ok! So the T125135 auto-increment thing is a very small piece of this larger issue. > > Let's see if we can... [07:09:16] or stuff like ValueError: could not convert string to float: N/A [07:09:28] akosiaris: yes I got now how the systemd services are done, now I get why you want to raise the number of open files :) [07:09:46] I think I stopped the error on thumbor1002 [07:10:10] but saying I know why would be a lie [07:10:20] I may have just treated the symptom [07:10:48] at least now it's getting throttled by poolcounter [07:10:59] Feb 1 07:10:50 thumbor1002 thumbor@8829[124089]: 2017-02-01 07:10:50,956 8829 thumbor:ERROR [ImagesHandler] Throttled by PoolCounter: thumbor-render-e057a52d77752f98eec3424534ee8dbd4d54896f {'workers': 2, 'maxqueue': 100, 'timeout': 8} [07:11:16] I have no idea about the pool counter [07:11:31] :( [07:11:36] oh thumbor recently got support for rate limiting via poolcounter [07:11:54] it's effectively just a choke to protect itself [07:12:05] same way as mediawiki does [07:12:21] we've got to figure out what's up with all that extra traffic though [07:13:24] I tried to check the nginx access log [07:13:30] !log restart thumbor process on thumbor1001, thumbor1002, apply a different LimitNOFILE on thumbor1002 [07:13:33] Failed to log message to wiki. Somebody should check the error logs. [07:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:53] ETOOMANYSTASHBOTS [07:14:00] will deal with that later though [07:14:12] elukey: and all you get is ms-fe hosts [07:14:43] yeah..
taking a look at X-Forwarded-For I get what appears to be legit traffic at first sight [07:15:05] I was looking at req pattern but nothing really stands out [07:15:06] mmmmm' [07:15:31] could it be a retry of some sort? Failures generating more traffic? [07:15:39] I sure hope not [07:16:31] so, my LimitNOFILE might very well have been a placebo [07:16:51] a simple restart on thumbor1001 (without any LimitNOFILE change) seems to have the same effect [07:18:14] er... elukey so... whatever was going on.. it's not any more [07:18:18] the traffic is gone [07:18:39] well the outbound pattern at least [07:19:10] (the inbound pattern going away would actually be bad) [07:19:30] that's bad.. it means thumbor somehow manages state [07:19:49] and it's meant to be stateless [07:20:08] sigh, I got no pages for thumbor?! anyways the outbound traffic could be uploads back to swift [07:20:27] godog: o/ [07:20:38] godog: o/ [07:20:51] akosiaris: https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=thumbor1001&var-network=eth0&from=now-24h&to=now&panelId=10&fullscreen [07:21:13] hey elukey akosiaris, thanks for taking a look! [07:21:31] mmmm no I got fooled by the graph, nevermind [07:22:00] godog: let me know if I can help in any way, Alex did all the work :) [07:22:00] elukey: this one is also a wat moment https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?panelId=9&fullscreen&from=now%2Fw&to=now [07:22:30] probably unrelated but still [07:22:38] seeing it this morning got me agitated [07:23:00] maybe a bot? [07:23:07] godog: so I haven't really done anything apart from restarting all thumbor instances [07:23:33] on thumbor1002 a couple of times cause I changed LimitNOFILE (it was useless as it turns out though) [07:24:03] despite the "Too many open files" error being logged in the error log, after the restart I did not notice it anymore [07:25:10] now thumbor is getting once more throttled by poolcounter (as it should?) [07:25:21] odd, looks like some sort of fd leak [07:25:46] that might explain the problem [07:26:17] would it explain all that outbound traffic ? [07:26:26] no, right ? [07:26:53] no it wouldn't :( [07:27:28] I am taking a closer look as soon as laptop restarts heh [07:27:37] (03PS3) 10Elukey: Add aqs1008-a to the AQS Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/335006 (https://phabricator.wikimedia.org/T155654) [07:27:42] I am off to breakfast, I'll be around [07:28:03] thanks again elukey akosiaris! [07:28:41] I'll need to figure out why I got no pages too heh [07:30:38] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.131 second response time [07:31:09] !log Force WB policy on the raid controller db1072 - T156226 [07:31:11] godog: I don't see notifications outgoing to you anyway btw [07:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:13] T156226: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226 [07:31:51] akosiaris: though it did page you for example (?) [07:31:55] godog: probably some timezone setting ?
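akosiaris's reading of the unit file (OOM kills at MemoryLimit=1G, plus the LimitNOFILE experiment) can be checked from outside it, and the suspected fd leak counted per worker. A sketch using the thumbor@8829 instance name that appears in the log; the 8192 drop-in value is illustrative:

    # What systemd actually enforces for this instance.
    systemctl show thumbor@8829 -p MemoryLimit -p LimitNOFILE

    # Count the worker's open descriptors; a number that keeps growing
    # between requests points at leaked (e.g. swift) connections.
    pid=$(systemctl show thumbor@8829 -p MainPID | cut -d= -f2)
    ls /proc/"$pid"/fd | wc -l

    # Raise the fd ceiling with a drop-in rather than editing the unit.
    mkdir -p /etc/systemd/system/thumbor@.service.d
    printf '[Service]\nLimitNOFILE=8192\n' \
        > /etc/systemd/system/thumbor@.service.d/limits.conf
    systemctl daemon-reload && systemctl restart thumbor@8829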
[07:31:58] yes I am [07:32:42] me, apergos, bblack, mutante, paravoid, robbh, volans, tim, yuvi, madhu and moritzm [07:32:44] but not you [07:33:05] sorry for all the pings btw to all of the above :-) [07:33:43] (03CR) 10Elukey: [C: 032] Add aqs1008-a to the AQS Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/335006 (https://phabricator.wikimedia.org/T155654) (owner: 10Elukey) [07:33:44] 06Operations, 10DBA: Move db1073 to B3 - https://phabricator.wikimedia.org/T156126#2989059 (10Marostegui) Let's not do this until db1072 issues are fixed (T156226#2984963) [07:34:43] mhh yeah CET vs CEST awake hours [07:35:01] anyways [07:41:33] !log bootstrapping aqs1008-a on aqs1008 (new AQS cassandra node) [07:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:48] !log rolling restart of cassandra in eqiad to pick up openjdk and NSS security updates [07:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:46] (03PS1) 10Marostegui: db-codfw.php: Depool db2061 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335404 (https://phabricator.wikimedia.org/T153300) [07:47:58] all right aqs1008-a is bootstrapping [07:49:15] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2061 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335404 (https://phabricator.wikimedia.org/T153300) (owner: 10Marostegui) [07:50:39] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2061 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335404 (https://phabricator.wikimedia.org/T153300) (owner: 10Marostegui) [07:52:00] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2061 - T153300 (duration: 00m 53s) [07:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:04] T153300: Remove partitions from metawiki.pagelinks in s7 - https://phabricator.wikimedia.org/T153300 [07:53:25] !log Deploy alter table metawiki.pagelinks db2061 - T153300 [07:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:04] (03CR) 10jenkins-bot: db-codfw.php: Depool db2061 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335404 (https://phabricator.wikimedia.org/T153300) (owner: 10Marostegui) [08:03:23] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [08:06:33] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [08:08:33] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:09:23] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:11:33] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [08:12:23] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [08:12:33] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:14:23] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:18:12] (03PS7) 10Juniorsys: Linting fixes (Multiple modules) [puppet] - 10https://gerrit.wikimedia.org/r/334276 (https://phabricator.wikimedia.org/T93645) [08:18:31] (03PS6) 10Juniorsys: etcd: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334282 
(https://phabricator.wikimedia.org/T93645) [08:18:49] (03PS5) 10Juniorsys: eventlogging/eventstreams: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334283 (https://phabricator.wikimedia.org/T93645) [08:19:07] (03PS7) 10Juniorsys: extdist: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334284 [08:19:26] (03PS5) 10Juniorsys: labs modules linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334290 (https://phabricator.wikimedia.org/T93645) [08:19:35] (03PS5) 10Juniorsys: ldap: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334291 (https://phabricator.wikimedia.org/T93645) [08:19:52] (03PS6) 10Juniorsys: librenms/locales/logstash/lshell linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334293 (https://phabricator.wikimedia.org/T93645) [08:19:55] <_joe_> friendly12345: do not rebase all your patches every day :P [08:20:09] (03PS5) 10Juniorsys: lvm/lvs: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334294 (https://phabricator.wikimedia.org/T93645) [08:20:16] (03PS5) 10Juniorsys: mysql: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334298 (https://phabricator.wikimedia.org/T93645) [08:20:31] (03PS5) 10Juniorsys: Linting changes (multiple) [puppet] - 10https://gerrit.wikimedia.org/r/334299 (https://phabricator.wikimedia.org/T93645) [08:20:47] (03PS5) 10Juniorsys: ores/otrs/package_builder: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334300 (https://phabricator.wikimedia.org/T93645) [08:20:59] (03PS5) 10Juniorsys: openstack: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334301 (https://phabricator.wikimedia.org/T93645) [08:21:20] (03PS5) 10Juniorsys: profile linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334303 (https://phabricator.wikimedia.org/T93645) [08:21:31] (03PS5) 10Juniorsys: puppet/puppet_compiler: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334307 (https://phabricator.wikimedia.org/T93645) [08:22:18] (03PS5) 10Juniorsys: quarry: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334309 (https://phabricator.wikimedia.org/T93645) [08:22:42] (03PS5) 10Juniorsys: role: Linting changes (backup,bastionhost+others) [puppet] - 10https://gerrit.wikimedia.org/r/334310 (https://phabricator.wikimedia.org/T93645) [08:22:52] (03PS5) 10Juniorsys: redis: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334311 (https://phabricator.wikimedia.org/T93645) [08:23:29] (03PS5) 10Juniorsys: Linting fixes (multiple modules) [puppet] - 10https://gerrit.wikimedia.org/r/334317 (https://phabricator.wikimedia.org/T93645) [08:23:45] (03PS5) 10Juniorsys: graphoid/gridengine/grub/haproxy/hhvm lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/334319 (https://phabricator.wikimedia.org/T93645) [08:24:02] (03PS5) 10Juniorsys: ifttt/imagemagick/initramfs/interface lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/334320 (https://phabricator.wikimedia.org/T93645) [08:24:16] (03PS5) 10Juniorsys: Puppet style: Use one line per include/require [puppet] - 10https://gerrit.wikimedia.org/r/334322 [08:35:24] PROBLEM - puppet last run on mw1250 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:48:59] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should handle "temp" thumbnail requests - https://phabricator.wikimedia.org/T151441#2989137 (10fgiunchedi) 05Resolved>03Open Reopening this as we've seen what looks like a fd leak on thumbor for swift connections today. In additio... 
[09:03:43] RECOVERY - puppet last run on mw1250 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [09:09:48] (03PS1) 10Marostegui: site.pp: Change db1064 to ROW [puppet] - 10https://gerrit.wikimedia.org/r/335407 (https://phabricator.wikimedia.org/T153743) [09:12:03] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:14:03] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/5300/ compiles fine and only changes db1064's binlogs from MIXED to ROW" [puppet] - 10https://gerrit.wikimedia.org/r/335407 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [09:14:27] (03PS1) 10Filippo Giunchedi: lvs: make thumbor.svc alert non-critical [puppet] - 10https://gerrit.wikimedia.org/r/335408 [09:24:30] (03CR) 10Filippo Giunchedi: [C: 032] lvs: make thumbor.svc alert non-critical [puppet] - 10https://gerrit.wikimedia.org/r/335408 (owner: 10Filippo Giunchedi) [09:25:27] (03PS5) 10Elukey: Extend role memcached to the new codfw mc hosts [puppet] - 10https://gerrit.wikimedia.org/r/335208 (https://phabricator.wikimedia.org/T155755) [09:35:04] (03PS1) 10Muehlenhoff: Remove access credentials for jhobs [puppet] - 10https://gerrit.wikimedia.org/r/335411 [09:37:34] (03CR) 10Muehlenhoff: [C: 032] Remove access credentials for jhobs [puppet] - 10https://gerrit.wikimedia.org/r/335411 (owner: 10Muehlenhoff) [09:40:01] phab's mysql server down? [09:40:06] > Woe! This request had its journey cut short by unexpected circumstances (Can Not Connect to MySQL). [09:40:49] #1290: The MariaDB server is running with the --read-only option so it cannot execute this statement [09:40:54] I am checking [09:40:57] * volans looking [09:41:01] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [09:41:31] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 2 down 1 [09:41:31] PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 2 down 1 [09:41:55] That looks potentially related [09:42:27] oh man [09:42:36] unscheduled phabricator maintenance ! :] [09:42:39] the db itself is up and running (db1043) [09:42:46] hashar: woot? [09:44:18] marostegui: na I was just wondering what is happening with phabricator / the db [09:44:29] then I have no clue about Phabricator so ... [09:46:54] oh [09:47:26] oh?
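The #1290 error pasted above means writes were reaching a server flagged read-only, which is exactly the condition the dbproxy failover checks trip on. A quick manual verification against the m3 master mentioned in the scroll (db1043; the FQDN and client credentials are assumptions):

    # 0 = writable master, 1 = read-only (what phabricator and the proxy saw).
    mysql -h db1043.eqiad.wmnet -e 'SELECT @@global.read_only'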
[09:47:29] marostegui: so the thing is that last week when upgrading phabricator the m3 slaves eventually exploded [09:48:02] https://phabricator.wikimedia.org/T156373 [09:48:12] yes, I am aware [09:48:26] but supposedly the mighty DBA had an epic crusade/journey to fix up MariaDB code itself [09:48:30] so in theory that should be fixed [09:48:34] hashar: it is fixed indeed [09:48:45] or the patch does not cover 100% of cases, or something else entirely :] [09:48:55] both dbs are up and running, I am trying to see why haproxy is complaining [09:50:53] up for me now [09:51:05] marostegui: --^ [09:51:11] yeah, I am restarting the proxies [09:51:31] RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0 [09:51:31] RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 2 down 0 [09:51:35] very weird [09:51:41] should be up now [09:51:52] I will follow this up [09:51:54] magic [09:52:37] marostegui: i should have gathered some stats before you restarted them [09:52:46] I got some stats [09:53:15] (from haproxy anyways) [09:53:25] going to create a task so we can put stuff together there :) [09:53:32] marostegui: how/where did you get the stats? [09:53:40] (just to know how to do it :) [09:53:49] echo "show stat" | socat unix-connect:/tmp/haproxy.socket stdio [09:53:57] Nikerabbit: the search slowness you were seeing was probably prodromic to the database crash [09:53:58] elukey: socket [09:54:25] from decades of role playing games and fantasy books there is one thing I learned: do not ask secrets to the Wizards Of the Db! [09:57:31] _joe_: Oh, what should I be doing? [09:58:25] are you taking open suggestions ? /me says dance ! [09:58:28] Nemo_bis: technically the databases didn't crash [09:58:34] <_joe_> thedj: ahah [09:58:37] Nikerabbit: was completely down for me [09:58:45] <_joe_> friendly12345: just no need to rebase them until before merging [09:58:53] Nemo_bis: yes, that was later, was it not [09:59:23] _joe_: Er I don't think I have merge access [09:59:42] <_joe_> friendly12345: exactly, let whoever merges a patch rebase it [09:59:59] _joe_: Okay [10:02:54] hashar, Nemo_bis mysql actually crashed [10:03:31] ok thanks [10:07:16] This is the task: https://phabricator.wikimedia.org/T156905 [10:12:01] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:17:35] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The direction is surely the right one, but there are a series of issues with this patch, both in terms of coding style and functionality." (0313 comments) [puppet] - 10https://gerrit.wikimedia.org/r/333880 (owner: 10Elukey) [10:27:48] (03PS1) 10Juniorsys: zuul: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/335418 [10:36:16] (03PS1) 10Juniorsys: xdummy: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/335422 [10:39:01] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [10:39:15] (03CR) 10Hashar: [C: 031] zuul: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/335418 (owner: 10Juniorsys) [10:39:40] (03CR) 10Muehlenhoff: [C: 031] "Some comments for improvement, but this looks good to me in general!"
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [10:56:24] 06Operations, 10MediaWiki-Database, 10MediaWiki-General-or-Unknown, 10Wikimedia-General-or-Unknown: 504 Gateway Time-out on https://de.wikipedia.org/w/index.php?title=Wikipedia:L%C3%B6schkandidaten&action=info - https://phabricator.wikimedia.org/T156537#2989390 (10daniel) Yes, finding the first revision of... [10:57:52] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2989313 (10Marostegui) [11:06:43] !log kartik@tin Started deploy [cxserver/deploy@0e4ae4f]: (no justification provided) [11:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:47] !log kartik@tin Finished deploy [cxserver/deploy@0e4ae4f]: (no justification provided) (duration: 02m 04s) [11:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:08] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2989405 (10Marostegui) >>! In T156905#2989343, @Volans wrote: > Looks like there was some heavy load on the server in the ~20 minutes before the OOM: > > https:/... [11:14:03] !log bounce leaking thumbor@8813 on thumbor1001 [11:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:06] ah. Forgot to provide justification :/ [11:16:52] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: Port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659#2989419 (10fgiunchedi) [11:23:44] (03CR) 10Hashar: "Thank you very much Chase!! That will definitely be useful." (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/335373 (owner: 10Rush) [11:25:33] !log removing ntfs-3g from various trusty servers [11:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:54] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2989430 (10Paladox) @Marostegui hi, it looks like the server is running version Server version: 10.0.23-MariaDB-log Should we update it to 10.0.29 the package th... [11:38:46] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2989448 (10Paladox) @Marostegui could it b a memory leak? could this https://github.com/MariaDB/server/commit/b7dc830 be the fix? [11:40:17] (03CR) 10Muehlenhoff: [C: 031] Cumin: allow connection to the targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [11:45:33] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2061" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335430 [11:46:23] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2989466 (10jcrespo) > However, db1048 (the slave) also crashed and that one is only supposed to have the replication thread running right? No, at around 09:39:31... 
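Both the thumbor kills earlier in the night and the load spike volans noted before the phabricator OOM leave traces in the kernel log. A sketch of pulling them out on the affected host (the time window is illustrative):

    # The OOM killer logs the victim, its RSS and the limit that was hit.
    journalctl -k --since '06:30' | grep -iE 'out of memory|killed process'
    # Or, without persistent journald:
    dmesg -T | grep -i 'killed process'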
[11:47:13] (03PS18) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) [11:47:24] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2989467 (10Marostegui) >>! In T156905#2989430, @Paladox wrote: > @Marostegui hi, it looks like the server is running version Server version: 10.0.23-MariaDB-log >... [11:49:43] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2061" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335430 (owner: 10Marostegui) [11:51:19] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2061" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335430 (owner: 10Marostegui) [11:51:28] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2061" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335430 (owner: 10Marostegui) [11:51:51] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2989474 (10Paladox) does this mean that phabricator needs improvements to it's query? [11:52:45] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2061 - T153300 (duration: 00m 40s) [11:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:49] T153300: Remove partitions from metawiki.pagelinks in s7 - https://phabricator.wikimedia.org/T153300 [11:57:16] (03CR) 10Volans: "Addressed comments, latest puppet compiler output available here:" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [12:04:44] !log elukey@tin Started deploy [analytics/refinery@e6254a4]: (no justification provided) [12:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:13] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2783005 (10phuedx) ^ Captured in {T156910}. [12:09:26] !log elukey@tin Finished deploy [analytics/refinery@e6254a4]: (no justification provided) (duration: 04m 41s) [12:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:41] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 8567.00 seconds [12:10:04] ^ that is expected [12:10:07] I will silence it [12:13:41] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time [12:14:41] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1572 bytes in 0.006 second response time [12:20:31] PROBLEM - DPKG on labvirt1005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:20:41] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [12:35:27] (03PS1) 10Nschaaf: (in progress) Drop wdqs_extract partitions older than 90 days [puppet] - 10https://gerrit.wikimedia.org/r/335437 (https://phabricator.wikimedia.org/T146915) [12:37:11] (03CR) 10Nschaaf: [C: 04-1] "The script still needs tested and merged prior to this being merged." 
[puppet] - 10https://gerrit.wikimedia.org/r/335437 (https://phabricator.wikimedia.org/T146915) (owner: 10Nschaaf) [12:38:56] (03PS1) 10Jcrespo: Revert "phabricator: Increase phabricator dbs buffer pool" [puppet] - 10https://gerrit.wikimedia.org/r/335438 [12:39:21] (03PS2) 10Jcrespo: Revert "phabricator: Increase phabricator dbs buffer pool" [puppet] - 10https://gerrit.wikimedia.org/r/335438 [12:40:41] PROBLEM - puppet last run on mw1196 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:45:56] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should handle "temp" thumbnail requests - https://phabricator.wikimedia.org/T151441#2989570 (10Gilles) 05Open>03Resolved Moved that concern to T156913 [12:46:51] PROBLEM - DPKG on labtestcontrol2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:51:16] (03CR) 10Jcrespo: [C: 032] Revert "phabricator: Increase phabricator dbs buffer pool" [puppet] - 10https://gerrit.wikimedia.org/r/335438 (owner: 10Jcrespo) [12:52:53] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2989591 (10Marostegui) >>! In T156905#2989474, @Paladox wrote: > does this mean that phabricator needs improvements to it's query? I believe so, or at least on t... [12:53:10] (03CR) 10Marostegui: "+1 for the record :)" [puppet] - 10https://gerrit.wikimedia.org/r/335438 (owner: 10Jcrespo) [13:05:42] 06Operations, 10MediaWiki-Database, 10MediaWiki-General-or-Unknown, 10Wikimedia-General-or-Unknown, 13Patch-For-Review: 504 Gateway Time-out on https://de.wikipedia.org/w/index.php?title=Wikipedia:L%C3%B6schkandidaten&action=info - https://phabricator.wikimedia.org/T156537#2989604 (10jcrespo) > finding t... 
[13:07:25] (03PS1) 10Gilles: Upgrade to 0.1.34 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/335442 (https://phabricator.wikimedia.org/T156913) [13:08:31] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 55 probes of 270 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [13:09:41] RECOVERY - puppet last run on mw1196 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [13:10:21] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1671.20 Read Requests/Sec=1654.40 Write Requests/Sec=503.50 KBytes Read/Sec=35616.80 KBytes_Written/Sec=4946.00 [13:17:31] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 406 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [13:21:37] !log Clean up db1043 replication thread (it was replicating from db1048 which looks like an old thing) - T156905 [13:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:44] T156905: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905 [13:26:21] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=162.30 Read Requests/Sec=164.00 Write Requests/Sec=150.00 KBytes Read/Sec=2616.00 KBytes_Written/Sec=1301.20 [13:32:31] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 406 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [13:38:47] !log issue sudo hdparm -Y /dev/sdb on bast3001 to force a problematic drive to sleep [13:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:54] let's see what this will do [13:39:07] !log Deploy alter table dbstore1002 metawiki.pagelinks - T153300 [13:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:11] T153300: Remove partitions from metawiki.pagelinks in s7 - https://phabricator.wikimedia.org/T153300 [13:39:31] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 406 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [13:44:01] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479 [13:45:01] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3215535 keys, up 93 days 5 hours - replication_delay is 0 [13:49:31] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 406 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [13:51:08] Hi. [13:51:58] I'll handle the EU SWAT. [13:52:52] Dereckson: great [13:52:59] Nikerabbit: before deploying https://gerrit.wikimedia.org/r/#/c/333579/2/languages/messages/MessagesJv.php we should provide aliases in config too, shouldn't we? [13:53:02] I was just about to ask hashar if he wants it :) [13:53:37] Nikerabbit: it seems Siebrand recommended that for previous similar patches to avoid disruption [13:53:54] Dereckson: I just added 1 more to it! I had moved it to a different SWAT but just moved it back again! [13:54:04] addshore: ok [13:55:22] akosiaris: good idea re: hdparm, did it work? [13:56:27] Dereckson: what config?
all the aliases are there in the file [13:56:49] oh, and it probably needs l10n-sync or something, that might take a while [13:56:54] (03PS1) 10Muehlenhoff: Remove ntfs-3g on precise/trusty [puppet] - 10https://gerrit.wikimedia.org/r/335444 [13:57:01] Nikerabbit: in IS, and I've no idea *why* Siebrand recommended that [13:57:13] (InitialiseSettings.php) [13:57:35] 06Operations, 10Analytics, 10DBA: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#2989663 (10Ottomata) @mforns, can you comment about large DELETEs? Do they happen often? How large are they when they happen? Would LOAD DATA actually help replication? [13:57:52] Dereckson: maybe if it was something Wikimedia-specific, but I don't see that here [13:58:21] okay, we'll try it dry in this case [13:58:40] godog: looks like it [13:58:52] -C says the disk is now in standby [13:59:01] nope, scratch that [13:59:09] md just woke it up [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170201T1400). [14:00:04] TabbyCat: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:21] (03PS4) 10Dereckson: Enable ElectronPdfService extension on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324488 (https://phabricator.wikimedia.org/T150943) (owner: 10Addshore) [14:00:38] sigh, I wonder if 'eject' would do it; worst case we can kick it out of smartctl's view manually [14:00:40] (03PS5) 10Dereckson: Enable ElectronPdfService extension on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324488 (https://phabricator.wikimedia.org/T150943) (owner: 10Addshore) [14:01:26] godog: I am thinking about just disabling SMART [14:02:00] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324488 (https://phabricator.wikimedia.org/T150943) (owner: 10Addshore) [14:02:08] I was expecting kicking it out of the array to pretty much stop all activity, but now the disk is trying in the background to get the 89 sectors pending reallocation reallocated [14:02:24] and every time it succeeds doing one, we get an email [14:03:46] addshore: zuul is queuing operations-mw-config-composer-hhvm-jessie [14:04:29] (03Merged) 10jenkins-bot: Enable ElectronPdfService extension on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324488 (https://phabricator.wikimedia.org/T150943) (owner: 10Addshore) [14:04:41] (03CR) 10jenkins-bot: Enable ElectronPdfService extension on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324488 (https://phabricator.wikimedia.org/T150943) (owner: 10Addshore) [14:04:57] addshore: live on mwdebug1002 [14:05:00] checking [14:05:26] twentyafterfour: you're handling the train this week? If so: cannot delete non-empty directory: php-1.29.0-wmf.4/cache/l10n cannot delete non-empty directory: php-1.29.0-wmf.3/cache/l10n [14:06:03] Dereckson: looks good!
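A condensed sketch of the drive-sleep dance akosiaris and godog are iterating on above, for bast3001's failing /dev/sdb (all of this needs root; disabling SMART is one way to stop the polling that keeps re-waking the disk and emailing about reallocated sectors):

```
#!/bin/bash
DISK=/dev/sdb   # the problematic drive from the log

hdparm -C "$DISK"   # report the current power mode (active/idle/standby)
hdparm -Y "$DISK"   # force the drive into its lowest-power sleep state

# If md or smartd immediately wakes it up again, stop querying the device:
smartctl --smart=off "$DISK"   # disable SMART on this device entirely
```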
[14:06:49] logs look good too, syncing [14:08:24] jouncebot: now [14:08:24] For the next 0 hour(s) and 51 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170201T1400) [14:08:31] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 12 probes of 270 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [14:09:50] Hi TabbyCat, so the namespaces are clean before the changes; it will be trivial to run namespaceDupes afterwards, then. [14:10:04] I've sent the changes to the gate-and-submit queue [14:10:06] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable ElectronPdfService on meta (T150943) (duration: 00m 48s) [14:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:11] T150943: Deploy ElectronPdfService Extension to metawiki - https://phabricator.wikimedia.org/T150943 [14:10:15] Dereckson: good to know [14:10:27] so it's just changing the namespaces and checking for conflicts later [14:10:29] wmf/1.29.0-wmf.9 is merged, wmf/1.29.0-wmf.8 is pending [14:10:33] Dereckson: looks good! :) [14:10:36] 8? [14:10:39] addshore: awesome [14:10:40] ty [14:10:43] I think I did 9 and 10 [14:10:48] .10 [14:10:58] (03CR) 10Filippo Giunchedi: "wrong assignment, LGTM otherwise" (031 comment) [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/335442 (https://phabricator.wikimedia.org/T156913) (owner: 10Gilles) [14:11:06] okay, merged too [14:11:33] should be live now? [14:11:54] Nope, I need to deploy first [14:12:04] ah true [14:12:16] !log uploaded hhvm 3.12.12 to carbon [14:12:18] (03PS4) 10Rush: wip nodepool: track and alert on age of instance states [puppet] - 10https://gerrit.wikimedia.org/r/335373 [14:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:07] TabbyCat: live on mwdebug1002 [14:13:21] logging and checking [14:13:45] Revent: ping [14:13:49] Reedy: ping [14:13:55] Revent: sorry, I wanted to ping Reedy [14:14:00] (03PS6) 10Elukey: Extend role memcached to the new codfw mc hosts [puppet] - 10https://gerrit.wikimedia.org/r/335208 (https://phabricator.wikimedia.org/T155755) [14:14:02] Dereckson: Hi [14:14:04] How rude. :P [14:14:18] (03PS2) 10Gilles: Upgrade to 0.1.34 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/335442 (https://phabricator.wikimedia.org/T156913) [14:14:27] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2989711 (10epriestley) I'm not sure why the query is slow or requires a significant amount of memory. The structure of the query specifically tries to avoid this,... [14:14:33] Reedy: hi, you left an unmerged commit in LQT for wmf.9: https://gerrit.wikimedia.org/r/#/c/335212/ [14:14:45] Dereckson: Not quite [14:14:48] (03PS5) 10Rush: wip nodepool: track and alert on age of instance states [puppet] - 10https://gerrit.wikimedia.org/r/335373 [14:14:52] It was a security patch before I put it in gerrit [14:15:14] ah, it's probably a missing commit to update the php-1.29.0-wmf.9 branch then [14:15:23] Do you want me to cherry-pick it onto .9 to tidy it up? [14:15:28] Dereckson: looks good on mwdebug1002 [14:15:44] Reedy: that would be cleaner I think, yes [14:16:29] liquid threads.... argh...
/me scratches [14:16:59] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2989717 (10epriestley) One vague possibility is that you may have a few documents which are extremely large: for example, perhaps someone wrote a 1GB comment on a... [14:17:02] Reedy: apparently a 5df02a1fc8c31534f40368ab89f2f8c4ffe73a9c -> f9c4fc1255bc981005a1ab31b63f40dd83b3d607 [14:17:23] It probably wants git reset HEAD~1 --hard in extensions/LiquidThreads [14:17:33] Then pull in core, and update the submodule when the patch merges [14:18:24] Dereckson: Want me to clean up, or do you want to? [14:19:24] git rebase origin/wmf/1.29.0-wmf.9 in the php-1.29.0-wmf.9 folder should be enough I think, so it gets the commit at core level; the extension folder looks up to date [14:19:47] TabbyCat: good [14:21:25] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 0.1.34 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/335442 (https://phabricator.wikimedia.org/T156913) (owner: 10Gilles) [14:21:44] !log dereckson@tin Synchronized php-1.29.0-wmf.10/languages/messages/MessagesJv.php: Update namespace localisation in Javanese (T155957) (duration: 00m 45s) [14:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:48] T155957: Rename talk namespaces in Javanese wikis: 'Dhiskusi' to 'Parembugan' - https://phabricator.wikimedia.org/T155957 [14:22:16] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2989733 (10jcrespo) [14:22:32] akosiaris: very kind of smartmontools indeed [14:23:20] Reedy: yes... no, a submodule update was needed too; all is clean now, thanks for the change [14:23:47] (with an empty git diff) [14:24:40] !log dereckson@tin Synchronized php-1.29.0-wmf.9/languages/messages/MessagesJv.php: Update namespace localisation in Javanese (T155957) (duration: 00m 40s) [14:24:43] TabbyCat: live on wmf9+wmf10 [14:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:19] * TabbyCat checks on live [14:25:28] !log dropping and replacing events on db1057 - db1052 T156008 [14:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:32] T156008: Switchover s1 master db1057 -> db1052 - https://phabricator.wikimedia.org/T156008 [14:26:02] and namespaceDupes is still clean; apparently they didn't create pages with the new names :) [14:26:15] Dereckson: hmm, still the old namespace names displayed [14:26:20] caching issues? [14:26:43] and it works on mwdebug1002? [14:26:55] or does the issue exist in both? [14:27:05] I mean on live [14:27:09] try to purge a page [14:27:10] not mwdebug [14:27:15] right, will do [14:27:16] you should see the new namespace instead [14:27:53] not yet [14:28:04] did purge and ctrl + shift + r [14:29:11] Dereckson: Special:Version there says MediaWiki 1.29.0-wmf.9 (70cc771) [14:29:12] 15:40, 30 Jan 2017 [14:29:31] and we did update wmf.9 and wmf.10 [14:30:31] 06Operations, 10Analytics, 10DBA: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#2989770 (10Marostegui) >>! In T124307#2989663, @Ottomata wrote: > > @Marostegui , Would LOAD DATA actually help replication? If you need to do massive data imports into the DB, it will h
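Pulling Reedy's and Dereckson's LQT cleanup steps together, the sequence on the deploy host would look roughly like this; the staging path is an assumption about the host layout, while the branch and submodule names come from the log:

```
#!/bin/bash
cd /srv/mediawiki-staging/php-1.29.0-wmf.9   # assumed checkout location

# Drop the locally-applied security patch now that it is merged in gerrit:
git -C extensions/LiquidThreads reset --hard HEAD~1

# Pick up the submodule-bump commit on the core branch and repoint LQT:
git fetch origin
git rebase origin/wmf/1.29.0-wmf.9
git submodule update --init extensions/LiquidThreads

git status        # should end clean, with an empty diff, as noted above
```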
[14:30:33] https://jv.wikipedia.org/wiki/Dhiskusi:Foo doesn't redirect me to Parembugan:Foo [14:30:38] (03CR) 10Elukey: [C: 032] Extend role memcached to the new codfw mc hosts [puppet] - 10https://gerrit.wikimedia.org/r/335208 (https://phabricator.wikimedia.org/T155755) (owner: 10Elukey) [14:31:10] Localisation is always fun. [14:31:35] (03CR) 10Ottomata: [C: 031] Remove otto from aqs-admins [puppet] - 10https://gerrit.wikimedia.org/r/335012 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [14:31:37] Dereckson: https://jv.wikipedia.org/wiki/Astamiwa:ApiSandbox#action=query&format=json&meta=siteinfo&indexpageids=1&redirects=1&converttitles=1&siprop=namespaces%7Cnamespacealiases [14:31:38] Nikerabbit: probably why :p [14:31:42] (03CR) 10Ottomata: [C: 031] Remove otto from piwik-roots [puppet] - 10https://gerrit.wikimedia.org/r/335010 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [14:31:53] Dereckson: l10nupdate maintenance script needed? [14:32:21] that one or a full scap, yes, but let me check some things before [14:33:01] sure sure [14:33:32] "to sync i18n changes, you must use scap sync." [14:34:10] I think I'll document it somewhere on Wikitech if it isn't already [14:34:12] Nikerabbit: yet namespaces behave differently from messages [14:34:28] TabbyCat: there are some notes in the incident reports [14:34:57] Nikerabbit: and strangely, it worked on mwdebug1002 [14:34:59] Dereckson: not that much, both are in the l10n-cache [14:35:05] ok [14:36:14] By the way, something unexpected works: the name in the tabs [14:37:21] !log dereckson@tin Started scap: Full scap to propagate a core namespace l10n change [14:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:49] Dhiskusi_Panganggo: <-- still displays this for me [14:38:02] 'Started' [14:38:06] will be done in 40-50 minutes [14:38:13] (03CR) 10Gehel: [C: 031] "At least for elasticsearch (the only Discovery service still running on Trusty), ntfs-3g can be removed." [puppet] - 10https://gerrit.wikimedia.org/r/335444 (owner: 10Muehlenhoff) [14:38:14] o_O [14:38:26] so much data to process, I guess [14:38:33] I'll be around if needed [14:38:46] That's why generally we only sync one file or one folder. [14:39:13] 06Operations, 10Analytics, 10DBA: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#2989845 (10Ottomata) EventLogging is a stream of data. We can do batching because the data is consumed from Kafka, and then inserted into MySQL via a python MySQL client. So we could con... [14:39:25] so what are you syncing now, the whole mediawiki/core? [14:39:27] Nikerabbit: So I guess some prefer to put a copy of the change in InitialiseSettings.php to avoid the full scap [14:39:43] TabbyCat: no, the whole codebase: core + extensions + l10n, everything [14:40:00] aargh [14:40:01] PROBLEM - DPKG on thumbor1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:41:01] RECOVERY - DPKG on thumbor1002 is OK: All packages OK [14:41:23] that was me ^ [14:41:36] !log upgrade thumbor to 0.1.34 [14:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:13] godog: thumbor was flapping a bit last night; I didn't get a chance to really inspect it, but should we turn off actual paging for thumbor?
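The contrast Dereckson draws above, expressed as deploy-host commands; a sketch of the usual pattern (log messages borrowed from this SWAT), not a full deployment recipe:

```
# Fast path: push one changed file; fine for config-only changes.
scap sync-file wmf-config/InitialiseSettings.php "Enable ElectronPdfService on meta (T150943)"

# Slow path: a full scap, needed when the l10n cache itself has to be
# rebuilt (e.g. a core namespace/message change) - expect 40+ minutes.
scap sync "Full scap to propagate a core namespace l10n change"
```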
[14:45:50] TabbyCat: actually, rebuilding the l10n cache is the longest part [14:46:25] TabbyCat: the cache for 9 is done, now 10 [14:47:27] chasemp: yeah, this morning alex and luca babysat/debugged it; I've turned off paging for thumbor :)) thanks! sorry about the spam [14:47:40] no worries man [14:47:43] cool [14:48:15] Dereckson: then if 9 is done I should be able to see the changes [14:48:36] (03PS1) 10Elukey: Replace mc2001 with mc2019 in Mediawiki Redis shards [puppet] - 10https://gerrit.wikimedia.org/r/335449 (https://phabricator.wikimedia.org/T155755) [14:48:46] actually no [14:55:50] (03CR) 10Muehlenhoff: "@gehel: it's already removed on (almost) all trusty hosts, this patch is just to deal with trusty reimages." [puppet] - 10https://gerrit.wikimedia.org/r/335444 (owner: 10Muehlenhoff) [14:56:08] 06Operations, 10Analytics, 10DBA: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#2989873 (10Marostegui) LOAD DATA is a lot faster for bulk-loading lots of data into the DB; there is a lot less overhead in parsing SQL statements and all the processes around that parsing. This is... [14:59:16] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#2989879 (10Gehel) I'm going to let this run until Monday January 6 on beta, before starting to migrate production. Let me know if you see anything u... [14:59:55] (03PS2) 10Elukey: Replace mc2001 with mc2019 in Mediawiki Redis/Memcached shards [puppet] - 10https://gerrit.wikimedia.org/r/335449 (https://phabricator.wikimedia.org/T155755) [15:04:37] heya marostegui [15:05:55] I get that LOAD DATA is faster in general, but I don't understand how it's also faster for replication. Does what gets put in the binlog actually differ between stuff that was inserted via SQL vs. LOAD DATA? [15:07:43] tto: do you have a good real-wiki candidate to test your script on once it's reviewed? [15:08:27] tto: if not, I'll leave the pl.wikisource task open and blocked on dev, to validate your script on that one [15:08:30] Dereckson: enwiki of course ;) Seriously though, practically every wiki has some drift in the category table. Pick any small to medium sized wiki [15:08:56] plwikisource sounds as good as any. [15:08:57] Okay, if there are so many candidates, I can do plwikisource now with the legacy populate. [15:09:10] pl.wikisource is small enough for that [15:09:12] It'll be interesting to see what the performance is like [15:09:42] The performance of the script is probably heavily dependent on the size of categorylinks [15:10:07] It doesn't do any batching and uses the rather horrible queries in Category::refreshCounts [15:11:16] Dereckson: tto just do foreach all.dblist :P [15:13:07] Dereckson: how's scap going? :) [15:13:48] TabbyCat: scap-cdb-rebuild: 84% (ok: 266; fail: 0; left: 50) [15:14:08] Dereckson: seeing some changes on MessagesJv already :D [15:14:15] yep, it synced [15:14:33] PROBLEM - Hadoop NodeManager on analytics1053 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:14:49] (03CR) 10Filippo Giunchedi: [C: 032] Require cassandra-wmf-tools and jvm-utils for Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/335213 (owner: 10Elukey) [15:14:58] we'll have to run namespaceDupes.php again, I think [15:15:05] after the full sync. [15:15:14] Dereckson, I'm going now. Please make sure to write on the task how populateCategory went [15:15:48] ok [15:15:55] ottomata: sorry, I am a bit busy.
The behaviour is the same on the slave: the replication thread will send the same thing, and you don't have to parse all the SQL, keep it in memory, etc. [15:16:25] tto: already done [15:16:33] RECOVERY - DPKG on labvirt1005 is OK: All packages OK [15:16:50] Dereckson, ok. Must be really small then :) [15:17:17] https://phabricator.wikimedia.org/T156670#2989908 [15:17:32] !log dereckson@tin Finished scap: Full scap to propagate a core namespace l10n change (duration: 40m 10s) [15:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:40] !log `mwscript populateCategory.php plwikisource --force` to refresh category stats (T156670) [15:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:44] T156670: Refresh category counts for plwikisource - https://phabricator.wikimedia.org/T156670 [15:18:53] 06Operations, 10DBA, 13Patch-For-Review: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2989913 (10Cmjohnson) Requested a new BBU from Dell..will update task with shipping information. Confirmed: Request 943118173 was successfully submitted. [15:19:03] RECOVERY - DPKG on labtestcontrol2001 is OK: All packages OK [15:19:53] PROBLEM - puppet last run on db1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:19:58] TabbyCat: so I've run both again, and nope, they didn't create any page using the new namespace names before syncing [15:20:12] Dereckson: I'd call this resolved then [15:21:32] 06Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T155907#2989919 (10Cmjohnson) A new disk has been ordered with Dell. I will update task once the request has been filled Confirmed: Request 943118422 was successfully submitted. [15:22:15] 06Operations, 10DBA, 10Monitoring: Create a check/calendar alert for MariaDB TLS certs - https://phabricator.wikimedia.org/T152427#2989923 (10jcrespo) MySQL, when compiled with openssl support, provides a very easy way to check the time: ``` | Ssl_server_not_after | Jun 29 21:52:32 2020 GMT | Ssl_se... [15:23:33] RECOVERY - Hadoop NodeManager on analytics1053 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:26:40] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2989959 (10Marostegui) Hey @epriestley - thanks for the fast response! >>! In T156905#2989711, @epriestley wrote: > I'm not sure why the query is slow or requi... [15:32:14] (03PS2) 10Elukey: Require cassandra-wmf-tools and jvm-utils for Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/335213 [15:33:06] (03PS5) 10Muehlenhoff: Provide a systemd override unit for memcached [puppet] - 10https://gerrit.wikimedia.org/r/319820 [15:35:13] (03CR) 10jerkins-bot: [V: 04-1] Provide a systemd override unit for memcached [puppet] - 10https://gerrit.wikimedia.org/r/319820 (owner: 10Muehlenhoff) [15:36:56] 06Operations, 10scap, 06Release-Engineering-Team (Long-Lived-Branches): Make git 2.2.0+ (preferably 2.8.x) available - https://phabricator.wikimedia.org/T140927#2989978 (10fgiunchedi) @demon I was indeed able to build stretch's git as-is on jessie, resulting in `2.11.0-2~bpo8+1`. Uploading it internally to `
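To make marostegui's replication point concrete: with statement-based replication the slave receives the LOAD DATA event together with the file contents and replays the bulk load itself, instead of parsing row-by-row INSERTs. A sketch, with an invented table and file (the `log` database name is also an assumption here):

```
# Hypothetical EventLogging-style bulk import; names are illustrative only.
mysql log <<'SQL'
LOAD DATA INFILE '/tmp/events.tsv'
INTO TABLE SomeSchema_12345678
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';
SQL
```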
[15:37:04] (03PS6) 10Muehlenhoff: Provide a systemd override unit for memcached [puppet] - 10https://gerrit.wikimedia.org/r/319820 [15:37:16] !log upgrading canary app servers to new HHVM package (initially mwdebug and mw1261) [15:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:50] RECOVERY - puppet last run on db1026 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [15:40:35] !log preparing db1067 for reimage to jessie [15:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:05] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2989990 (10Marostegui) >>! In T156905#2989959, @Marostegui wrote: > > ``` > | 50776 | root | localhost | phabricator_search | Query | 605 |... [15:41:48] (03CR) 10Dzahn: [C: 031] Remove ntfs-3g on precise/trusty [puppet] - 10https://gerrit.wikimedia.org/r/335444 (owner: 10Muehlenhoff) [15:42:40] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:42:44] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2989991 (10epriestley) Hrrm. For comparison, a similar inner query on `secure.phabricator.com` (searching for "phabricator" instead of "affecting translatewiki.ne... [15:42:50] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:00] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:00] PROBLEM - puppet last run on mc1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:00] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:10] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:14] oh? [15:43:24] !log restart puppetdb on nitrogen [15:43:26] that would be me [15:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:40] PROBLEM - puppet last run on etcd1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:40] PROBLEM - puppet last run on mw1243 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:40] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:40] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:15] Failed to find facts from PuppetDB at nitrogen.eqiad.wmnet [15:44:17] ah [15:44:32] jynus: yeah.. expected.. openjdk security upgrade [15:44:34] I thougt it was software related [15:44:37] no problem [15:44:45] (03CR) 10Andrew Bogott: [C: 031] "I defer to Hashar about what the reasonable thresholds are, but otherwise this looks great." 
[puppet] - 10https://gerrit.wikimedia.org/r/335373 (owner: 10Rush) [15:46:00] !log restart puppetdb on nihal (openjdk upgrade) [15:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:11] and now we have to endure the puppet alert storm [15:46:18] (03CR) 10Andrew Bogott: [C: 031] "I presume the intent is that these thresholds represent critical failures? Otherwise we might want to have warning levels as well." [puppet] - 10https://gerrit.wikimedia.org/r/335373 (owner: 10Rush) [15:46:25] hmm, maybe we should just re-run puppet once when it fails [15:46:42] it should solve most of these problems [15:46:49] sounds like a good idea, yea [15:47:26] with a random pause [15:47:40] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [15:47:48] jynus: it's on an --splay anyway [15:47:55] ah, nice [15:47:57] it has the random pause embedded anyway [15:48:00] PROBLEM - puppet last run on mc2023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:48:00] PROBLEM - puppet last run on wtp2010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:48:00] PROBLEM - puppet last run on mwlog2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:48:00] PROBLEM - puppet last run on mw2259 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:48:00] PROBLEM - puppet last run on mw2110 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:48:06] I didn't know about that [15:48:10] PROBLEM - puppet last run on mw2142 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:48:10] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:48:10] PROBLEM - puppet last run on db1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:48:38] 06Operations, 10Analytics, 10DBA: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#1952524 (10Nuria) @ottomata: we do not delete data from eventlogging (other than the purging that should happen after 90 days); the system just inserts batches of records. [15:48:40] PROBLEM - puppet last run on etherpad1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:48:40] PROBLEM - puppet last run on mw1194 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:48:42] we can always just do /etc/init.d/ircecho stop to kill the bot, and once it comes back by itself on the next run, most of them are usually over [15:48:50] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:48:50] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:48:50] PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:49:03] mutante: yeah, good idea [15:49:16] I'll have to hand it to puppetdb though..
3 months 5 days without a hiccup [15:50:10] !log stop ircecho for a while to weather out most of the puppet alert storm [15:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:39] 06Operations, 10scap, 06Release-Engineering-Team (Long-Lived-Branches): Make git 2.2.0+ (preferably 2.8.x) available - https://phabricator.wikimedia.org/T140927#2990011 (10demon) That seems like a reasonable course to go for now. Then after further testing, perhaps roll it out further to the rest of the serv... [15:52:03] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2990013 (10Paladox) What if we dropped the phabricator_search table, then re-created it and then did the reindexing for mysql, would that work? [15:53:46] 06Operations, 10Analytics, 10DBA: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#2990018 (10jcrespo) > purging that should happen after 90 days How do you implement purging? That surely must run deletes or some kind of updates? [15:53:51] (03CR) 10Rush: "Andrew, right, I'm open to whatever. I think most initial values won't survive the month but we'll learn, so my thinking is to go straigh" [puppet] - 10https://gerrit.wikimedia.org/r/335373 (owner: 10Rush) [15:54:27] (03PS1) 10Hoo man: Use 5 instead of 4 shards when dumping Wikidata [puppet] - 10https://gerrit.wikimedia.org/r/335457 [15:56:03] (03CR) 10Andrew Bogott: [C: 031] "> go straight crit for these conditions" [puppet] - 10https://gerrit.wikimedia.org/r/335373 (owner: 10Rush) [15:56:37] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2990019 (10Marostegui) >>! In T156905#2989991, @epriestley wrote: > Hrrm. For comparison, a similar inner query on `secure.phabricator.com` (searching for "phabri... [15:58:58] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2990020 (10epriestley) Here's another possible formulation of the query based on Googling "FULLTEXT initialization", although I haven't yet found a real descripti... [16:02:28] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2990021 (10jcrespo) Maybe tuning `innodb_ft_cache_size` and `innodb_ft_total_cache_size` is needed. Little to no tuning was done after converting the table from M... [16:03:32] (03CR) 10DCausse: Deploy TextCat Improvements (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334729 (https://phabricator.wikimedia.org/T149324) (owner: 10Tjones) [16:04:01] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2990030 (10epriestley) Ah! That seems pretty broken -- the same simple query takes `0.03s` on `secure.phabricator.com`, so yours is ~22,000x slower for ~10x more... [16:06:29] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2990034 (10jcrespo) > Is that something you're comfortable trying? Yes, that makes a lot of sense, too.
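A sketch of what jcrespo's innodb_ft_* tuning could look like in practice. Both cache-size variables are read-only at runtime, so tuning goes through my.cnf plus a restart; the values below are purely illustrative, and the re-test query assumes the stock Phabricator schema (FULLTEXT index on search_documentfield.corpus):

```
# Inspect the current full-text settings:
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_ft%'"

# Tuning means editing my.cnf and restarting mysqld, e.g.:
#   [mysqld]
#   innodb_ft_cache_size       = 80M     # per-table tokenization buffer
#   innodb_ft_total_cache_size = 1600M   # global cap across all FT caches

# After the restart (or an OPTIMIZE TABLE rebuild), re-time the simple
# case that was pathologically slow:
mysql phabricator_search -e "SELECT COUNT(*) FROM search_documentfield WHERE MATCH(corpus) AGAINST('+phabricator' IN BOOLEAN MODE)"
```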
[16:07:08] (03PS6) 10Paladox: varnish misc: add phab2001 as a backend for phab-new [puppet] - 10https://gerrit.wikimedia.org/r/324797 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [16:07:34] (03CR) 10Paladox: "Rebased and also updated it to use the new refracted layout @bblack did." [puppet] - 10https://gerrit.wikimedia.org/r/324797 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [16:07:42] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2990035 (10epriestley) Tuning `innodb_ft_*` parameters may also be fruitful, but I don't have any direct experience with it to provide guidance. [16:08:27] (03CR) 10Paladox: varnish misc: add phab2001 as a backend for phab-new (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/324797 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [16:08:33] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2990036 (10Marostegui) >>! In T156905#2990030, @epriestley wrote: > Ah! That seems pretty broken -- the same simple query takes `0.03s` on `secure.phabricator.com... [16:10:40] RECOVERY - puppet last run on etcd1005 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [16:11:10] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [16:11:40] RECOVERY - puppet last run on mw1243 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [16:11:40] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:11:40] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [16:11:46] (03CR) 10Paladox: varnish misc: add phab2001 as a backend for phab-new (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324797 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [16:11:50] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [16:12:00] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [16:12:00] 06Operations, 06Performance-Team, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc) - https://phabricator.wikimedia.org/T156922#2990060 (10Joe) [16:12:00] RECOVERY - puppet last run on mc1009 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [16:12:10] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [16:15:50] RECOVERY - puppet last run on oresrdb1001 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [16:16:00] RECOVERY - puppet last run on wtp2010 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:16:00] RECOVERY - puppet last run on mc2023 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [16:16:00] RECOVERY - puppet last run on mw2259 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [16:16:00] RECOVERY - puppet last run on mwlog2001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:16:10] 
RECOVERY - puppet last run on mw2142 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [16:16:12] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [16:16:12] RECOVERY - puppet last run on db1034 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:16:40] RECOVERY - puppet last run on mw1194 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:16:40] RECOVERY - puppet last run on etherpad1001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:16:50] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [16:17:00] RECOVERY - puppet last run on mw2110 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [16:17:10] 06Operations, 10ops-codfw, 06Discovery, 10Wikidata, and 2 others: Adjust balance of WDQS nodes to allow continued operation if eqiad went offline. - https://phabricator.wikimedia.org/T124627#2990085 (10Gehel) 05Open>03Resolved a:03Gehel This is actually done for some time, we have 3 nodes in codfw, m... [16:17:50] RECOVERY - puppet last run on labvirt1009 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [16:18:45] !log Optimizing table search_documentfield on db1048 - T156905 [16:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:50] T156905: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905 [16:20:21] godog: I want to make a collector to gather stats about a couple of REST apis (specifically I want to know what % of requests are rejected, but latency might also be nice.) Can you point me to an existing state-of-the-art example for doing something like this? [16:20:45] <_joe_> andrewbogott: in python? [16:20:52] <_joe_> the client docs are pretty good [16:20:54] !log restarting Yarn Node Manager daemons on all the Hadoop nodes to bandaid a memory leak causing OOMs [16:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:12] <_joe_> I've used them yesterday for the new thing I've been writing [16:21:45] _joe_: In part, I've lost track of what the best practice is these days. Are we still writing diamond collectors like before, or is there a new thing I need to do to get with the Prometheus future? [16:22:13] <_joe_> sorry, I thought you were talking about a prometheus collector [16:22:31] well, exactly, I don't even know what I want :/ [16:22:37] just what I want to measure [16:23:07] <_joe_> ok then godog is probably the authority :P [16:23:31] * godog wearing the authority hat [16:24:09] andrewbogott: do you control said api code? also as _joe_ mentioned is it python? [16:24:19] (03PS1) 10Giuseppe Lavagetto: Initial commit of etcd2-mirror [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/335460 [16:24:46] (03Abandoned) 10Giuseppe Lavagetto: Initial commit of etcd2-mirror [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/335246 (owner: 10Giuseppe Lavagetto) [16:24:51] godog: They are in python but in general I don't control them inasmuch as they're from upstream openstack packages. 
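Since the APIs are upstream OpenStack code that can't easily be instrumented from the inside, one blackbox-ish way to get the two numbers andrewbogott is after (refused-connection rate and latency) is a repeated curl probe. A rough sketch, against a hypothetical endpoint; a real collector would ship these numbers to graphite rather than a log file:

```
#!/bin/bash
URL=http://openstack-api.example.wmnet:35357/v3   # placeholder endpoint

for i in $(seq 1 100); do
    # -w prints the status code and total time; a refused connection makes
    # curl exit non-zero, which we record explicitly.
    out=$(curl -s -o /dev/null -m 5 -w '%{http_code} %{time_total}' "$URL") \
        || out="refused -"
    echo "$(date +%s) $out"
    sleep 1
done | tee /tmp/api-probe.log

# Percentage of probes that did not come back 2xx/3xx:
awk '$2 !~ /^[23]/ {bad++} END {printf "%.1f%% failed\n", 100*bad/NR}' /tmp/api-probe.log
```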
[16:24:57] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2990100 (10epriestley) If we can't get the trivial `SELECT * WHERE MATCH(...)` case performing quickly, I believe the engine isn't going to be usable no matter ho... [16:25:31] godog: I was imagining the collector would just do 'time curl' or something to that effect. [16:27:06] andrewbogott: ah ok. Is the api exposing such stats already, so you're interested in importing them somewhere for grafana/icinga to act on? Or is it more like a blackbox probe: issue requests to the api like a client would and see what the result is? [16:28:14] godog: the particular problem I'm trying to debug is that the api just refuses connections at random times, probably due to a uwsgi config problem [16:28:19] so I definitely need an external test [16:28:33] (as far as the backend is concerned, those requests aren't even happening.) [16:28:43] so, yeah, blackboxish [16:31:16] andrewbogott: ah ok, I see two ways ATM. One is icinga check_url and friends; the other, which you could do in parallel if the api is accessible from tools, is to add a check for blackbox_exporter there. We don't have blackbox_exporter yet for prometheus in production :( [16:32:12] godog: I don't think icinga is what I want, since it doesn't go from all failing to all working; rather, some % of things arbitrarily fail [16:32:17] so I need metrics first, and alerting second [16:32:20] (unless I'm misunderstanding) [16:35:10] (03CR) 10ArielGlenn: [C: 032] Use 5 instead of 4 shards when dumping Wikidata [puppet] - 10https://gerrit.wikimedia.org/r/335457 (owner: 10Hoo man) [16:35:45] andrewbogott: ah ok, then yeah, a check from blackbox_exporter in tools would be the easiest way there, assuming the api is reachable from tools [16:35:58] it isn't [16:36:15] but I can just write a scratch diamond collector, if that's the only way forward currently [16:36:41] yup, that'd work too [16:36:46] ok [16:36:55] just trying to stay a la mode :) [16:37:05] haha indeed, thanks for asking! [16:37:40] PROBLEM - Hadoop NodeManager on analytics1045 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:39:40] RECOVERY - Hadoop NodeManager on analytics1045 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:41:40] !log mariadb rolling restart of db2037, db2044, db2051, db2058, db2065 [16:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:55] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2990129 (10Paladox) We have elastic search as an experimental option. You can enable it through the pref by going to Developer Settings and under the elastic sear... [16:43:44] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2990131 (10Marostegui) Good news: ``` root@MISC m3[phabricator_search]> optimize table search_documentfield; Stage: 1 of 2 'copy to tmp table' 0.023% of stage d...
[16:45:35] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Allow to integrate data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#2990133 (10Joe) [16:48:22] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2990173 (10Marostegui) And the big query finished in `5 rows in set (1 min 59.41 sec)`: ``` root@MISC m3[phabricator_search]> SELECT documentPHID, MAX(fieldScore... [16:49:11] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2990174 (10Paladox) but our only problem with elasticsearch is you can't reindex both indexes. So elasticsearch index is out of date. we need someway to be able... [16:49:17] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2990175 (10Nemo_bis) >>! In T156905#2989343, @Volans wrote: > Looks like there was some heavy load on the server in the ~20 minutes before the OOM: > > https://g... [16:49:36] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2990176 (10jcrespo) Let's still try to tune the innodb parameters. 2 minutes is also too much for a simple search. [16:51:30] PROBLEM - Hadoop NodeManager on analytics1054 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:52:00] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2990178 (10Paladox) >>! In T156905#2990175, @Nemo_bis wrote: >>>! In T156905#2989343, @Volans wrote: >> Looks like there was some heavy load on the server in the... [16:52:30] RECOVERY - Hadoop NodeManager on analytics1054 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:54:22] !log Optimize table phabricator_search.search_documentfield on db2012 - T156905 [16:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:26] T156905: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905 [16:56:20] PROBLEM - Hadoop NodeManager on analytics1042 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:57:50] sorry about the noise, all under control [16:58:20] RECOVERY - Hadoop NodeManager on analytics1042 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [17:04:29] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2990225 (10Marostegui) >>! In T156905#2990176, @jcrespo wrote: > Let's still try to tune the innodb parameters. 2 minutes is also too much for a simple search. A... [17:09:59] SWAT deploys happen times a day, as needed? 
[17:10:05] *three times [17:10:46] I found the page on Wikitech, nvm :) [17:12:10] musikanimal: but if you bribe the right persons with the right ammount of stroopwaffels they can deploy code just for you outside the schedule [17:12:25] good to know [17:12:33] now that you know the secret, I want stroopwaffels too :P [17:12:39] hehe [17:18:07] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#2990353 (10Ottomata) [17:19:40] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#2987618 (10Ottomata) @Marostegui, @jcrespo, we talked about this today. What is your timeline for replacing these boxes? We want to try to ween people off of EventLogging My... [17:25:49] 06Operations, 06Services (next), 15User-Joe, 15User-mobrovac, 05codfw-rollout: Create a service location / discovery system for locating local/master resources easily across all WMF applications - https://phabricator.wikimedia.org/T125069#2990382 (10Joe) 05Open>03Resolved [17:26:17] 06Operations, 06Services (next), 15User-Joe, 15User-mobrovac, 05codfw-rollout: Create a service location / discovery system for locating local/master resources easily across all WMF applications - https://phabricator.wikimedia.org/T125069#1973660 (10Joe) Duplicate of T149617 [17:28:27] 06Operations, 06Analytics-Kanban, 10EventBus, 10Traffic, and 2 others: Productionize and deploy Public EventStreams - https://phabricator.wikimedia.org/T143925#2990397 (10Nuria) 05Open>03Resolved [17:30:49] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 13Patch-For-Review, 15User-Elukey: rack and set up aqs100[7-9] - https://phabricator.wikimedia.org/T155654#2990439 (10Nuria) [17:41:55] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#2990444 (10jcrespo) > What is your timeline for replacing these boxes? The constraint, more than the decommission, is the budget for replacements. I do not know what is the d... [17:44:45] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#2990448 (10jcrespo) > What is your timeline for replacing these boxes? BTW, I forgot to answer literally your question, the deadline for replacement is January 2014 (not a ty... [17:49:03] 06Operations, 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure, 07HHVM: MediaWiki tests causes HHVM to segfault on Jessie - https://phabricator.wikimedia.org/T156923#2990453 (10hashar) [17:54:45] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2990460 (10Paladox) >>! In T156905#2989959, @Marostegui wrote: > Hey @epriestley - thanks for the fast response! > > > >>>! In T156905#2989711, @epriestley wro... [17:57:37] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: MediaWiki tests causes HHVM to segfault on Jessie - https://phabricator.wikimedia.org/T156923#2990468 (10hashar) **I am not available for the next two hours. Hopefully back at 8pm UTC** Updated the task details but I guess it... 
[17:57:56] 06Operations, 06Release-Engineering-Team, 05DC-Switchover-Prep-Q3-2016-17: Understand the preparedness of misc services for datacenter switchover - https://phabricator.wikimedia.org/T156937#2990470 (10jcrespo) [17:58:53] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07HHVM: MediaWiki tests causes HHVM to segfault on Jessie - https://phabricator.wikimedia.org/T156923#2990497 (10greg) [18:02:39] 06Operations, 06Release-Engineering-Team, 05DC-Switchover-Prep-Q3-2016-17: Understand the preparedness of misc services for datacenter switchover - https://phabricator.wikimedia.org/T156937#2990505 (10jcrespo) The reason I created this ticket is because, as a DBA, I have to support some of those services bel... [18:05:31] 06Operations, 10Gerrit, 06Release-Engineering-Team, 10hardware-requests, 13Patch-For-Review: Requesting 1 spare misc box for Gerrit in codfw - https://phabricator.wikimedia.org/T148187#2990512 (10jcrespo) [18:05:34] 06Operations, 10Dumps-Generation: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#2990514 (10jcrespo) [18:05:38] 06Operations: Setup basic infrastructure services in codfw - https://phabricator.wikimedia.org/T84350#2990515 (10jcrespo) [18:05:41] 06Operations, 06Release-Engineering-Team, 05DC-Switchover-Prep-Q3-2016-17: Understand the preparedness of misc services for datacenter switchover - https://phabricator.wikimedia.org/T156937#2990511 (10jcrespo) [18:07:07] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 05codfw-rollout: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2990520 (10jcrespo) [18:07:12] 06Operations, 06Release-Engineering-Team, 05DC-Switchover-Prep-Q3-2016-17: Understand the preparedness of misc services for datacenter switchover - https://phabricator.wikimedia.org/T156937#2990470 (10jcrespo) [18:09:18] 06Operations: Miscellaneous servers to track in eqiad for possible inclusion in codfw misc virt cluster - https://phabricator.wikimedia.org/T88761#2990528 (10jcrespo) [18:09:22] 06Operations, 06Release-Engineering-Team, 05DC-Switchover-Prep-Q3-2016-17: Understand the preparedness of misc services for datacenter switchover - https://phabricator.wikimedia.org/T156937#2990527 (10jcrespo) [18:14:15] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, and 2 others: Setup a private elasticsearch cluster for phabricator - https://phabricator.wikimedia.org/T156939#2990532 (10Paladox) [18:14:51] PROBLEM - Check whether ferm is active by checking the default input chain on mw1259 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:15:07] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, and 2 others: Setup a private elasticsearch cluster for phabricator - https://phabricator.wikimedia.org/T156939#2990550 (10Paladox) p:05Triage>03High Changing to high since the db keeps crashing due to full text indexes. T156905 [18:15:41] RECOVERY - Check whether ferm is active by checking the default input chain on mw1259 is OK: OK ferm input default policy is set [18:17:00] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, and 2 others: Setup a private elasticsearch cluster for phabricator - https://phabricator.wikimedia.org/T156939#2990556 (10Paladox) [18:17:58] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, and 2 others: Setup a private elasticsearch cluster for phabricator - https://phabricator.wikimedia.org/T156939#2990532 (10Paladox) Also per T155299#2975104 suggestion there. 
[18:29:01] 06Operations, 10Gerrit, 06Release-Engineering-Team: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#2990610 (10RobH) [18:29:39] 06Operations, 10Gerrit, 06Release-Engineering-Team: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#2851729 (10RobH) a:05RobH>03Papaul Please update this task with the network port this system is plugged into. I neglected to ask you to do that via the sub task. Then assign... [18:31:14] 06Operations, 06Performance-Team, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc) - https://phabricator.wikimedia.org/T156922#2990653 (10Krinkle) Initial sketch for the warmup of Memcached (... [18:32:15] 06Operations, 10ops-codfw, 10Gerrit, 06Release-Engineering-Team: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#2990670 (10RobH) [18:32:48] PROBLEM - DPKG on db2037 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:33:48] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.111 second response time [18:34:23] 06Operations, 10ops-codfw, 10Gerrit, 06Release-Engineering-Team: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#2990675 (10Papaul) ge-5/0/10 [18:34:35] !log restbase deploy start of 96a641aa [18:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:48] RECOVERY - DPKG on db2037 is OK: All packages OK [18:35:02] 06Operations, 10ops-codfw, 10Gerrit, 06Release-Engineering-Team: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#2990676 (10RobH) a:05Papaul>03RobH [18:36:58] PROBLEM - puppet last run on mw1293 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:38:00] 06Operations, 10Gerrit, 06Release-Engineering-Team: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#2990684 (10RobH) [18:43:57] 06Operations, 06Performance-Team, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc) - https://phabricator.wikimedia.org/T156922#2990719 (10Volans) For the requirements it might be helpful to s... [18:44:25] 06Operations, 10Gerrit, 06Release-Engineering-Team: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#2990720 (10RobH) [18:46:43] !log joal@tin Started deploy [analytics/refinery@2b9a70a]: (no justification provided) [18:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:14] 06Operations, 10Gerrit, 06Release-Engineering-Team: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#2990763 (10RobH) [18:49:16] !log joal@tin Finished deploy [analytics/refinery@2b9a70a]: (no justification provided) (duration: 02m 33s) [18:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:04] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07HHVM: MediaWiki tests causes HHVM to segfault on Jessie - https://phabricator.wikimedia.org/T156923#2990073 (10thcipriani) >>! In T156923#2990468, @hashar wrote: > To spawn an instance, Nodepool pick the youngest one. I g...
[18:54:11] (03PS1) 10Ottomata: Initial deb packaging [debs/python-ua-parser] (debian) - 10https://gerrit.wikimedia.org/r/335468 [18:54:36] (03PS2) 10Ottomata: Initial deb packaging [debs/python-ua-parser] (debian) - 10https://gerrit.wikimedia.org/r/335468 (https://phabricator.wikimedia.org/T156821) [18:56:49] (03CR) 10Ottomata: [C: 032] Initial deb packaging [debs/python-ua-parser] (debian) - 10https://gerrit.wikimedia.org/r/335468 (https://phabricator.wikimedia.org/T156821) (owner: 10Ottomata) [18:58:18] PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 2 down 1 [18:58:38] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 2 down 1 [18:59:28] PROBLEM - https://phabricator.wikimedia.org on phab2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:59:38] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string focus on bug not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 4149 bytes in 0.041 second response time [18:59:53] Phabricator seems down again. [18:59:58] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.122 second response time [19:00:01] "#2013: Lost connection to MySQL server at 'reading initial communication packet', system error: 0" [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170201T1900). Please do the needful. [19:00:06] hmmm... phab down [19:00:12] db down again? [19:00:18] RECOVERY - https://phabricator.wikimedia.org on phab2001 is OK: HTTP OK: HTTP/1.1 200 OK - 26725 bytes in 0.282 second response time [19:00:23] #1290: The MariaDB server is running with the --read-only option so it cannot execute this statement [19:00:32] that is because the master has crashed [19:00:36] Checking could be the same thing that happened again [19:00:36] same '#1290: The MariaDB server is running with the --read-only option so it cannot execute this statement' [19:00:37] and we do not allow automatic [19:00:38] Okay, ot [19:00:38] damn [19:00:38] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 26723 bytes in 0.448 second response time [19:00:41] paged for phab [19:00:44] change to the slave [19:00:44] and it's already back huh [19:00:46] It's back for me* [19:01:07] not here [19:01:10] Not for me, so ymmv [19:01:13] not me. [19:01:22] Uh, kinda. Tasks won't open but dashboard loads. 
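The '#1290' errors above are what clients see when the proxy has failed over to a replica that is still running read-only. A minimal sketch of confirming that state, assuming generic MariaDB tooling (the host name follows the discussion; nothing below is a command actually quoted in this log):

```
# 1 means the server refuses writes, which is exactly what surfaces in
# Phabricator as MySQL error #1290.
mysql -h db1048.eqiad.wmnet -e "SELECT @@read_only;"
```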
[19:01:47] I still have nothing [19:01:52] phab is down, and it will be until db restarts [19:01:52] but that error I mean [19:01:54] works logged out [19:01:55] kk [19:01:55] twentyafterfour: we can delay our 1:1 [19:01:58] (03PS1) 10Ottomata: Install python-ua-parser as an eventlogging dependency [puppet] - 10https://gerrit.wikimedia.org/r/335470 (https://phabricator.wikimedia.org/T156821) [19:02:01] twentyafterfour: ahh [19:02:09] tx robh [19:02:23] i dont think we have a clinic duty person this week still =P [19:02:27] I can try to optimize the table on the master (it will be locked for 10 minutes) [19:02:30] none got assigned in ops meeting [19:02:38] Which seemed to alleviate the issue on the slave [19:02:42] I actually prefer to fail over to the slave [19:02:52] Sounds good to me too [19:02:58] see if it works [19:03:09] let's setup replication 48 -> 43 [19:03:37] jynus: Do we want to kill phab itself so we don't have any attempted queries? [19:03:42] while you swap [19:03:47] ostriches, set it in read only mode [19:03:50] on the app [19:03:54] Ok, lemme look [19:04:09] it is read only precisely for protection [19:04:13] so it is not a huge deal [19:04:23] but it may be at least queryable, maybe [19:04:36] (03CR) 10Ottomata: [C: 032] Install python-ua-parser as an eventlogging dependency [puppet] - 10https://gerrit.wikimedia.org/r/335470 (https://phabricator.wikimedia.org/T156821) (owner: 10Ottomata) [19:05:05] marostegui, the slave is stopped [19:05:13] did it crash? [19:05:17] or did you stop it? [19:05:22] nope, i didn't stop it [19:05:23] or it wasn't running? [19:05:30] it was running [19:05:40] it crashed [19:05:47] buff [19:05:57] not sure now whether to use it [19:05:58] RECOVERY - puppet last run on mw1293 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [19:06:02] so not much point in failing over I guess [19:06:14] !log restbase deploy end of 96a641aa [19:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:32] ostriches, also prepare to roll back to the previous phab version [19:06:46] Hmm, not sure how to go read-only [19:06:46] this one crashed 2 times in the last 24 hours [19:06:52] * ostriches can't find an obvious setting [19:07:06] OOM again [19:07:18] marostegui, any tip [19:07:28] I would reload the proxy, let it crash [19:07:33] disable the failover [19:07:40] or run the alter now [19:07:41] We can try to tune the innodb_ft flags [19:07:52] but it didn't work for 48 either [19:07:53] yeah, let me run the alter now, just in case [19:07:55] at least it is done [19:08:00] marostegui: mariadb oom or oom_killer?
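For context, "setup replication 48 -> 43" would mean repointing one host at the other, roughly as below. This is a hedged sketch with placeholder credentials and binlog coordinates, not the commands actually run (the plan was dropped once the slave turned out to have crashed too):

```
# On the host that is to become the replica (db1043), point it at db1048.
# Binlog file/position and user are placeholders, not values from this log.
mysql -h db1043.eqiad.wmnet -e "
  CHANGE MASTER TO
    MASTER_HOST='db1048.eqiad.wmnet',
    MASTER_USER='repl',
    MASTER_LOG_FILE='db1048-bin.000001',
    MASTER_LOG_POS=4;
  START SLAVE;
"
```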
[19:08:05] remember to bin_log = 0 [19:08:16] yep [19:08:24] !log scheduling 10 minutes of emergency downtime on phabricator [19:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:38] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2948.30 Read Requests/Sec=2637.70 Write Requests/Sec=16.70 KBytes Read/Sec=13157.20 KBytes_Written/Sec=113.20 [19:08:49] ostriches, you have 10 minutes :-) [19:09:09] running [19:09:10] but I remember twentyafterfour commented an option last tie [19:09:13] Ah, cluster.read-only [19:09:14] *time [19:09:37] jynus: twentyafterfour is coming to talk rollback options [19:09:50] we have several backups [19:10:06] bleehhh, accidental [19:10:08] !log phabricator: now in read-only mode [19:10:10] (03PS1) 10RobH: setup params for gerrit2001 [puppet] - 10https://gerrit.wikimedia.org/r/335475 [19:10:11] silly client behavior [19:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:20] does read only work better? [19:10:39] jynus: It's at least serving pages now [19:10:42] it does, to me [19:10:52] ok robh let's change that topic to read-only [19:11:00] anyone can change it ;] [19:11:03] but will do [19:11:13] sorry gnome-shell crashed [19:11:16] alter has gone 25% thru now [19:11:33] jynus: how about we just switch to elasticsearch backend? [19:11:36] !log remaining 7 minutes with phabricator up, but read-only [19:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:49] twentyafterfour: I set cluster.read-only to true on phab2001 and iridium, I'm out of your way now though [19:11:54] (03CR) 10RobH: [C: 032] setup params for gerrit2001 [puppet] - 10https://gerrit.wikimedia.org/r/335475 (owner: 10RobH) [19:11:56] twentyafterfour, let me confirm it was a search query [19:12:02] (03PS2) 10RobH: setup params for gerrit2001 [puppet] - 10https://gerrit.wikimedia.org/r/335475 [19:12:21] for those not aware: https://phabricator.wikimedia.org/T156905 [19:12:24] if yes, I will probably agree with you [19:12:27] that is what has been discussed today [19:12:36] I have the elasticsearch back-end in good shape and the index was updated yesterday so it's ready to go really [19:12:54] twentyafterfour: i would go for that while we find a proper solution for this issue indeed [19:13:09] (if it was a select) [19:13:12] yeah, it is the same query [19:13:16] :( [19:13:18] https://tendril.wikimedia.org/report/slow_queries?host=family%3Adb1043&hours=1 [19:13:20] alter 50% done [19:13:29] let's switch to ES? [19:13:32] marostegui, I can take care of this [19:13:38] in case you are busy [19:13:53] not much to do at db level aside from that [19:14:04] jynus: so it was a search query that crashed it? [19:14:09] jynus: thanks, I was about to have dinner. The alter is running on a screen [19:14:35] twentyafterfour: check this https://phabricator.wikimedia.org/T156905#2989405 [19:14:40] that is the query that makes it crash [19:16:32] I wonder why the slave crashed, I executed that same select two hours ago to see if the optimize would have any effect [19:16:40] twentyafterfour: Let's move to Elastic. It gets us past our immediate DB woes.
Like we discussed Monday, we can just swap the logic so we can test Mysql while Elastic is primary [19:16:41] and it took "only" two minutes [19:17:00] Cuz I'm sure we'll get a bug or two ;-) [19:17:08] ostriches, the option is to keep crashing [19:17:18] I cannot disagree [19:17:24] +1 to move to ES [19:17:26] ebernhardson: ^ due to an issue causing mariadb search to crash the db hosts we need to switch to ES sooner rather than later [19:17:26] I'm on it [19:17:29] obviously, I would have preferred to do it later [19:17:35] greg-g: Thx was just about to ping [19:17:35] ebernhardson: like now [19:17:39] XD [19:17:58] I don't think this will put too much load on elastic but I may be wrong [19:17:58] check that we do not create a worse problem [19:18:04] on elastic :-) [19:18:06] ebernhardson: I know we said we'd talk to you before we switch full on, but, we won't have a non-read-only phab until we switch at this point, afaict [19:18:13] jynus: I'll be keeping a close eye on the load. [19:18:14] Deskana: ^ [19:18:21] Deskana: see my pings to ebernhardson [19:18:28] jynus: But since we already have the index built, disk usage and such is already accounted for [19:18:48] I'm pretty damn sure compared to the wiki search traffic Phab will be barely noticeable [19:18:51] Deskana: we'll figure out the correct long term solution asap [19:18:59] Alter table is finished on db1043 [19:19:04] ok [19:19:14] then I will switch back to 43 [19:19:18] ok? [19:19:21] yep [19:19:27] go ahead [19:19:32] greg-g: If this is urgent then do whatever's necessary and we'll figure it out later. [19:19:33] we will be in phab read only, but db rw [19:20:16] Deskana: thanks, more later after the dust settles :) [19:20:29] !log reloading haproxy on dbproxy1003 [19:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:38] RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0 [19:20:58] backend should point back to 1003 [19:21:07] we can remove 48 from rotation [19:21:29] and setup some query watchers [19:21:43] as a db-level patch [19:21:48] (03PS1) 10Dereckson: Set site name for ku.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335478 (https://phabricator.wikimedia.org/T29878) [19:22:27] we can do that now [19:22:45] for twentyafterfour, we can put the site back in read write [19:23:11] (03PS1) 1020after4: Phabricator: enable elasticsearch backend [puppet] - 10https://gerrit.wikimedia.org/r/335480 [19:23:22] ^ there is the puppet patch to switch to elastic [19:23:50] ok to deploy, then?
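Two of the steps above, sketched for readers following along. First, "remember to bin_log = 0": the table rebuild is kept out of the binary log so it does not replicate. The Phabricator fulltext table name below comes from the stock Phabricator schema and is an assumption here:

```
# Rebuild the fulltext-indexed table on the master only; disabling
# sql_log_bin keeps the ALTER out of replication.
mysql -h db1043.eqiad.wmnet phabricator_search -e "
  SET SESSION sql_log_bin = 0;
  ALTER TABLE search_documentfield ENGINE=InnoDB;
"
```

Second, the "reloading haproxy on dbproxy1003" !log corresponds to something like the following; the admin-socket path is an assumption for a typical haproxy setup, not taken from this log:

```
# Reload the proxy so the backend points back at db1043 ...
sudo systemctl reload haproxy
# ... then confirm the backends are UP, cf. "servers up 2 down 0" above.
echo "show stat" | sudo socat stdio /run/haproxy/haproxy.sock | cut -d, -f1,2,18
```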
[19:24:05] yeah +1 [19:24:13] let's go [19:24:23] (03CR) 1020after4: [C: 031] Phabricator: enable elasticsearch backend [puppet] - 10https://gerrit.wikimedia.org/r/335480 (owner: 1020after4) [19:24:27] (03CR) 10Jcrespo: [C: 032] Phabricator: enable elasticsearch backend [puppet] - 10https://gerrit.wikimedia.org/r/335480 (owner: 1020after4) [19:24:30] twentyafterfour: The read-only swaps are local with ./bin/config [19:24:35] When you're ready to turn off [19:24:42] ostriches: ok [19:24:43] (03CR) 10Jcrespo: [V: 032 C: 032] Phabricator: enable elasticsearch backend [puppet] - 10https://gerrit.wikimedia.org/r/335480 (owner: 1020after4) [19:25:18] (03PS15) 10Eevans: WIP: Enable Prometheus JMX exporter on Cassandra nodes [puppet] - 10https://gerrit.wikimedia.org/r/332535 (https://phabricator.wikimedia.org/T155120) [19:25:38] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=106.70 Read Requests/Sec=163.60 Write Requests/Sec=47.50 KBytes Read/Sec=2178.00 KBytes_Written/Sec=239.20 [19:25:42] can you run puppet there or do I? [19:25:44] !log running puppet on iridium to activate the config change [19:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:50] jynus: I'm on it [19:25:53] good [19:26:47] * Dereckson is considering deploying a config change in SWAT, no objection? [19:26:50] greg-g: did we cut over to using the main elastic cluster or something? [19:26:56] Dereckson: Not the ideal time [19:26:58] chasemp: in progress, yeah [19:27:05] !log disabled read-only in phabricator [19:27:07] that's that config change [19:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:29] I don't know the details so it's honestly just a "someone should check", but there are dumps from there to the dumps server and iirc the elastic index has lots of sensitive data [19:27:30] twentyafterfour: so status is: now using ES but the ES index is probably catching up? [19:27:31] ostriches: ok, let's wait until the infrastructure is fully operational with Phab [19:27:33] and search works [19:27:39] greg-g: right [19:27:42] just a heads up someone should verify there won't be a bad interaction there [19:27:51] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2990866 (10Marostegui) This has happened again and crashed the master and the slave. I have run the ALTER table to optimize the table on the master while we were... [19:27:52] yeah [19:27:54] chasemp: +1 [19:27:58] chasemp: indeed ostriches said he'd watch elastic [19:28:06] phabricator down > wiki search down [19:28:17] Well, watching the load and such there [19:28:18] chasemp: is there something twentyafterfour should do explicitly? I forget how that's handled [19:28:31] But I don't know about dumps. Probably safest to just exclude the whole index for now [19:28:45] ebernhardson: how do the dumps from the elastic indexes work? would this new phab index get somehow included by default?
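The "./bin/config" read-only swap mentioned at 19:24 looks roughly like this on each Phabricator host; the install path is an assumption, while cluster.read-only is the config key named in the log itself:

```
cd /srv/phab/phabricator
./bin/config set cluster.read-only true    # serve pages, refuse writes
# ... and once the database is back in read-write, as logged at 19:27:
./bin/config set cluster.read-only false
```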
[19:28:56] the phabricator index can be rebuilt in a couple of hours so we can consider it transient, no need to back it up [19:29:08] greg-g: I don't know the mechanism just that it exists asking ebernhardson^ [19:29:09] it's not a very big index [19:29:13] * greg-g nods [19:29:13] the thing is [19:29:13] thanks greg-g [19:29:24] db1043 crashing would be normal [19:29:34] but we thought we had mitigated it on db1048 [19:29:44] db1048 crashing was not normal [19:29:46] yeah, tomorrow I will run the query again [19:29:54] Because it didn't crash two hours ago, it is strange [19:30:10] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2990885 (10Paladox) p:05High>03Unbreak! Due to it taking down the db + forcing us to switch to elastic search. I am upping the priority to unbreak. [19:30:27] Not even so much as a blip in elastic's loads. [19:30:36] I'm still digging in the code to make 100% sure that the query won't happen with the elastic backend (only fulltext runs on elastic, the rest of the search queries still go to mysql) [19:31:23] yeah I'm sure load wise it's nothing at all [19:31:42] paladox: not really needed now that we've mitigated the issue [19:31:47] I knew the index size was nothing (and we'd already indexed, so we wouldn't bump in disk usage) [19:31:56] Just confirming my suspicions so nobody can say otherwise :) [19:32:38] PROBLEM - puppet last run on notebook1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:32:46] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2990912 (10greg) p:05Unbreak!>03High Issue has been mitigated (we're using the ES backend for everyone now), lowering priority but keeping open for any follow... [19:32:52] greg-g oh sorry. [19:33:06] ostriches: yep [19:33:40] only thing that scares me is somehow all of the indexed procurement data ending up on dumps.wikimedia.org [19:33:43] paladox: also, we're in talks with discovery re the right way forward with the ES servers, let us handle the hardware/setup requests for that, please (and in the future, please don't make those kind of tasks without talking with us first). [19:34:02] 06Operations: Standardizing our partman recipes - https://phabricator.wikimedia.org/T156955#2990919 (10RobH) [19:34:03] it only creates more confusion [19:34:11] ok sorry. [19:34:21] thanks for understanding [19:34:31] Deskana, greg-g: We should have a meeting to figure out Plans™ [19:34:39] ooo, another meeting!
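A hedged way to watch the Elasticsearch side of the switch, in the spirit of the load-checking above; the host and index name are placeholders, not values from this log:

```
# Document count and size of the Phabricator index; a steady count and
# flat load would corroborate "not even so much as a blip".
curl -s 'http://elasticsearch.example.wmnet:9200/_cat/indices/phabricator?v'
```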
Shouldn't be too complicated [19:34:45] (03CR) 10Eevans: "I set `jmx_exporter_enabled: true` on xenon in patch #15, and ran the Puppet compiler: http://puppet-compiler.wmflabs.org/5308/ (the outpu" [puppet] - 10https://gerrit.wikimedia.org/r/332535 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans) [19:35:18] RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 0 down 0 [19:35:23] (03PS1) 10Urbanecm: Add Wikinews languages (en, pt, ca, fr, de, it) as import sources on eswikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335481 (https://phabricator.wikimedia.org/T156737) [19:35:51] (03PS16) 10Eevans: Enable Prometheus JMX exporter on Cassandra nodes [puppet] - 10https://gerrit.wikimedia.org/r/332535 (https://phabricator.wikimedia.org/T155120) [19:36:19] I think this is the query that crashed mysql, running fast on elastic: https://phabricator.wikimedia.org/search/query/vxuL2mdaDY_q/#R [19:37:11] (03PS17) 10Eevans: Enable Prometheus JMX exporter on Cassandra nodes [puppet] - 10https://gerrit.wikimedia.org/r/332535 (https://phabricator.wikimedia.org/T155120) [19:37:53] who-hoo, Phab works again. Thank you all for fixing it :D [19:38:02] chasemp: It looks like there are no elasticsearch dumps on dumps.wikimedia.org ... at least none that I can see [19:38:18] PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 2 down 1 [19:38:19] (03PS1) 10RobH: gerrit2001 dns update [dns] - 10https://gerrit.wikimedia.org/r/335483 [19:38:35] rfarrand: :) yeah thanks to jynus, marostegui, ostriches for a quick response. :) [19:39:09] twentyafterfour: have to ask dcausse or edsanders I'm not sure what the deal is tbh [19:39:09] Well, that bumped that timeline up a lot... [19:39:13] Move to elastic? Soon -> Now [19:39:14] I am sure if the engineering dept were not empty today in SF, people would be cheering! hehe :) [19:39:31] rfarrand: :) :) [19:39:50] let me reload 1008 too [19:40:15] BTW, rfarrand I saw you asking before- normally we update the topic here about outages [19:40:47] (03CR) 10RobH: [C: 032] gerrit2001 dns update [dns] - 10https://gerrit.wikimedia.org/r/335483 (owner: 10RobH) [19:41:32] jynus: Thanks, yeah, normally I would check. But I just originally assumed I was making the mistake - not that the whole thing was down. [19:41:51] he he [19:42:15] I would make a joke about phabricator being down [19:42:18] RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 2 down 0 [19:42:48] twentyafterfour: there are elasticsearch dumps there, https://dumps.wikimedia.org/other/cirrussearch/ [19:42:53] but do not want to offend anyone, especially because maintainers work very hard [19:42:59] hehe [19:43:01] twentyafterfour: they are done per-wiki though, so phab shouldn't somehow become accidentally included [19:43:20] ebernhardson: opt-in >>>> opt-out, yay [19:43:20] and original developers help a lot [19:43:43] yeah, evan is awesome, did you notice he replied at 6am his local time? :) [19:44:16] it's kind of exciting when something breaks and everyone has to stop everything and fix it... just saying. Maybe not as exciting for the people who have to fix it though. [19:44:24] he is a very early riser almost always [19:44:26] rfarrand, exciting? [19:44:36] but operations is supposed to be boring, very boring. That's when the job was done right :P [19:44:41] rfarrand, do you want to move to ops?
[19:44:48] well, like everyone is canceling meetings and talking about the same thing and it's almost like a snow day [19:44:50] jynus: don't forget, rfarrand is a Search and Rescue person, she gets really excited/high from this stuff :) [19:44:54] I will delegate my pages :-) [19:44:59] jynus: just convince mark to hire me! [19:45:57] (03PS18) 10Mobrovac: RESTBase-Cassandra: Add the topk reporter [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) [19:46:28] (03CR) 10Mobrovac: RESTBase-Cassandra: Add the topk reporter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: 10Mobrovac) [19:47:55] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Implement PoolCounter support in Thumbor - https://phabricator.wikimedia.org/T151066#2990999 (10Gilles) 05Open>03Resolved [19:48:05] I will now start cleaning up all ashes [19:48:17] (03Abandoned) 10Gilles: Configure SMTP for Grafana [puppet] - 10https://gerrit.wikimedia.org/r/328673 (https://phabricator.wikimedia.org/T153167) (owner: 10Gilles) [19:48:36] I think manuel started db1048 replication [19:48:53] phab2001 is complaining a bit [19:48:57] twentyafterfour: jynus: this is probably worthy of an incident report as well [19:49:02] yeah [19:49:15] but I have yet to write a previous one [19:49:18] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:49:21] ebernhardson: thanks for the update on the dumps [19:49:29] 06Operations, 06Performance-Team, 10Thumbor: Implement rate limiter in Thumbor - https://phabricator.wikimedia.org/T151067#2991006 (10Gilles) 05Open>03Resolved [19:49:31] jynus: I can write this one [19:49:42] I'll just fill in the details from the phab task [19:49:55] that would help a lot [19:50:05] :) [19:50:28] Thank you all for fixing Phab :) [19:50:32] can you check 2001? [19:50:37] it is complaining [19:50:40] (03CR) 10Mobrovac: "> IMO on balance these two things don't offset introducing (and having to maintain) our cron resource" [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: 10Mobrovac) [19:50:45] jynus: ok [19:51:05] maybe phab tried to start there? [19:52:01] it may be unrelated [19:52:10] twentyafterfour: is there a debug mode for phab search where i can see the queries it's generating for search? trying to get an idea of whether the 2.x -> 5.x upgrade is going to be a problem [19:52:36] ebernhardson: the 5.x upgrade should not be a problem ... [19:52:40] but I can send you the queries [19:52:49] twentyafterfour: hopefully, the thing is elasticsearch always breaks the api between major versions [19:52:53] I have the debug statements commented out in prod [19:52:59] ebernhardson we have elasticsearch 5 on phab-01, currently no problems. [19:53:04] ok [19:53:16] but not full normal user usage [19:53:20] usual caveat [19:53:26] so I'm not sure why phab2001 says phd should be running (in icinga) [19:53:33] but we do get warnings about the index [19:53:48] (03PS1) 10MarcoAurelio: FlaggedRevs user group name changes for fa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335487 (https://phabricator.wikimedia.org/T156942) [19:53:51] never mind, warning gone now.
[19:53:56] twentyafterfour, maybe it is WIP [19:54:08] if unrelated, do not spend time on it [19:55:08] jouncebot: now [19:55:09] For the next 0 hour(s) and 4 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170201T1900) [19:55:19] can I have a patch for swat? [19:55:28] it's a tiny config change [19:55:44] sorry for coming in during the last 4 minutes [19:55:47] (03PS1) 10Madhuvishy: toollabs: Add temp role and cron to send weekly tools precise deprecation reminders [puppet] - 10https://gerrit.wikimedia.org/r/335488 [19:56:12] who's doing SWAT fwiw? [19:56:30] TabbyCat: it was not a good time when it started, swat was skipped [19:56:37] (03CR) 10jerkins-bot: [V: 04-1] toollabs: Add temp role and cron to send weekly tools precise deprecation reminders [puppet] - 10https://gerrit.wikimedia.org/r/335488 (owner: 10Madhuvishy) [19:56:42] phabricator was down/read-only [19:56:51] greg-g: I can do a swat patch now if it's ok with you [19:56:57] greg-g: oh, yep; wasn't there so I didn't know [19:56:58] sure, if you want [19:57:03] train should be easy today [19:57:11] It'd be https://gerrit.wikimedia.org/r/#/c/335487 [19:57:16] TabbyCat: looking [19:57:19] I can post at Wikitech [19:57:24] twentyafterfour: dude, don't jinx yourself! [19:57:56] * twentyafterfour isn't superstitious [19:58:40] (03CR) 1020after4: [C: 032] FlaggedRevs user group name changes for fa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335487 (https://phabricator.wikimedia.org/T156942) (owner: 10MarcoAurelio) [19:59:14] 06Operations, 06Performance-Team, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc) - https://phabricator.wikimedia.org/T156922#2991046 (10Krinkle) a:03Krinkle [19:59:31] !log Freshening phabricator's elasticsearch index, currently 50% complete [19:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:47] !log scheduled downtime in icinga for phab2001's phd service [19:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170201T2000). [20:00:17] (03PS2) 10Madhuvishy: toollabs: Add temp role and cron to send weekly tools precise deprecation reminders [puppet] - 10https://gerrit.wikimedia.org/r/335488 (https://phabricator.wikimedia.org/T149214) [20:00:19] (03Merged) 10jenkins-bot: FlaggedRevs user group name changes for fa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335487 (https://phabricator.wikimedia.org/T156942) (owner: 10MarcoAurelio) [20:00:20] jouncebot: hang on I'm swatting :P [20:00:38] (03CR) 10jenkins-bot: FlaggedRevs user group name changes for fa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335487 (https://phabricator.wikimedia.org/T156942) (owner: 10MarcoAurelio) [20:00:38] RECOVERY - puppet last run on notebook1002 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [20:00:40] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07HHVM: MediaWiki tests causes HHVM to segfault on Jessie - https://phabricator.wikimedia.org/T156923#2991055 (10hashar) https://gerrit.wikimedia.org/r/#/c/323401/ had two builds run Failed https://integration.wikimedia.org...
[20:01:14] (03CR) 10jerkins-bot: [V: 04-1] toollabs: Add temp role and cron to send weekly tools precise deprecation reminders [puppet] - 10https://gerrit.wikimedia.org/r/335488 (https://phabricator.wikimedia.org/T149214) (owner: 10Madhuvishy) [20:03:35] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#2991069 (10aaron) a:05Joe>03aaron [20:04:38] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3335.40 Read Requests/Sec=3225.10 Write Requests/Sec=2.30 KBytes Read/Sec=30180.00 KBytes_Written/Sec=66.40 [20:05:33] (03PS3) 10Madhuvishy: toollabs: Add temp role and cron to send weekly tools precise deprecation reminders [puppet] - 10https://gerrit.wikimedia.org/r/335488 (https://phabricator.wikimedia.org/T149214) [20:05:50] TabbyCat: patch deployed, does that fix the problem? [20:06:04] wait no it didn't deploy [20:06:06] hang on [20:06:10] twentyafterfour: deployed and live or mwdebug [20:06:13] ah that [20:06:26] I saw no !_log :) [20:06:29] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07HHVM: New HHVM 3.12.11 segfault at end of MediaWiki PHPUnit tests - https://phabricator.wikimedia.org/T156923#2990073 (10hashar) [20:06:30] Check 'Logstash Error rate for mw1278.eqiad.wmnet' failed: ERROR: 88% OVER_THRESHOLD (Avg. Error rate: Before: 0.01, After: 1.00, Threshold: 0.11) [20:06:32] (03CR) 10jerkins-bot: [V: 04-1] toollabs: Add temp role and cron to send weekly tools precise deprecation reminders [puppet] - 10https://gerrit.wikimedia.org/r/335488 (https://phabricator.wikimedia.org/T149214) (owner: 10Madhuvishy) [20:07:10] greg-g: When you get the chance, can you send a quick email to discovery@lists.wikimedia.org detailing what happened, so the engineers can take a look? Not urgent. [20:08:28] TabbyCat: maybe it was an anomaly but that looks like a big canary failure [20:08:57] (03PS4) 10Madhuvishy: toollabs: Add temp role and cron to send weekly tools precise deprecation reminders [puppet] - 10https://gerrit.wikimedia.org/r/335488 (https://phabricator.wikimedia.org/T149214) [20:08:59] twentyafterfour: scap sync wmf-config/InitialiseSettings.php ? [20:09:31] twentyafterfour: hrm this may have to do with the small initial value of the error rate [20:09:33] also, that mw1278 doesn't sound right to me, they usually do that on mwdebug1002 [20:09:37] 06Operations, 10netops: asw-d-codfw public1-vlan addition review (blocks gerrit2001) - https://phabricator.wikimedia.org/T156957#2991089 (10RobH) [20:10:22] (03CR) 10Zhuyifei1999: toollabs: Add temp role and cron to send weekly tools precise deprecation reminders (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/335488 (https://phabricator.wikimedia.org/T149214) (owner: 10Madhuvishy) [20:10:32] !log continuing mariadb rolling restart of db2044, db2051, db2058, db2065 [20:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:50] 06Operations, 10netops: asw-d-codfw public1-vlan addition review (blocks gerrit2001) - https://phabricator.wikimedia.org/T156957#2991089 (10RobH) [20:12:00] 06Operations, 10netops: asw-d-codfw public1-vlan addition review (blocks gerrit2001) - https://phabricator.wikimedia.org/T156957#2991113 (10RobH) Also I assumed Faidon, but if this wasn't best, and I should leave netops tasks unassigned in that project for triage, just let me know!
[20:13:12] TabbyCat: that patch only touched flaggedrevs.php [20:13:36] eeep, yep, twentyafterfour, too used to IS.php [20:14:05] so it should be a scap sync of that file, wmf-config/flaggedrevs.php, I think [20:14:26] not sure though, I'm no deployer [20:14:31] yeah I'm gonna try it one more time [20:14:38] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=6.40 Read Requests/Sec=0.00 Write Requests/Sec=1.00 KBytes Read/Sec=0.00 KBytes_Written/Sec=13.60 [20:15:14] !log twentyafterfour@tin Synchronized wmf-config/flaggedrevs.php: deploy I1683b184fbf4c0c6fb0d4dad1fcde3a253a30cb4 refs T156942 (duration: 00m 40s) [20:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:18] T156942: Two "autopatrol" user group in Persian Wikipedia - https://phabricator.wikimedia.org/T156942 [20:15:34] TabbyCat: done [20:15:38] (03CR) 10Mobrovac: [C: 031] Enable Prometheus JMX exporter on Cassandra nodes [puppet] - 10https://gerrit.wikimedia.org/r/332535 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans) [20:15:45] :D [20:15:47] checks [20:15:49] I guess it was an anomaly [20:16:42] hmm, very weird [20:16:47] it's still there [20:16:56] so it must be coming from somewhere else as well [20:16:58] must be defined somewhere else as well [20:16:59] heh [20:17:02] xD [20:17:26] (03CR) 10Zhuyifei1999: "Sorry I guess I should have removed the comments on PS2 first." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/335488 (https://phabricator.wikimedia.org/T149214) (owner: 10Madhuvishy) [20:17:41] I guess I'll have to 'grep' the whole config and see where that is coming from [20:17:44] sigh [20:18:18] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [20:19:16] hah, found it [20:19:23] on abusefilter.php [20:23:15] (03PS1) 10MarcoAurelio: Further renaming 'autopatrol' for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335491 [20:23:26] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2991147 (10Fjalapeno) >>! In T66214#2988556, @GWicke wrote: >>>! In T66214#2988550, @Tgr wrote: >>>>! In T66214#2982093, @Fjalapeno wrote: >>> One piec... [20:24:05] twentyafterfour: https://gerrit.wikimedia.org/r/335491 ;) :? [20:24:45] (03CR) 1020after4: [C: 032] Further renaming 'autopatrol' for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335491 (owner: 10MarcoAurelio) [20:25:59] ostriches: I'll schedule something... [20:26:21] Or just figure it out on Phab, if we have tasks [20:26:22] :) [20:26:38] (03Merged) 10jenkins-bot: Further renaming 'autopatrol' for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335491 (owner: 10MarcoAurelio) [20:26:48] I like talking :-p [20:27:01] Meetings need not be evil if there is an agenda so that they don't just fill the allotted time [20:27:31] Or, well, may be a necessary evil instead of an unnecessary one :-p [20:27:39] (03CR) 10jenkins-bot: Further renaming 'autopatrol' for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335491 (owner: 10MarcoAurelio) [20:28:19] (03CR) 10Zhuyifei1999: [C: 04-1] "-1 because the crontab's path is wrong..."
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/335488 (https://phabricator.wikimedia.org/T149214) (owner: 10Madhuvishy) [20:29:17] !log twentyafterfour@tin Synchronized wmf-config/abusefilter.php: deploy I0b4e02714ea0d99da96e4ca1ce7de2a1a9552791 refs T156942 (duration: 00m 39s) [20:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:21] T156942: Two "autopatrol" user group in Persian Wikipedia - https://phabricator.wikimedia.org/T156942 [20:29:30] TabbyCat: ^ [20:30:34] pff [20:30:59] still not gone, but it only has the abusefilter rights we've just removed [20:31:26] greg-g: talked about jinxing... it's happening [20:31:36] lol [20:32:23] TabbyCat: I'm going to deploy wmf.10 to group1, I can sync another config change if you find it elsewhere [20:33:57] twentyafterfour: go ahead, please - I've already taken enough of your time [20:34:05] very sorry [20:34:25] TabbyCat: no problem at all [20:34:49] (03PS1) 1020after4: group1 wikis to 1.29.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335494 [20:34:51] (03CR) 1020after4: [C: 032] group1 wikis to 1.29.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335494 (owner: 1020after4) [20:34:54] (03CR) 10Hashar: [C: 031] "All fine to me / good enough. The thresholds levels should not cause any false alarm." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/335373 (owner: 10Rush) [20:36:19] (03Merged) 10jenkins-bot: group1 wikis to 1.29.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335494 (owner: 1020after4) [20:36:22] (03PS7) 10Dzahn: varnish misc: add phab2001 as a backend for phab-new [puppet] - 10https://gerrit.wikimedia.org/r/324797 (https://phabricator.wikimedia.org/T137928) [20:36:52] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.29.0-wmf.10 [20:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:41] (03CR) 10jenkins-bot: group1 wikis to 1.29.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335494 (owner: 1020after4) [20:37:49] (03CR) 10Zhuyifei1999: [C: 04-1] toollabs: Add temp role and cron to send weekly tools precise deprecation reminders (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/335488 (https://phabricator.wikimedia.org/T149214) (owner: 10Madhuvishy) [20:38:15] (03CR) 10Paladox: varnish misc: add phab2001 as a backend for phab-new (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324797 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [20:38:48] Function: ApiQueryAllUsers::execute Error: 1054 Unknown column 'ug_expiry' in 'where clause' [20:39:34] let me guess [20:39:37] Schema change was required or code not made robust to work without it for now? [20:39:43] someone enabled a functionality [20:39:50] Until schema change is applied [20:39:50] didn't wait for it to be deployed [20:40:00] something like that :-/ [20:40:01] I saw it in update.php [20:40:08] this looks like it's coming from wmf.10 [20:40:22] But assume the code works ok without schema change [20:40:29] Yes it's new afaik [20:40:30] (03CR) 10Tjones: Deploy TextCat Improvements (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334729 (https://phabricator.wikimedia.org/T149324) (owner: 10Tjones) [20:40:49] Is it not guarded behind a new $wg? [20:40:53] And we've not set it to false?
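The "Unknown column 'ug_expiry'" error above means the wmf.10 code is querying a column the schema change has not yet added. A minimal sketch of checking for it, assuming direct database access (the host name is a placeholder, not from this log):

```
# Empty output means the user-group-expiry schema change is not applied yet
# on this wiki's database.
mysql -h db1234.eqiad.wmnet enwiki -e \
  "SHOW COLUMNS FROM user_groups LIKE 'ug_expiry';"
```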
[20:40:58] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 5263.13 seconds [20:41:06] that is me [20:41:08] ignore [20:41:12] Seems so [20:41:18] file": "/srv/mediawiki/php-1.29.0-wmf.10/includes/api/ApiQueryAllUsers.php", "line": 211, [20:41:20] $wgDisableUserGroupExpiry = false; [20:41:22] ack, expiration expired [20:41:29] That needs adding to CommonSettings.php [20:41:32] because I was with phab [20:41:44] (03PS8) 10Dzahn: varnish misc: add phab2001 as a backend for phab-new [puppet] - 10https://gerrit.wikimedia.org/r/324797 (https://phabricator.wikimedia.org/T137928) [20:41:55] * Reedy makes a patch [20:42:10] I thought TTO already did that? [20:42:17] legoktm: Presumably not merged? [20:42:26] It's not in IS or CS [20:42:42] (03CR) 10Paladox: [C: 031] "Looks all good @bblack would you be able to review please?" [puppet] - 10https://gerrit.wikimedia.org/r/324797 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [20:42:44] (03PS2) 10Tjones: Deploy TextCat Improvements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334729 (https://phabricator.wikimedia.org/T149324) [20:42:48] https://gerrit.wikimedia.org/r/#/c/332721/ [20:43:16] so it is [20:43:56] reedy@tin:~$ mwscript eval.php enwiki [20:43:57] > var_dump( $wgDisableUserGroupExpiry ); [20:43:57] bool(true) [20:44:23] https://gerrit.wikimedia.org/r/#/c/335026/ is in wmf.10 [20:44:45] Dereckson: ping [20:45:07] pong [20:45:43] Dereckson: I can't find where the 'autopatrol' 'autopatrolled' dichotomy is coming from now [20:45:59] wtf: Syntax Warning: Can't create transform [20:46:08] Reedy: working on a patch [20:46:10] well, for historical legends pre-2012, you can ask Reedy [20:46:21] and ... Syntax Warning: Couldn't link the profiles [20:46:30] we did sync flaggedrevs.php and abusefilter.php which both contained the wrong name and deployed, but it's still there Dereckson [20:46:43] TabbyCat: I imagine it's to disambiguate between the right and the group? [20:46:52] legoktm: thanks! [20:47:09] Dereckson: 'autopatrol' as group name should not exist, but 'autopatrolled' [20:47:30] (03PS3) 10Tjones: Deploy TextCat Improvements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334729 (https://phabricator.wikimedia.org/T149324) [20:47:36] I got 10,000 each of those syntax warning messages but then they went away [20:47:47] I'm leaving now, will try to have a look later [20:47:48] flaggedrevs.php has some $wgFlaggedRevsRestrictionLevels = [ '', 'autoconfirmed', 'autopatrol', 'review' ]; [20:47:53] after just about 2 minutes [20:48:16] Dereckson: but that's not the problem, these are right [20:48:36] $wgGroupPermissions['autopatrol']['autoreview'] = true; [20:48:41] (for fa.wikipedia) [20:48:45] (line 365) [20:48:46] hm? [20:48:50] Dereckson: https://phabricator.wikimedia.org/T156942 [20:49:31] this is the problem https://fa.wikipedia.org/wiki/%D9%88%DB%8C%DA%98%D9%87:%D8%A7%D8%AE%D8%AA%DB%8C%D8%A7%D8%B1%D8%A7%D8%AA_%DA%AF%D8%B1%D9%88%D9%87%E2%80%8C%D9%87%D8%A7%DB%8C_%DA%A9%D8%A7%D8%B1%D8%A8%D8%B1%DB%8C#autopatrol [20:50:39] $wgGroupPermissions['autopatrolled']['autoreview'] = true; <--- that's what's currently [20:50:50] anyway, brb in ~20 mins. [20:53:00] twentyafterfour: Patches in the jenkins workflow [20:53:47] Reedy: thanks [20:53:55] shall I deploy once merged?
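Re-checking the flag after the config patch lands could mirror the eval.php session quoted above; a sketch, assuming eval.php accepts piped input the same way it does interactively:

```
# On the deployment host; prints bool(false) once the setting is in place.
echo 'var_dump( $wgDisableUserGroupExpiry );' | mwscript eval.php enwiki
```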
[20:53:57] legoktm made them [20:54:02] Yeah, it would be a good idea :) [20:54:09] See how long jerkins takes [20:55:00] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 0.36 seconds [20:55:15] I see https://gerrit.wikimedia.org/r/#/c/335496/ was there another? [20:55:49] I mean patches, as in, in master and .10 [20:55:57] ahh [20:56:18] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:57:07] (03PS5) 10Madhuvishy: toollabs: Add temp role and cron to send weekly tools precise deprecation reminders [puppet] - 10https://gerrit.wikimedia.org/r/335488 (https://phabricator.wikimedia.org/T149214) [20:57:55] damn it I clicked cherry pick on the original patch and that somehow created a patchset #2 on the cherrypick [20:58:02] stupid gerrit [20:58:40] 06Operations, 10Wikimedia-General-or-Unknown: Class 'Memcached' not found when running mwscript eval.php on mw1017, mw1099 - https://phabricator.wikimedia.org/T150912#2991296 (10Dereckson) 05Invalid>03Open Reopening, as we have the same issue on mwdebug1002. [20:58:47] 06Operations, 10Wikimedia-General-or-Unknown: Class 'Memcached' not found when running mwscript eval.php on debug servers - https://phabricator.wikimedia.org/T150912#2991298 (10Dereckson) [20:59:30] (03PS6) 10Madhuvishy: toollabs: Add temp role and cron to send weekly tools precise deprecation reminders [puppet] - 10https://gerrit.wikimedia.org/r/335488 (https://phabricator.wikimedia.org/T149214) [20:59:36] twentyafterfour: uh....that's what the cherry-pick button does? :P [20:59:47] you'd think, the same patch onto the same HEAD... [20:59:52] It would be a noop [20:59:58] commit metadata is different :P [20:59:59] you'd think [21:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170201T2100). [21:00:11] it should have warned me at least :-/ [21:00:20] oh well [21:00:27] Add it to the gerrit sucks list [21:00:29] halfak: should we deploy? [21:00:32] (03CR) 10Madhuvishy: toollabs: Add temp role and cron to send weekly tools precise deprecation reminders (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/335488 (https://phabricator.wikimedia.org/T149214) (owner: 10Madhuvishy) [21:03:00] (03CR) 10Madhuvishy: "Thanks for the CRs Zhuyifei1999, I've changed most of the syntax/typo/email content things, but not making the optimization changes right " [puppet] - 10https://gerrit.wikimedia.org/r/335488 (https://phabricator.wikimedia.org/T149214) (owner: 10Madhuvishy) [21:03:10] master patch merged [21:10:02] (03PS7) 10Madhuvishy: toollabs: Add temp role and cron to send weekly tools precise deprecation reminders [puppet] - 10https://gerrit.wikimedia.org/r/335488 (https://phabricator.wikimedia.org/T149214) [21:12:28] (03PS8) 10Madhuvishy: toollabs: Add temp role and cron to send weekly tools precise deprecation reminders [puppet] - 10https://gerrit.wikimedia.org/r/335488 (https://phabricator.wikimedia.org/T149214) [21:15:13] twentyafterfour: It's merged [21:19:59] twentyafterfour: for https://phabricator.wikimedia.org/T156942#2991255 try to also sync InitialiseSettings.php to trigger a configuration cache update?
[21:20:49] twentyafterfour: I'm not sure new configuration is picked up with only abusefilter/flaggedrevs update [21:21:04] !log reedy@tin Synchronized php-1.29.0-wmf.10/includes/api/: Guard more ug_expiry queries (duration: 00m 48s) [21:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:28] on it [21:23:49] Dereckson: ok I'll try that [21:25:18] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [21:27:41] (03CR) 10BryanDavis: "a few wording suggestions" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/335488 (https://phabricator.wikimedia.org/T149214) (owner: 10Madhuvishy) [21:28:20] (03CR) 10Zhuyifei1999: "Okay :)" [puppet] - 10https://gerrit.wikimedia.org/r/335488 (https://phabricator.wikimedia.org/T149214) (owner: 10Madhuvishy) [21:30:22] (03PS1) 10Ottomata: Clean up analytics icinga contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/335523 [21:32:10] (03PS9) 10Madhuvishy: toollabs: Add temp role and cron to send weekly tools precise deprecation reminders [puppet] - 10https://gerrit.wikimedia.org/r/335488 (https://phabricator.wikimedia.org/T149214) [21:33:12] 06Operations, 10scap, 06Release-Engineering-Team (Long-Lived-Branches): Make git 2.2.0+ (preferably 2.8.x) available - https://phabricator.wikimedia.org/T140927#2991384 (10hashar) beta cluster and CI can definitely benefit from a newer git version. Can't it be pushed to `jessie-wikimedia/backports` and then... [21:33:13] (03CR) 10Madhuvishy: toollabs: Add temp role and cron to send weekly tools precise deprecation reminders (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/335488 (https://phabricator.wikimedia.org/T149214) (owner: 10Madhuvishy) [21:38:53] (03CR) 10Ottomata: [C: 032] Clean up analytics icinga contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/335523 (owner: 10Ottomata) [21:39:22] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs - https://phabricator.wikimedia.org/T149847#2991424 (10Krinkle) [21:39:33] !log twentyafterfour@tin Synchronized wmf-config/InitialiseSettings.php: sync InitializeSettings to activate change from previous patches refs T156942 (duration: 00m 41s) [21:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:40] T156942: Two "autopatrol" user group in Persian Wikipedia - https://phabricator.wikimedia.org/T156942 [21:42:20] Dereckson and twentyafterfour hi again - I saw you sync InitialiseSettings, but still that group remains with abusefilter rights -- maybe a full sync of the whole wmf-config/ folder would solve it?
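Before reaching for a full-folder sync, any leftover definitions can be hunted down the same way abusefilter.php was found earlier; a minimal sketch from a mediawiki-config checkout:

```
# Every place the old group name is still defined:
grep -rn "'autopatrol'" wmf-config/
```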
[21:43:48] config at https://phabricator.wikimedia.org/source/mediawiki-config/browse/master/wmf-config/abusefilter.php;8cb98dc0f6a4456bb57067ed89692fd5b712a151$162 is correct [21:43:55] not sure why it's not inheriting [21:46:10] !log bsitzmann@tin Started deploy [mobileapps/deploy@09101f7]: Update mobileapps to e48a88c [21:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:14] !log bsitzmann@tin Finished deploy [mobileapps/deploy@09101f7]: Update mobileapps to e48a88c (duration: 03m 04s) [21:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:43] PROBLEM - MariaDB Slave Lag: s2 on db1067 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 13804.75 seconds [21:52:34] (03Abandoned) 10Dzahn: varnish misc: add phab2001 as a backend for phab-new [puppet] - 10https://gerrit.wikimedia.org/r/324797 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [21:54:03] paged for db1067, known/expected? [21:54:57] (03PS1) 10Tim Landscheidt: Fix typo in realm.pp [puppet] - 10https://gerrit.wikimedia.org/r/335548 [21:56:09] looks like stats just came back [21:57:18] (03CR) 10BryanDavis: [C: 031] toollabs: Add temp role and cron to send weekly tools precise deprecation reminders [puppet] - 10https://gerrit.wikimedia.org/r/335488 (https://phabricator.wikimedia.org/T149214) (owner: 10Madhuvishy) [21:57:24] checking mw-config to see if it is pooled [21:58:58] looks like it wasn't receiving traffic before today according to https://grafana.wikimedia.org/dashboard/db/mysql?var-dc=eqiad%20prometheus%2Fops&var-server=db1067&from=now-24h&to=now [21:59:45] (03PS10) 10Madhuvishy: toollabs: Add temp role and cron to send weekly tools precise deprecation reminders [puppet] - 10https://gerrit.wikimedia.org/r/335488 (https://phabricator.wikimedia.org/T149214) [21:59:57] ah just got reimaged to jessie, perhaps it is expired downtime on db1067, jynus? [22:00:14] (03CR) 10Madhuvishy: [V: 032 C: 032] toollabs: Add temp role and cron to send weekly tools precise deprecation reminders [puppet] - 10https://gerrit.wikimedia.org/r/335488 (https://phabricator.wikimedia.org/T149214) (owner: 10Madhuvishy) [22:01:10] yes, that's it [22:01:39] 06Operations, 10ops-eqiad, 10hardware-requests: replacement hardware for iridium (phabricator) - https://phabricator.wikimedia.org/T156970#2991469 (10Dzahn) [22:01:43] I am doing 7 things at the same time [22:01:50] 06Operations, 10ops-eqiad, 10Phabricator, 10hardware-requests: replacement hardware for iridium (phabricator) - https://phabricator.wikimedia.org/T156970#2991483 (10Dzahn) [22:02:03] because I have 7 people telling me their change is top priority [22:02:10] 06Operations, 10ops-eqiad, 10Phabricator, 10hardware-requests: replacement hardware for iridium (phabricator) - https://phabricator.wikimedia.org/T156970#2991485 (10Dzahn) [22:02:15] 06Operations, 10Phabricator, 06Release-Engineering-Team: reinstall iridium (phabricator) as phab1001 with jessie - https://phabricator.wikimedia.org/T152129#2839436 (10Dzahn) [22:02:27] sounds like spinning plates! [22:02:41] how long should we downtime it for? [22:03:05] I will disable alerts [22:03:10] so it doesn't page again [22:03:55] sounds good, thanks!
[22:04:39] 06Operations, 10ops-eqiad, 10Phabricator, 10hardware-requests: replacement hardware for iridium (phabricator) - https://phabricator.wikimedia.org/T156970#2991492 (10Dzahn) The advantages here (over reinstalling iridium): - less maintenance/downtime for end users - don't have to fix everything at once in... [22:04:40] godog: hey! [22:04:43] am updating lots of docs [22:04:44] saw [22:04:47] rcstream diamond_collector.py [22:04:47] Parse RCStream stats from localhost:10080 [22:04:50] https://wikitech.wikimedia.org/wiki/Prometheus [22:04:51] what is that? :) [22:05:20] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:05:39] ottomata: hey! I have no idea :) [22:05:50] besides what it says on the tin heh [22:08:45] !log deploying schema change to page_assessments_projects on testwiki T156305 [22:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:51] T156305: Update page_assessments_projects schema for subprojects in production - https://phabricator.wikimedia.org/T156305 [22:10:45] 06Operations, 10ops-eqiad, 10Phabricator, 06Release-Engineering-Team, 10hardware-requests: replacement hardware for iridium (phabricator) - https://phabricator.wikimedia.org/T156970#2991504 (10Dzahn) procurement ticket for iridium was https://rt.wikimedia.org/Ticket/Display.html?id=6772 they (12 misc s... [22:10:50] 06Operations, 10ops-eqiad, 10Phabricator, 06Release-Engineering-Team, 10hardware-requests: replacement hardware for iridium (phabricator) - https://phabricator.wikimedia.org/T156970#2991506 (10Paladox) [22:10:59] (03PS1) 10Madhuvishy: toollabs: Fix precise_reminder script file name [puppet] - 10https://gerrit.wikimedia.org/r/335550 [22:11:35] (03CR) 10Madhuvishy: [V: 032 C: 032] toollabs: Fix precise_reminder script file name [puppet] - 10https://gerrit.wikimedia.org/r/335550 (owner: 10Madhuvishy) [22:13:54] hah ok! [22:14:55] (03PS1) 10Jcrespo: Add replication client grants to phuser [puppet] - 10https://gerrit.wikimedia.org/r/335551 [22:15:57] Ohhh godog, it's just a monitoring script for the rcstream service [22:16:00] cool, not an rcstream consumer [22:16:39] ottomata: yeah, I found it when auditing all diamond plugins a while back, didn't even know it existed [22:21:45] 06Operations, 10ops-eqiad, 10Phabricator, 06Release-Engineering-Team, 10hardware-requests: replacement hardware for iridium (phabricator) - https://phabricator.wikimedia.org/T156970#2991554 (10RobH) a:03mark iridium's (current eqiad phab host) specs: * 1U system * Dual Intel Xeon CPU E5-2450 v2 @ 2.5... [22:24:34] 06Operations, 10ops-eqiad, 10Phabricator, 06Release-Engineering-Team, 10hardware-requests: replacement hardware for iridium (phabricator) - https://phabricator.wikimedia.org/T156970#2991562 (10mmodell) [22:24:51] (03CR) 1020after4: [C: 031] Add replication client grants to phuser [puppet] - 10https://gerrit.wikimedia.org/r/335551 (owner: 10Jcrespo) [22:25:43] (03PS2) 10Jcrespo: Add replication client grants to phuser [puppet] - 10https://gerrit.wikimedia.org/r/335551 [22:27:10] so I've been told I should ask you to do a touch and sync to fix that fawiki issue, as it's probably a caching one?
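The "touch and sync" being requested is a common way to bust MediaWiki's configuration cache: changing the file's mtime forces the cache to regenerate. A hedged sketch from the deployment host (the file choice and log message are illustrative, not commands from this log):

```
touch wmf-config/InitialiseSettings.php
scap sync-file wmf-config/InitialiseSettings.php 'Touch to refresh config cache (T156942)'
```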
[22:27:51] (CR) Jcrespo: [C: 2] Add replication client grants to phuser [puppet] - https://gerrit.wikimedia.org/r/335551 (owner: Jcrespo)
[22:29:41] (PS1) Jcrespo: Revert "Add replication client grants to phuser" [puppet] - https://gerrit.wikimedia.org/r/335554
[22:30:04] Operations, ops-esams, hardware-requests: reclaim hooft to spares - https://phabricator.wikimedia.org/T131560#2991567 (RobH) a:RobH>None This will need to be wiped by onsite actually. So it has to go to @mark or just have #ops-esams for the wipe step.
[22:30:56] Operations, ops-esams, hardware-requests: reclaim hooft to spares - https://phabricator.wikimedia.org/T131560#2991571 (RobH)
[22:31:29] @seen hashar
[22:31:29] mutante: Last time I saw hashar they were quitting the network with reason: Remote host closed the connection N/A at 2/1/2017 9:09:54 PM (1h21m34s ago)
[22:32:45] (PS2) Jcrespo: Revert "Add replication client grants to phuser" [puppet] - https://gerrit.wikimedia.org/r/335554
[22:33:39] (CR) Jcrespo: "Are you sure phuser is the right user, and not phadmin?" [puppet] - https://gerrit.wikimedia.org/r/335554 (owner: Jcrespo)
[22:34:20] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[22:38:59] (PS1) Madhuvishy: toollabs: Fix precise reminder block indent [puppet] - https://gerrit.wikimedia.org/r/335557
[22:40:28] (PS1) Thcipriani: logstash_checker: provide an absolute threshold [puppet] - https://gerrit.wikimedia.org/r/335558
[22:40:47] (CR) Madhuvishy: [C: 2] toollabs: Fix precise reminder block indent [puppet] - https://gerrit.wikimedia.org/r/335557 (owner: Madhuvishy)
[22:43:19] Operations, Phabricator, Release-Engineering-Team, Patch-For-Review: Setup test domain for phab2001 - https://phabricator.wikimedia.org/T152132#2991626 (Dzahn) stalled>declined see reason above, we can't do this yet. Instead we are requesting new hardware for phab1001 (T156970) to test wi...
[22:46:52] !log dereckson@tin Synchronized wmf-config/: Folder sync to get around caching issue in previous deployments (T156942) (duration: 00m 45s)
[22:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:46:58] T156942: Two "autopatrol" user group in Persian Wikipedia - https://phabricator.wikimedia.org/T156942
[22:47:11] (CR) Ottomata: "Not sure where exactly wqds_extract comes from, but would it be possible to use the existing refinery-drop-hourly-partitions script?" [puppet] - https://gerrit.wikimedia.org/r/335437 (https://phabricator.wikimedia.org/T146915) (owner: Nschaaf)
[22:49:24] TabbyCat: did you check that the autopatrol group is empty, by the way?
[22:51:49] Dereckson: of course, since it was renamed a few weeks ago
[22:52:04] and migrateUserGroups.php was run as well
[22:52:16] All in order, then.
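The "touch and sync" / folder-sync fix above works because deployed configuration is cached and only rebuilt when the source files look newer: touching a file bumps its mtime, and syncing the directory propagates that, so the appservers discard the stale cache. The real logic is PHP inside wmf-config; the following toy Python version, with made-up file names, just illustrates the mtime-keyed invalidation pattern:

```python
# Toy mtime-keyed config cache, illustrating why `touch` plus a sync
# clears a stale cache. File names are made up; the real mechanism is
# PHP code in wmf-config.
import json
import os


def load_config(src='settings.json', cache='settings.cache.json'):
    src_mtime = os.path.getmtime(src)
    if os.path.exists(cache):
        with open(cache) as f:
            cached = json.load(f)
        if cached.get('mtime') == src_mtime:
            return cached['config']  # cache still valid
    # Cache missing or stale (e.g. after `touch settings.json` + sync):
    with open(src) as f:
        config = json.load(f)
    with open(cache, 'w') as f:
        json.dump({'mtime': src_mtime, 'config': config}, f)
    return config
```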
[22:52:48] it was empty today as well
[22:53:00] thank you
[22:54:51] !log carbon - rsyncing /srv/ data to install1002 (T132757)
[22:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:54:56] T132757: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757
[23:13:10] (PS1) Madhuvishy: toollabs: Better exception handling for precise-reminder [puppet] - https://gerrit.wikimedia.org/r/335560
[23:14:18] (CR) jerkins-bot: [V: -1] toollabs: Better exception handling for precise-reminder [puppet] - https://gerrit.wikimedia.org/r/335560 (owner: Madhuvishy)
[23:16:52] (PS2) Madhuvishy: toollabs: Better exception handling for precise-reminder [puppet] - https://gerrit.wikimedia.org/r/335560
[23:18:13] (PS1) Gergő Tisza: Do not throw away $wgRateLimitsExcludedIPs defaults when there is a wiki-specific setting [mediawiki-config] - https://gerrit.wikimedia.org/r/335561 (https://phabricator.wikimedia.org/T87841)
[23:19:09] (PS2) Gergő Tisza: Do not throw away $wgRateLimitsExcludedIPs defaults when there is a wiki-specific setting [mediawiki-config] - https://gerrit.wikimedia.org/r/335561 (https://phabricator.wikimedia.org/T87841)
[23:23:43] (CR) 20after4: [C: 1] logstash_checker: provide an absolute threshold [puppet] - https://gerrit.wikimedia.org/r/335558 (owner: Thcipriani)
[23:27:40] (CR) Madhuvishy: [C: 2] toollabs: Better exception handling for precise-reminder [puppet] - https://gerrit.wikimedia.org/r/335560 (owner: Madhuvishy)
[23:31:20] (PS1) BryanDavis: Ignore lighttpd-precise in service.manifest [software/tools-webservice] - https://gerrit.wikimedia.org/r/335569 (https://phabricator.wikimedia.org/T94792)
[23:48:05] Operations, DBA, Phabricator, Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2991826 (jcrespo) removing m3 from dbstore2001: db1043-bin.001457:753455796
[23:48:10] RECOVERY - MariaDB Slave Lag: m3 on dbstore2001 is OK: OK slave_sql_lag not a slave
[23:50:22] Operations, DBA, Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2991827 (jcrespo)
```
sudo salt -C 'G@cluster:mysql' cmd.run 'mysql --skip-ssl -e "SELECT @@ssl_ca"' | grep -c 'Puppet'
130
```
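The fenced salt one-liner in the last Phabricator comment above audits the whole MySQL fleet at once: it asks every host for @@ssl_ca and counts how many answers mention the Puppet CA (130 at that point). A rough single-host equivalent in Python; it assumes a local mysql/MariaDB client that accepts --skip-ssl and picks up credentials from its default option files:

```python
# Single-host version of the fleet-wide salt audit above: does this
# server's @@ssl_ca point at the Puppet CA? Assumes the `mysql` client
# is installed and credentials come from the default option files.
import subprocess


def uses_puppet_ca():
    out = subprocess.check_output(
        ['mysql', '--skip-ssl', '-e', 'SELECT @@ssl_ca'])
    return b'Puppet' in out


if __name__ == '__main__':
    print('Puppet CA configured' if uses_puppet_ca() else 'no Puppet CA')
```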
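On Gergő Tisza's $wgRateLimitsExcludedIPs change above: in the wiki-settings arrays of mediawiki-config, a wiki-specific value for a setting normally replaces the 'default' entry outright, so giving one wiki its own exclusion list silently dropped the globally excluded IPs; the fix is to merge the two instead. A small Python illustration of the replace-versus-merge semantics (the real machinery is PHP, and the wiki key and addresses are invented):

```python
# Replace-vs-merge semantics behind the $wgRateLimitsExcludedIPs patch
# above. The real settings machinery is PHP in mediawiki-config; the
# wiki key and addresses below are invented for the example.
SETTINGS = {
    'default': ['10.0.0.0/8'],      # exclusions every wiki should get
    'examplewiki': ['192.0.2.1'],   # extra wiki-specific exclusions
}


def excluded_ips_buggy(wiki):
    # Wiki-specific value replaces the default entirely.
    return SETTINGS.get(wiki, SETTINGS['default'])


def excluded_ips_fixed(wiki):
    # Wiki-specific value is merged with the default.
    return SETTINGS['default'] + SETTINGS.get(wiki, [])


assert excluded_ips_buggy('examplewiki') == ['192.0.2.1']  # defaults lost
assert excluded_ips_fixed('examplewiki') == ['10.0.0.0/8', '192.0.2.1']
```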