[00:00:35] (PS5) Smalyshev: Add definitions for WDQS service [puppet] - https://gerrit.wikimedia.org/r/223663
[00:04:35] Krenair: in what sense? :)
[00:05:17] honestly, the state of varnish (and even worse, the state of nginx) in the beta cluster is pretty awful right now. it needs some love and updates. but most of the varnish part mostly works.
[00:05:27] bblack, I'm wondering where the equivalent of templates/varnish/misc.inc.vcl.erb is
[00:05:46] Wanted to see if I could hook up noc.wikimedia.beta.wmflabs.org to test a change I was planning to propose to it
[00:06:04] I don't think there's a misc-web equivalent in beta at all
[00:06:17] the closest thing might be the domainproxy
[00:06:25] what kind of change?
[00:06:31] so you can get a new one by clicking in the wikitech ui
[00:06:53] bblack, apache config to redirect some URLs I wanted to remove
[00:07:06] but it's not gonna be the same to test. yea, what type of change
[00:07:08] ah
[00:07:23] so the apache config should still be testable then
[00:07:23] yeah there's no real test for that. codereview and get it right and be ready to revert :)
[00:07:35] mutante: there's no noc.beta though
[00:08:00] yes, you could only get noctest.wmflabs.org
[00:08:06] noc.wmflabs.org even
[00:08:14] I don't know how useful a noc.beta would be anyways, if we're not pointing whatever random automated things at it that might pull that data in production.
[00:08:36] to be honest, I'm really not even sure what is left that does hit noc
[00:08:42] pybal hasn't for a long time
[00:08:59] it's basically just a landing page
[00:09:02] that links to other tools
[00:09:10] nowadays
[00:09:45] pybal = config-master now
[00:10:01] well conf/, and db.php
[00:10:17] those look like things that there might be an 0.01% chance some automated tool still hits or relies on somehow
[00:10:18] yes, true.
db.php i found by accident :)
[00:10:35] the DBAs didnt know it, it's kind of duplicate of https://dbtree.wikimedia.org/
[00:10:47] yeah, I ran into that the other day
[00:10:58] i just linked it from noc
[00:11:07] also https://noc.wikimedia.org/info.php
[00:11:12] we can of course just look at raw traffic on noc.wm.o to see what's still hitting it
[00:11:20] Ops-Access-Requests, operations, Discovery, Wikidata, and 2 others: Need deploy rights for Wikidata Query Service - https://phabricator.wikimedia.org/T105185#1446833 (Smalyshev) @Dzahn yes but your proposed config as far as I can see allows only to sudo to user blazegraph, not to root, for services...
[00:12:13] Ops-Access-Requests, operations, Discovery, Wikidata, and 3 others: Need deploy rights for Wikidata Query Service - https://phabricator.wikimedia.org/T105185#1438123 (Smalyshev)
[00:13:29] Ops-Access-Requests, operations, Discovery, Wikidata, and 3 others: Need deploy rights for Wikidata Query Service - https://phabricator.wikimedia.org/T105185#1446848 (Dzahn) @Smalyshev the first proposal had ALL users, i just amended to blazegraph after your comment. you are probably right and you...
[00:15:25] Ops-Access-Requests, operations, Discovery, Wikidata, and 3 others: Need deploy rights for Wikidata Query Service - https://phabricator.wikimedia.org/T105185#1446858 (Smalyshev) @Dzahn I'm talking about service logs, like reporting that everything started/stopped/loaded/etc. properly and the progre...
[00:18:29] (PS3) Dzahn: add admin group 'wikidata query service deployers' [puppet] - https://gerrit.wikimedia.org/r/223984 (https://phabricator.wikimedia.org/T105185)
[00:30:45] !log logstash1004 fully recovered all shards
[00:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:31:48] looks like that took ~4 hours
[00:31:54] time to try the next one
[00:32:38] Ops-Access-Requests, operations, Discovery, Wikidata, and 3 others: Need deploy rights for Wikidata Query Service - https://phabricator.wikimedia.org/T105185#1446898 (Dzahn) I don't know the answer to the logging question yet. Looking at the parsoid role i see a comment "132 # until logging is...
[00:34:28] Ops-Access-Requests, operations, Discovery, Wikidata, and 3 others: Need deploy rights for Wikidata Query Service - https://phabricator.wikimedia.org/T105185#1446899 (Smalyshev) OK, maybe then we can temporarily add journalctl to the sudo list until we figure out how centralized logging will work?...
[00:34:57] !log rebooting logstash1005
[00:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:35:27] PROBLEM - Incoming network saturation on labstore2001 is CRITICAL 10.71% of data above the critical threshold [100000000.0]
[00:36:27] PROBLEM - Host logstash1005 is DOWN: PING CRITICAL - Packet loss = 100%
[00:37:18] RECOVERY - Host logstash1005 is UP: PING OK - Packet loss = 0%, RTA = 2.38 ms
[00:37:19] Ops-Access-Requests, operations, Discovery, Wikidata, and 3 others: Need deploy rights for Wikidata Query Service - https://phabricator.wikimedia.org/T105185#1446901 (Dzahn) Yes, that sounds good for now until we figure out something else.
[00:39:17] PROBLEM - Incoming network saturation on labstore2001 is CRITICAL 17.86% of data above the critical threshold [100000000.0]
[00:46:11] !log Upgraded Elasticsearch to 1.6.0 on logstash1005; replicas recovering now
[00:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:48:10] mutante: SAL is getting a new section for every edit now
[00:49:14] yeah
[00:49:45] sounds like https://gerrit.wikimedia.org/r/#/c/223046/12/adminlog.py is to blame
[00:49:50] line 42
[00:51:08] (PS1) John F. Lewis: mail: hiera-ise mailman and lists [puppet] - https://gerrit.wikimedia.org/r/224210
[00:51:12] I think what we should actually do is: year, month, day = split(" ")[1].split("-")
[00:51:29] uh, but obviously with a "line." before the first split
[00:52:11] there must be a later change. the header on line 46 is not what's being written
[00:52:19] but I think your right about the basic bug
[00:52:26] robh: can I have a merge for https://gerrit.wikimedia.org/r/#/c/224208/ ? labs/private, adding a dummy key
[00:55:46] JohnFLewis: can you hold on that a sec
[00:56:14] anything to do with private-repo file paths is going to push things further out of sync with some ongoing changes in prod...
[00:56:15] bblack: I can hold on to it for a while :)
[00:57:15] thanks!
[00:57:35] bblack: though ideally that being merged allows the hiera patch to be merged so the labs project can actually run mailman not-hackish which would be great for as soon as but I've seen the changes so, sure
[00:58:35] no, I get it.
[00:59:02] let me fix up the labs-private repo to match what's going on in prod at least, and then you can rebase on another upcoming commit, etc...
[01:04:00] JohnFLewis: https://gerrit.wikimedia.org/r/#/c/224211/1/files/README (and the rest of that commit)
[01:05:45] (also, recently the puppet code for that key already moved the location into that modules/secret/secrets/ hierarchy as well, for e.g.
prod)
[01:06:31] it's built into sslcert::certificate now, so "manifests/role/mail.pp: sslcert::certificate { 'lists.wikimedia.org': }" implies modules/secret/secrets/ssl/lists.wikimedia.org.key
[01:06:44] alright
[01:06:58] I'll fix the patch I have later since its latish here
[01:07:09] ok
[01:07:16] I think all you have to do is rename the path there, and then it will work
[01:07:24] (PS1) BryanDavis: Fix new section creation for each edit [debs/adminbot] - https://gerrit.wikimedia.org/r/224212
[01:07:30] Krenair: ^
[01:08:19] (PS1) BBlack: move majority of privates/files usage to secret() [puppet] - https://gerrit.wikimedia.org/r/224213
[01:11:07] (CR) BBlack: [C: -1] "Regex/grep too aggressive, this may have hit a few non-trivial cases as well..." [puppet] - https://gerrit.wikimedia.org/r/224213 (owner: BBlack)
[01:12:04] (PS1) Alex Monk: Redirect most noc.wikimedia.org/conf URLs to git.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/224214
[01:12:53] (CR) Alex Monk: "Redirects in I9c9fb197" [mediawiki-config] - https://gerrit.wikimedia.org/r/222942 (owner: Alex Monk)
[01:13:00] (PS1) JanZerebecki: Fix not repeating the date header [debs/adminbot] - https://gerrit.wikimedia.org/r/224215
[01:13:42] bd808, is that really the whole fix?
[01:13:54] should be, yeah
[01:13:55] Krenair: yes
[01:14:10] ohh it's changed a bit since elee's commit..
[01:14:32] bd808: you were faster
[01:15:02] jzerebecki: heh.
great minds
[01:15:24] (CR) JanZerebecki: [C: 1] Fix new section creation for each edit [debs/adminbot] - https://gerrit.wikimedia.org/r/224212 (owner: BryanDavis)
[01:15:38] I thought about a tuple comprehension instead but wasn't sure what version of python that started in and what we have in toollabs
[01:15:49] (CR) Alex Monk: [C: 1] Fix new section creation for each edit [debs/adminbot] - https://gerrit.wikimedia.org/r/224212 (owner: BryanDavis)
[01:15:59] (Abandoned) JanZerebecki: Fix not repeating the date header [debs/adminbot] - https://gerrit.wikimedia.org/r/224215 (owner: JanZerebecki)
[01:20:00] (PS2) BBlack: move majority of privates/files usage to secret() [puppet] - https://gerrit.wikimedia.org/r/224213
[01:20:33] (CR) BBlack: [C: 1] "Fixed now, was just one bad case to remove." [puppet] - https://gerrit.wikimedia.org/r/224213 (owner: BBlack)
[01:52:07] PROBLEM - LVS HTTP IPv6 on upload-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[01:53:49] RECOVERY - LVS HTTP IPv6 on upload-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 454 bytes in 1.001 second response time
[01:53:57] I don't know what that is, but once again icinga IPv6 alert doesn't appear to be real
[01:54:13] I happened to catch that when the alert first paged, and checked over ipv6 myself
[01:55:06] meh, doesnt mean i dont hop onto the computer in paranoia anyohw =P
[01:55:18] back to cooking dinner =]
[02:09:36] !log l10nupdate Synchronized php-1.26wmf13/cache/l10n: (no message) (duration: 00m 35s)
[02:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:09:45] !log LocalisationUpdate completed (1.26wmf13) at 2015-07-11 02:09:45+00:00
[02:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:25:19] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Jul 11 02:25:19 UTC 2015 (duration 25m 18s)
[02:25:25] Logged
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:25:30] !log l10nupdate Synchronized php-1.26wmf13/cache/l10n: (no message) (duration: 06m 07s)
[02:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:25:58] PROBLEM - Incoming network saturation on labstore2001 is CRITICAL 10.34% of data above the critical threshold [100000000.0]
[02:28:19] !log LocalisationUpdate completed (1.26wmf13) at 2015-07-11 02:28:18+00:00
[02:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:21:33] !log mattflaschen Synchronized php-1.26wmf13/extensions/Flow/includes/Parsoid/Utils.php: Bump Flow to encode page name when sending to Parsoid (duration: 00m 13s)
[03:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:47:33] (PS1) BryanDavis: beta: remove deployment-logstash1 [puppet] - https://gerrit.wikimedia.org/r/224219
[03:55:07] (CR) BryanDavis: "Cherry-picked to beta cluster" [puppet] - https://gerrit.wikimedia.org/r/224219 (owner: BryanDavis)
[03:57:24] anyone see that the SAL is creating a new header each log?
[03:59:33] (PS2) BryanDavis: beta: remove deployment-logstash1 [puppet] - https://gerrit.wikimedia.org/r/224219
[04:02:07] Negative24: yeah. I've got a patch up to fix it
[04:02:19] bd808: sounds good
[04:02:42] So trivial its dumb -- https://gerrit.wikimedia.org/r/#/c/224212/
[04:03:30] ha lol
[04:03:55] but where the bracket keys hard to reach :P
[04:03:59] *were
[04:04:19] heh. maybe :)
[04:04:44] more likely folks who don't hack a lot of python making the patch
[04:05:18] the differences between tuples and lists are somewhat subtle
[04:05:50] yep
[04:06:24] meanwhile...
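Editor's note: the list-versus-tuple remark above can be made concrete with a small sketch. This is not the adminbot code and not the content of change 224212; the header format `== 2015-07-11 ==` and the helper names `header_date_as_list`, `header_date_as_tuple` and `needs_new_section` are assumptions made for illustration. It only shows one way a "has the date changed?" check can fire on every edit: in Python a list never compares equal to a tuple, even when the elements match.

```python
import datetime


def header_date_as_list(line):
    # Hypothetical buggy variant: parses "== YYYY-MM-DD ==" into a *list* of ints.
    return [int(part) for part in line.split(" ")[1].split("-")]


def header_date_as_tuple(line):
    # Hypothetical fixed variant: same parsing, but returns a *tuple*.
    return tuple(int(part) for part in line.split(" ")[1].split("-"))


def needs_new_section(header_line, today, parse):
    # A new date section is needed only when the last header is not today's date.
    return parse(header_line) != (today.year, today.month, today.day)


today = datetime.date(2015, 7, 11)
header = "== 2015-07-11 =="  # assumed header format

# A list is never equal to a tuple, so this reports a "new day" on every edit.
print(needs_new_section(header, today, header_date_as_list))

# Tuple compared with tuple: no new section while the date is unchanged.
print(needs_new_section(header, today, header_date_as_tuple))
```

With the list-returning parser the check is true for every log entry, which would match the symptom reported in the channel; whether the real bug had this exact shape is not confirmed by the log.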
[04:06:37] * Negative24 is panicking because he just reset his uncommitted changes
[04:06:58] !log logstash1005 fully recovered all shards
[04:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:12:20] (PS4) Dzahn: tendril: add config template [puppet] - https://gerrit.wikimedia.org/r/224205 (https://phabricator.wikimedia.org/T98816)
[04:12:35] !log rebooting logstash1006
[04:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:13:58] PROBLEM - Host logstash1006 is DOWN: PING CRITICAL - Packet loss = 100%
[04:14:08] (PS8) Negative24: Phabricator: Create differential puppet role [puppet] - https://gerrit.wikimedia.org/r/222987 (https://phabricator.wikimedia.org/T104827)
[04:15:17] RECOVERY - Host logstash1006 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms
[04:20:59] !log Upgraded Elasticsearch to 1.6.0 on logstash1006
[04:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:21:10] !log Logstash cluster upgrade complete! Kibana working again
[04:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:21:24] jgage: ^ it's done!
[04:22:40] (CR) Dzahn: [C: 2] Fix new section creation for each edit [debs/adminbot] - https://gerrit.wikimedia.org/r/224212 (owner: BryanDavis)
[04:22:43] (Merged) jenkins-bot: Fix new section creation for each edit [debs/adminbot] - https://gerrit.wikimedia.org/r/224212 (owner: BryanDavis)
[04:24:40] I tried to cancel the downtime for the logstash icinga checks but it told me I don't have rights to do that. So... cancel at will. cluster is green and I'm done messing about.
[04:26:15] we have a shell script to schedule downtime, but currently needs shell on neon
[04:26:41] i wish we could somehow safely allow it on terbium and "per service"
[04:26:51] now that I have "manager" in my title m.ark is even less likely to give me root :)
[04:31:37] not root, a nicely crafted admin group with just the right sudo command that is needed but nothing else
[04:32:42] speaking of that.. I need to figure out how to sudo for the new rsync I want to add to scap to keep mira updated properly
[04:32:57] fun for tomorrow I think
[04:35:07] (PS1) Dzahn: bump version to 1.7.11 [debs/adminbot] - https://gerrit.wikimedia.org/r/224221
[04:36:36] (PS2) Dzahn: bump version to 1.7.11 [debs/adminbot] - https://gerrit.wikimedia.org/r/224221
[04:37:12] bd808: just this for now ^:?)
[04:37:26] but i dont think i should install it right now, heh
[04:37:38] might not be the best time
[04:37:40] just prepare it to be built and deployed
[04:37:44] wont touch repo
[04:38:24] why isn't that just a normal bot on the job grid?
[04:38:36] hysterical raisins?
[04:38:38] it is that too
[04:38:46] define normal :?:)
[04:39:10] virtualenv based
[04:39:20] * bd808 makes his own normal
[04:39:26] i dont know:)
[04:40:59] i just recently learned how to start/stop it, but it is running on the grid
[04:41:11] qdel, qstat etc
[04:41:58] I can't wait for Yuvi.Panda to replace that cruft
[04:42:04] and when you restart it it may run on a different exec node
[04:42:14] which may have a different package version
[04:42:19] heh
[04:42:21] if we had dpkg errors
[04:42:37] which makes you think "why did it work for him but not for me, we both restarted"
[04:42:53] I think "normal" as I defined it pulls from NFS
[04:43:15] from the tool's $HOME
[04:43:20] there are also multiple copies of it, serving the different channels , under different nick names
[04:44:41] (CR) Dzahn: [C: 2] "that - but i won't touch the repository right now at Friday night" [debs/adminbot] - https://gerrit.wikimedia.org/r/224221 (owner: Dzahn)
[04:45:53] labs exec nodes are without puppet errors after manual fixes
[04:46:07] until the next version is installed that is
[04:46:39] i believe the ones with trusty still had an issue, but it is actually running on the precise hosts which are fine
[04:47:31] out again for now
[04:49:38] o/
[04:55:57] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Jul 11 04:55:56 UTC 2015 (duration 55m 55s)
[04:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:57:53] (PS5) Dzahn: tendril: add config template [puppet] - https://gerrit.wikimedia.org/r/224205 (https://phabricator.wikimedia.org/T98816)
[05:01:01] (CR) Dzahn: "and after this we could close T98816 as being completed (moved from github to gerrit, cloned by puppet, fully reinstallable by just applyi" [puppet] - https://gerrit.wikimedia.org/r/224205 (https://phabricator.wikimedia.org/T98816) (owner: Dzahn)
[05:09:58] PROBLEM - Incoming network saturation on labstore2001 is CRITICAL 17.24%
of data above the critical threshold [100000000.0]
[05:14:58] (PS9) Negative24: Phabricator: Create differential puppet role [puppet] - https://gerrit.wikimedia.org/r/222987 (https://phabricator.wikimedia.org/T104827)
[05:40:22] (CR) Negative24: "Code passed the test (role configured everything from start to finish). Ready for merging." [puppet] - https://gerrit.wikimedia.org/r/222987 (https://phabricator.wikimedia.org/T104827) (owner: Negative24)
[06:30:07] PROBLEM - puppet last run on mw1199 is CRITICAL puppet fail
[06:31:08] PROBLEM - puppet last run on mc2007 is CRITICAL Puppet has 2 failures
[06:31:58] PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 1 failures
[06:32:38] PROBLEM - puppet last run on wtp2015 is CRITICAL Puppet has 2 failures
[06:32:48] PROBLEM - puppet last run on db2056 is CRITICAL Puppet has 1 failures
[06:33:08] PROBLEM - puppet last run on mw2207 is CRITICAL Puppet has 1 failures
[06:35:58] PROBLEM - Incoming network saturation on labstore2001 is CRITICAL 10.34% of data above the critical threshold [100000000.0]
[06:53:58] PROBLEM - puppet last run on db1026 is CRITICAL Puppet has 1 failures
[06:56:49] RECOVERY - puppet last run on wtp2015 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:56:58] RECOVERY - puppet last run on db2056 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:57:18] RECOVERY - puppet last run on mc2007 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:57:18] RECOVERY - puppet last run on mw2207 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:57:58] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:58] RECOVERY - puppet last run on mw1199 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:16:29] operations, Multimedia, Wikimedia-Media-storage, user-notice: upload.wikimedia.org down -
https://phabricator.wikimedia.org/T105304#1447047 (Nemo_bis)
[07:20:08] RECOVERY - puppet last run on db1026 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:22:48] PROBLEM - Incoming network saturation on labstore2001 is CRITICAL 13.79% of data above the critical threshold [100000000.0]
[07:49:33] operations, Multimedia, Performance-Team, Wikimedia-Site-requests: Please offer larger image thumbnail sizes in Special:Preferences - https://phabricator.wikimedia.org/T65440#1447063 (Edokter) I chose those values as they inherently include their own 1.5x and 2x versions (not al of them need to be l...
[08:28:07] PROBLEM - puppet last run on elastic1020 is CRITICAL Puppet has 1 failures
[08:55:58] RECOVERY - puppet last run on elastic1020 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:05:49] (Abandoned) Giuseppe Lavagetto: BOGUS: attempt at making nodepool compile in my tests [puppet] - https://gerrit.wikimedia.org/r/224032 (owner: Giuseppe Lavagetto)
[09:41:48] PROBLEM - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1438 bytes in 0.122 second response time
[11:51:48] PROBLEM - puppet last run on cp4005 is CRITICAL puppet fail
[12:19:50] RECOVERY - puppet last run on cp4005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:33:46] operations, ops-eqiad: db1050 raid degraded - https://phabricator.wikimedia.org/T103110#1447335 (jcrespo) @Cmjohnson @Springle db1002-db1007 are not being used as far as I know. How crazy would be to decommission some or of all of it and use the disks to fix db1050 + having a pool of replacements?
[12:57:25] (PS1) Chmarkine: Rank all ECDHE > all DHE > all RSA [puppet] - https://gerrit.wikimedia.org/r/224232 (https://phabricator.wikimedia.org/T105455)
[13:00:25] (PS2) Chmarkine: Rank all ECDHE > all DHE > all RSA [puppet] - https://gerrit.wikimedia.org/r/224232 (https://phabricator.wikimedia.org/T105455)
[13:45:32] gwicke: around?
[13:45:43] You should check the length of the restbase job queues...
[13:45:51] they are in the millions for Wikidata
[13:46:57] "A million jobs isn't cool. You know what's cool? A billion jobs!"
[13:47:03] * YuviPanda disappears back into airport
[13:51:10] Is that a comment from Jeb Bush on Obama results?
[14:27:58] operations, ops-eqiad: db1050 raid degraded - https://phabricator.wikimedia.org/T103110#1447380 (Cmjohnson) From my end it's not difficult at all. Let's wait for @springle to comment.
[14:31:59] PROBLEM - Persistent high iowait on labstore2001 is CRITICAL 75.00% of data above the critical threshold [35.0]
[14:32:10] operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate, and 3 others: Publishing translations for central notice banners fails - https://phabricator.wikimedia.org/T104774#1447396 (AndyRussG) p:Unbreak!>High
[14:39:28] RECOVERY - Persistent high iowait on labstore2001 is OK Less than 50.00% above the threshold [25.0]
[15:10:59] PROBLEM - Persistent high iowait on labstore2001 is CRITICAL 62.50% of data above the critical threshold [35.0]
[15:12:57] RECOVERY - Persistent high iowait on labstore2001 is OK Less than 50.00% above the threshold [25.0]
[17:22:27] operations, MediaWiki-ResourceLoader, HHVM, MW-1.26-release, and 3 others: HHVM memory leaks result in OOMs & 500 spikes - https://phabricator.wikimedia.org/T104769#1447493 (Joe) For the first time, I've seen today expired keys getting expunged. It seems promising but I guess we'll have a definitive...
[17:46:30] (PS1) Ori.livneh: varnishrls: handle interleaved transactions [puppet] - https://gerrit.wikimedia.org/r/224238
[17:51:56] (PS1) Glaisher: Add a note about RCStream to irc.wikimedia.org MOTD [puppet] - https://gerrit.wikimedia.org/r/224242 (https://phabricator.wikimedia.org/T87780)
[17:52:56] (CR) Ori.livneh: [C: 2] "Tested on cp1065. Overall CPU utilization is reduced." [puppet] - https://gerrit.wikimedia.org/r/224238 (owner: Ori.livneh)
[17:55:18] RECOVERY - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1410 bytes in 0.452 second response time
[18:03:12] (CR) Ori.livneh: Add a note about RCStream to irc.wikimedia.org MOTD (1 comment) [puppet] - https://gerrit.wikimedia.org/r/224242 (https://phabricator.wikimedia.org/T87780) (owner: Glaisher)
[18:05:37] operations, Wikimedia-DNS, Wikimedia-General-or-Unknown, HTTPS: Certificate error on https://www.meta.wikimedia.org/ redirect - https://phabricator.wikimedia.org/T105098#1447526 (Glaisher) www.$lang.$project.org domains were also killed recently due to this issue (T102815). I think this should als...
[18:10:25] (CR) Glaisher: Add a note about RCStream to irc.wikimedia.org MOTD (1 comment) [puppet] - https://gerrit.wikimedia.org/r/224242 (https://phabricator.wikimedia.org/T87780) (owner: Glaisher)
[18:11:48] ssh_exchange_identification: read: Connection reset by peer
[18:11:49] fatal: Could not read from remote repository.
[18:11:52] ??
[18:13:01] (PS2) Glaisher: Add a note about RCStream to irc.wikimedia.org MOTD [puppet] - https://gerrit.wikimedia.org/r/224242 (https://phabricator.wikimedia.org/T87780)
[18:13:25] Glaisher, when connecting to what?
[18:13:27] gerrit?
[18:13:34] git review
[18:13:48] what does `git remote -v` show?
[18:14:09] gerrit ssh://glaisher@gerrit.wikimedia.org:29418/operations/puppet.git (fetch)
[18:14:12] gerrit ssh://glaisher@gerrit.wikimedia.org:29418/operations/puppet.git (push)
[18:14:12] origin ssh://glaisher@gerrit.wikimedia.org:29418/operations/puppet (fetch)
[18:14:12] origin ssh://glaisher@gerrit.wikimedia.org:29418/operations/puppet (push)
[18:14:32] should just work
[18:14:41] did it randomly fail and then start working again?
[18:15:20] It never happened before
[18:15:27] (happened a few times this month though)
[18:15:32] but on the second try, it works
[18:15:44] weird
[18:15:55] I don't remember changing anything on my side
[18:18:46] operations, Wikimedia-Apache-configuration, Wikimedia-DNS, domains: Faulty DNS setup for wikipedia.is - https://phabricator.wikimedia.org/T103915#1447532 (Glaisher) Can we close this now? @Slaporte Can you ask for reconfirmation from them?
[18:24:13] (CR) Ori.livneh: [C: 1] "Thanks" [puppet] - https://gerrit.wikimedia.org/r/224242 (https://phabricator.wikimedia.org/T87780) (owner: Glaisher)
[18:58:27] PROBLEM - Incoming network saturation on labstore2001 is CRITICAL 10.34% of data above the critical threshold [100000000.0]
[19:33:36] !log restbase: setting gc_grace_seconds to 604800 (1 week) on local_group_wikipedia_T_parsoid_html.data
[19:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:37:55] !log from restbase1002, starting revision culling process (node thin_out_key_rev_value_data.js `hostname -i` local_group_wikimedia_T_parsoid_html 2>&1 | tee >(gzip -c > local_group_wikimedia_T_parsoid_html.log.`date +%s`.gz))
[19:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:47:42] operations, Wikimedia-DNS, Wikimedia-General-or-Unknown, HTTPS: Certificate error on https://www.meta.wikimedia.org/ redirect - https://phabricator.wikimedia.org/T105098#1447577 (BBlack)
[19:47:43] operations, Traffic: Fix/decom
multiple-subdomain wikis in wikimedia.org - https://phabricator.wikimedia.org/T102826#1447578 (BBlack)
[19:48:46] !log stopping labsdb1002 after table corruption has been detected
[19:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:55:26] (CR) BBlack: [C: -1] "I think this is probably the right move to make, but let's hold off until after the primary unified-cert domains gain their ECDSA key (sho" [puppet] - https://gerrit.wikimedia.org/r/224232 (https://phabricator.wikimedia.org/T105455) (owner: Chmarkine)
[20:06:07] PROBLEM - puppet last run on mw2204 is CRITICAL puppet fail
[20:34:09] RECOVERY - puppet last run on mw2204 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures
[20:37:41] (CR) BBlack: "Actually, looking at the big cipher list table a little down from this anchor ..." [puppet] - https://gerrit.wikimedia.org/r/224232 (https://phabricator.wikimedia.org/T105455) (owner: Chmarkine)
[21:39:32] operations, Wikimedia-Logstash: reinstall logstash1001-1003 - https://phabricator.wikimedia.org/T97545#1447683 (bd808)
[21:58:09] operations, Wikimedia-Logstash: reinstall logstash1001-1003 - https://phabricator.wikimedia.org/T97545#1447698 (bd808) a:bd808>RobH Assigning back to @RobH so he can coordinate the next steps based on the rough outline in T97545#1441645.
[22:22:13] (PS3) Florianschmidtwelzow: Enable alternate and canonical links for mobile/desktop pages [mediawiki-config] - https://gerrit.wikimedia.org/r/212022 (https://phabricator.wikimedia.org/T99587)
[22:22:42] (Abandoned) Florianschmidtwelzow: Enable alternate and canonical links for mobile/desktop pages [mediawiki-config] - https://gerrit.wikimedia.org/r/212022 (https://phabricator.wikimedia.org/T99587) (owner: Florianschmidtwelzow)
[22:35:24] This is the appropriate way to heal servers!
:) https://www.flickr.com/photos/textfiles/19005464958/
[22:41:04] * _joe_ changes his title to "bearded shaman"
[22:46:02] :)
[23:01:27] PROBLEM - puppet last run on cp3048 is CRITICAL puppet fail
[23:03:22] (PS1) Ori.livneh: varnishrls: Don't crash on Cache-control: private [puppet] - https://gerrit.wikimedia.org/r/224294
[23:03:46] (CR) Ori.livneh: [C: 2 V: 2] varnishrls: Don't crash on Cache-control: private [puppet] - https://gerrit.wikimedia.org/r/224294 (owner: Ori.livneh)
[23:27:28] RECOVERY - puppet last run on cp3048 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures