[00:04:03] (PS4) Alex Monk: Make foreachwiki accept dblist expressions [puppet] - https://gerrit.wikimedia.org/r/232675 (https://phabricator.wikimedia.org/T101213)
[00:04:41] ori, ^
[00:04:53] (PS5) Alex Monk: Make foreachwikiindblist accept dblist expressions [puppet] - https://gerrit.wikimedia.org/r/232675 (https://phabricator.wikimedia.org/T101213)
[00:18:57] Krenair: lgtm, do you need it now or could I test it later?
[00:19:21] Not urgent at all
[00:20:06] it's a useful feature addition really
[00:32:48] operations: rename gerrit2 account in LDAP - https://phabricator.wikimedia.org/T80648#1613008 (demon) Open>declined a:demon Never gonna happen. More likely to kill Gerrit first :p
[00:39:14] (PS1) Alex Monk: Merge the two copies of udpmxircecho [puppet] - https://gerrit.wikimedia.org/r/236496
[00:45:26] (CR) John Vandenberg: toollabs: add script to generate python package listings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/228635 (https://phabricator.wikimedia.org/T101646) (owner: Merlijn van Deen)
[00:49:15] ori, reckon we can just delete the one in templates/misc?
[01:50:17] (Abandoned) Alex Monk: Merge the two copies of udpmxircecho [puppet] - https://gerrit.wikimedia.org/r/236496 (owner: Alex Monk)
[02:05:43] PROBLEM - puppet last run on ms-be1003 is CRITICAL: Timeout while attempting connection
[02:07:03] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100%
[02:08:10] (PS1) Alex Monk: rm templates/misc/udpmxircecho.py.erb [puppet] - https://gerrit.wikimedia.org/r/236499
[02:10:06] (PS1) Alex Monk: tcpircbot: Also take input from stdin [puppet] - https://gerrit.wikimedia.org/r/236500
[02:11:01] (CR) jenkins-bot: [V: -1] tcpircbot: Also take input from stdin [puppet] - https://gerrit.wikimedia.org/r/236500 (owner: Alex Monk)
[02:12:37] 02:10:15 modules/tcpircbot/files/tcpircbot.py:127:80: E501 line too long (82 > 79 characters)
[02:12:40] * Krenair grumbles
[02:13:09] (PS2) Alex Monk: tcpircbot: Also take input from stdin [puppet] - https://gerrit.wikimedia.org/r/236500
[02:13:35] PROBLEM - puppet last run on mw2042 is CRITICAL: CRITICAL: puppet fail
[02:20:09] !log l10nupdate@tin Synchronized php-1.26wmf21/cache/l10n: l10nupdate for 1.26wmf21 (duration: 06m 22s)
[02:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:23:28] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf21) at 2015-09-07 02:23:27+00:00
[02:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:29:34] !log mwscript deleteEqualMessages.php --wiki pmswiki
[02:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:41:34] RECOVERY - puppet last run on mw2042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:49:16] (CR) Alex Monk: [C: -1] tcpircbot: Also take input from stdin [puppet] - https://gerrit.wikimedia.org/r/236500 (owner: Alex Monk)
[03:54:53] (PS28) Gergő Tisza: Basic role for Sentry [puppet] -
https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: Gilles)
[03:59:34] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (102133s 100000s)
[04:31:35] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (31641 100000s)
[04:33:11] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Sep 7 04:33:11 UTC 2015 (duration 33m 10s)
[04:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:00:36] (Abandoned) Ricordisamoa: Add categories for quality badges on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/205013 (owner: Ricordisamoa)
[06:04:00] (PS29) Gergő Tisza: Basic role for Sentry [puppet] - https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: Gilles)
[06:20:06] operations: Encrypt all the things - https://phabricator.wikimedia.org/T111653#1613158 (MoritzMuehlenhoff)
[06:29:58] operations, ContentTranslation-Deployments, MediaWiki-extensions-ContentTranslation, ContentTranslation-Release6, and 4 others: Review and create table for Content Translation - https://phabricator.wikimedia.org/T111317#1613171 (Arrbee)
[06:30:04] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:34] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:14] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: puppet fail
[06:31:44] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:44] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:03] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:13] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:16] PROBLEM -
puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:24] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:34] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:44] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:33:24] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:25] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:25] PROBLEM - puppet last run on mw2043 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:43] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:55:54] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:56:13] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:56:24] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:56:24] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:56:33] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:57:04] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:57:13] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:57:13] RECOVERY - puppet last run on mw2043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:14] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:57:33] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:33] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:57:44] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:54] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:03] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:33] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:01:43] operations, Labs, Labs-Other-Projects: labstore1003 alerting because of network saturation - https://phabricator.wikimedia.org/T110881#1613196 (Nemo_bis) Open>Resolved a:Nemo_bis (Note, creating a report with me cc'd doesn't produce notifications, per T107552.) What's the limit? [[https://gan...
[07:16:15] (PS5) Phedenskog: Collect missing Navigation Timing metrics [puppet] - https://gerrit.wikimedia.org/r/236024 (https://phabricator.wikimedia.org/T109756)
[07:18:45] operations: ircecho should support nickserv registration - https://phabricator.wikimedia.org/T48254#1613224 (hashar)
[07:26:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[07:30:55] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 3 below the confidence bounds
[07:36:55] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 4 below the confidence bounds
[07:40:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 3 below the confidence bounds
[07:45:19] (PS3) Addshore: wgRCWatchCategoryMembership false for commons & wikisource
[mediawiki-config] - https://gerrit.wikimedia.org/r/235467 (https://phabricator.wikimedia.org/T109707)
[07:46:53] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 4 below the confidence bounds
[07:52:45] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[07:56:44] operations: morebots need restart - https://phabricator.wikimedia.org/T28782#1613261 (hashar) For later reference, the restart doc is at https://wikitech.wikimedia.org/wiki/Morebots . Require a tools labs account in the `morebots` group.
[07:56:45] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 4 below the confidence bounds
[08:00:04] hashar zeljkof: Dear anthropoid, the time has come. Please deploy CI infrastructure (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150907T0800).
[08:00:15] yeahhhh
[08:00:17] zeljkof-meeting: ^^: -)
[08:00:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[08:00:55] hashar: coming, installing hangouts plugin :|
[08:01:14] we are upgrading Wikimedia Jenkins right now
[08:05:07] !log Upgrading Jenkins
[08:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:06:53] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 4 below the confidence bounds
[08:12:17] operations, ContentTranslation-Deployments, ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, and 5 others: Standardise CXServer deployment - https://phabricator.wikimedia.org/T101272#1613281 (KartikMistry)
[08:13:13] !log Jenkins upgraded to latest LTS ( https://phabricator.wikimedia.org/T111326 )
[08:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:10:34] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[09:20:38] morning jynus
[09:20:45] paravoid, morning
[09:33:08] operations, hardware-requests: Request three servers for Pageview API - https://phabricator.wikimedia.org/T111053#1613380 (fgiunchedi) ssd vs no ssd I guess depends on the workload, if we are bulk writing and enough ram to cache reads also spinning disks might do it (for comparison, restbase sees ~200 IOP...
[09:34:28] operations, MediaWiki-extensions-GWToolset, Multimedia, Performance: Can Commons support a mass upload of 14 million files (1.5 TB)? - https://phabricator.wikimedia.org/T88758#1613381 (fgiunchedi)
[09:35:02] operations, MediaWiki-extensions-GWToolset, Multimedia, Performance: Can Commons support a mass upload of 14 million files (1.5 TB)?
- https://phabricator.wikimedia.org/T88758#1019686 (fgiunchedi) afaik this isn't blocked on operations, see https://phabricator.wikimedia.org/T88758#1156490
[09:39:40] (PS1) Muehlenhoff: Add definitions for LVSes in codfw [puppet] - https://gerrit.wikimedia.org/r/236519
[09:41:34] moritzm: hey! got a few minutes to help me debug an iptables issue?
[09:42:36] YuviPanda: sure, which system?
[09:42:52] moritzm: k8s-webproxy-02.eqiad.wmflabs
[09:43:23] woah
[09:43:25] it works now
[09:43:25] wtf
[09:43:46] aha
[09:43:48] no it doesn't
[09:43:56] moritzm: so
[09:43:59] yuvipanda@k8s-webproxy-02:~$ dig @192.168.0.100 trusty-php-9.default.svc.kube
[09:43:59] ;; reply from unexpected source: 192.168.46.0#47192, expected 192.168.0.100#53
[09:44:16] with
[09:44:19] REDIRECT udp -- anywhere 192.168.0.100 /* default/kube-dns:dns */ udp dpt:domain redir ports 47192
[09:44:19] DNAT udp -- anywhere 192.168.0.100 /* default/kube-dns:dns */ udp dpt:domain to:10.68.18.80:47192
[09:44:25] there are the same rules for tcp
[09:44:32] and DNS via tcp works, just not via udp
[09:45:22] let me have a look, I can't login, though. Missing some group membership on wikitech?
[09:45:44] moritzm: yeah, you can use your root key and login as root but I'll add you to the project now
[09:45:48] moritzm: what's your wikitech username?
[09:46:13] Muehlenhoff
[09:46:33] moritzm: ok I've added you
[09:46:41] dnat changes the destination, not the source
[09:48:35] paravoid: hmm, so for all other similar iptables rules, there's only DNAT and no SNAT, but they all do work (but they're also all tcp). is tcp ok with not having replies come back from where the connection was established to?
[09:48:45] paravoid: also, can I test that by manually adding an SNAT rule?
[09:48:51] no it's not
[09:49:21] I figured, so am trying to figure out why tcp works and udp doesn't.
[09:49:34] is it all in the same box?
[09:49:40] what are you trying to do exactly?
[09:49:46] ah, so this is kubernetes
[09:49:50] these are 'service IPs'
[09:50:10] points to a container running on a different box, the iptables rules are managed by kubernetes.
[09:50:34] according to the docs this should work, and it does for TCP. I'm trying to gather enough data for a bug report
[09:51:06] so what it does is sets up iptables rules to make those 'service IPs' work by forwarding them appropriately. and these rules are maintained by the kube-proxy component
[09:51:25] so if this is a bug there, I need to figure out what exactly that bug is.
[09:51:34] why do you have both a REDIRECT and a DNAT?
[09:52:00] this is complicated by me not fully understanding how things work at the IP layer.
[09:52:19] paravoid: that's what kube-proxy sets up.
[09:52:26] that doesn't make any sense
[09:52:58] redirect "rewrites" the destination to be the localhost
[09:53:08] dnat rewrites it to be 10.68.18.80
[09:54:14] operations, Database: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1613396 (jcrespo) @faidon Are you suggesting encrypting WAN links or all of them? Because all of them has a price, and should be provisioned adequately: https://www.percona.com/blog/2013/10/10/mysql-ssl-per...
[09:54:53] paravoid: I see. so there's an overlay network also involved, I wonder if that's somehow related
[09:55:09] (this is why I've been reading up on networking stuff lately, since I'm slightly in over my head here)
[09:56:23] operations, Database: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1613397 (faidon) Definitely cross-DC links, unsure about the rest. I'd like to see the performance impact on our own first anyway — the data we have from the HTTPS transition suggests that the overhead was...
[09:56:42] kube-prox 8665 root 16u IPv6 33053 0t0 UDP *:47192
[09:56:53] paravoid: ^ kube-proxy is listening on that port
[09:57:02] (that's the one from the REDIRECT)
[09:58:08] hmm, also the redirect is chained from PREROUTING while the DNAT is in OUTPUT
[09:58:50] ok, then it's the opposite flow, that makes sense
[09:59:16] well, in that case though, it should have been sport, no?
[10:00:03] operations, Performance-Team, Graphite, Patch-For-Review: "sum" aggregation broken in Graphite - https://phabricator.wikimedia.org/T111170#1613408 (fgiunchedi) a:Krinkle>fgiunchedi I'll take this one to run the script from @ori
[10:00:15] wait, let me look myself at that iptables output
[10:00:55] yeah, I realize I should've pasted full output. but this is in k8s-webproxy-02
[10:01:06] operations, ContentTranslation-Deployments, MediaWiki-extensions-ContentTranslation, ContentTranslation-Release6, and 4 others: Review and create table for Content Translation - https://phabricator.wikimedia.org/T111317#1613412 (jcrespo) a:jcrespo>None
[10:01:19] paravoid: in addition, this *does* work in k8s-worker-01 (And other worker nodes). Just not on the nodes with just the kube-proxy and the overlay network
[10:02:41] and afaict the iptables rules aren't any different
[10:03:30] and you're running dig from k8s-webproxy-02 too, right?
[10:03:39] paravoid: yes
[10:03:46] so yeah, the PREROUTING rules aren't being used at all here
[10:03:49] forget about them
[10:03:52] ok
[10:04:49] !log powercycle ms-be1003, loadavg skyrocketed
[10:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:07:33] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 2.65 ms
[10:08:25] RECOVERY - very high load average likely xfs on ms-be1003 is OK: OK - load average: 24.36, 8.94, 3.23
[10:08:34] RECOVERY - puppet last run on ms-be1003 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[10:08:50] well, yes, you're missing an SNAT :)
[10:09:08] paravoid: so why does it work on k8s-worker-01?
[10:09:11] 10:05:49.219844 IP 192.168.46.0.55008 > 10.68.18.80.47192: UDP, length 58
[10:09:14] 10:05:49.225730 IP 192.168.46.0.47192 > 192.168.46.0.55008: UDP, length 63
[10:10:03] the original flow was 192.168.46.0.55008 -> 192.168.0.100:53, right
[10:10:30] right
[10:10:31] the DNAT changes the destination to 10.68.18.80.47192
[10:10:58] and then you get a response from the machine itself
[10:11:18] * YuviPanda facepalms
[10:11:26] it works from worker-01 because the dns server is on that instance
[10:11:42] (doesn't work from -03)
[10:11:48] on an unrelated note, the 192.168.46.0/16 is not a valid IP to assign to a machine
[10:12:24] operations: ms-be1003 has irregular metrics on Ganglia - https://phabricator.wikimedia.org/T111658#1613420 (fgiunchedi) Open>Resolved a:fgiunchedi powercycled, xfs wasn't amused
[10:12:46] it works, usually, sometimes, not a good idea though
[10:13:06] never use the first and the last IP address in an (IPv4) subnet
[10:13:09] paravoid: oh?
I assigned flannel 192.168.0.0/16 and it is supposed to hand out /24s to each instance
[10:13:10] (unless it's a /31 :)
[10:13:32] the first is called the "network" address, the last the "broadcast"
[10:13:54] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:14:34] hmm, I wonder if flannel will actually use / hand out either of those
[10:15:43] paravoid: ok, so if I understand this correctly, you're saying that kube-proxy should also setup SNAT and it isn't doing so?
[10:20:20] * YuviPanda is furiously reading up on iptables NATing to feel less dumb
[10:21:54] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61480 bytes in 5.231 second response time
[10:22:54] operations, Wikimedia-Language-setup, Wikimedia-Site-Requests, Patch-For-Review: Rename "be-x-old" to "be-tarask" - https://phabricator.wikimedia.org/T11823#1613434 (Elitre) If this is fixed, I think there should be user notice of the change. I assume stuff like MassMessage delivery lists need to b...
[10:23:05] jynus: around?
[10:23:20] kart_, yes
[10:23:23] YuviPanda: https://www.frozentux.net/iptables-tutorial/chunkyhtml/images/tables_traverse.jpg
[10:23:38] jynus: we can go ahead with ContentTranslation tables.
[10:23:56] jynus: and encoding fix.
[10:24:31] ok, I will have a look at it at the end of the week
[10:24:40] paravoid: in the case of dig, I only care about OUTPUT and POSTROUTING, right?
[10:24:48] since it's a local process
[10:25:01] remember, you have two different flows here
[10:25:14] one is the DNS request, the other one is the response
[10:25:48] aaah, right.
[10:25:49] also, even in the case of the DNS request
[10:25:59] the rewritten address is local again, right?
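An aside on the "first and last address" remark above: a minimal sketch, using Python's standard `ipaddress` module, of how the same address can be an ordinary host address under one prefix length and the network address under another. The per-instance /24 (`192.168.46.0/24`) is an assumption inferred from the "hand out /24s" comment, not something the log confirms, and `describe` is a hypothetical helper written for this illustration.

```python
import ipaddress

# Overlay range quoted in the log.
overlay = ipaddress.ip_network("192.168.0.0/16")
# ASSUMPTION: the /24 slice handed to this instance; not confirmed by the log.
per_host = ipaddress.ip_network("192.168.46.0/24")


def describe(addr: str, net: ipaddress.IPv4Network) -> str:
    """Classify addr relative to net (hypothetical helper for illustration)."""
    ip = ipaddress.ip_address(addr)
    if ip not in net:
        return "outside"
    # /31 point-to-point links (RFC 3021) reserve no addresses.
    if net.prefixlen >= 31:
        return "host"
    if ip == net.network_address:
        return "network"
    if ip == net.broadcast_address:
        return "broadcast"
    return "host"


for net in (overlay, per_host):
    print(net, "->", describe("192.168.46.0", net))
```

Whether 192.168.46.0 is "first in the subnet" therefore depends on which mask is actually configured on the interface.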
[10:26:39] so it goes local process -> OUTPUT -> routing -> POSTROUTING -> network -> network -> PREROUTING -> routing -> INPUT -> local process
[10:26:45] it loops back in again right
[10:26:55] (CR) Filippo Giunchedi: [C: -1] "LGTM, not sure about the graceful stop script" (2 comments) [debs/nodepool] (debian) - https://gerrit.wikimedia.org/r/224390 (https://phabricator.wikimedia.org/T96867) (owner: Hashar)
[10:27:05] aaaaha! I had somehow not noticed that the to in the DNAT all point to localhost
[10:27:33] not localhost, 10.68.18.80 which is a local address
[10:29:03] ok
[10:29:41] (PS3) Gilles: Send image varnish frontend data from logs to statsd [puppet] - https://gerrit.wikimedia.org/r/234157 (https://phabricator.wikimedia.org/T105681)
[10:30:42] (CR) jenkins-bot: [V: -1] Send image varnish frontend data from logs to statsd [puppet] - https://gerrit.wikimedia.org/r/234157 (https://phabricator.wikimedia.org/T105681) (owner: Gilles)
[10:30:56] I'm curious, can you back up a bit on how this whole thing is supposed to work?
[10:31:10] paravoid: https://github.com/kubernetes/kubernetes/blob/master/docs/design/networking.md is their networking design doc
[10:31:39] (CR) Muehlenhoff: 0.1.1-wmf3: statsd and systemd support (1 comment) [debs/nodepool] (debian) - https://gerrit.wikimedia.org/r/224390 (https://phabricator.wikimedia.org/T96867) (owner: Hashar)
[10:32:01] Ops-Access-Requests, operations: Requesting access to hadoop / hive (analytics-privatedata-users) for Addshore - https://phabricator.wikimedia.org/T111204#1613452 (jcrespo) p:Triage>Normal
[10:34:10] operations, hardware-requests: Request three servers for Pageview API - https://phabricator.wikimedia.org/T111053#1613455 (akosiaris) What @fgiunchedi says. Datastax officially recommends SSDs for cassandra http://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architecturePlanningHardware_c.h...
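The "reply from unexpected source" symptom discussed above can be illustrated with a toy model. This is a rough sketch only, not how netfilter is implemented: it assumes that a DNAT'd flow is un-NATed on the way back only when the reply's source matches the rewritten destination. The addresses and ports are the ones from the tcpdump lines in the log; the `Packet` and `ConnTrack` names are hypothetical.

```python
from typing import Dict, NamedTuple


class Packet(NamedTuple):
    src: str
    sport: int
    dst: str
    dport: int


class ConnTrack:
    """Toy DNAT state: which reply is expected, and how it gets rewritten."""

    def __init__(self) -> None:
        self.expected: Dict[Packet, Packet] = {}

    def dnat(self, pkt: Packet, to_ip: str, to_port: int) -> Packet:
        # Only the destination is rewritten; the source is left alone.
        out = pkt._replace(dst=to_ip, dport=to_port)
        # A reply is recognised only if it comes FROM the new destination;
        # it is then shown to the client as coming from the original one.
        seen = Packet(to_ip, to_port, pkt.src, pkt.sport)
        shown = Packet(pkt.dst, pkt.dport, pkt.src, pkt.sport)
        self.expected[seen] = shown
        return out

    def reply(self, pkt: Packet) -> Packet:
        # Unrecognised replies pass through untouched.
        return self.expected.get(pkt, pkt)


ct = ConnTrack()
query = Packet("192.168.46.0", 55008, "192.168.0.100", 53)
sent = ct.dnat(query, "10.68.18.80", 47192)

# Reply sourced from the DNAT target: rewritten back to the service IP.
good = ct.reply(Packet("10.68.18.80", 47192, "192.168.46.0", 55008))
# Reply sourced from the overlay address (as in the tcpdump): left as-is,
# so the client sees 192.168.46.0#47192 instead of 192.168.0.100#53.
bad = ct.reply(Packet("192.168.46.0", 47192, "192.168.46.0", 55008))

print(good)
print(bad)
```

In this model the second reply reaches the client with the source that dig complained about, which is consistent with the diagnosis given in the channel.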
[10:35:20] paravoid: but kube-proxy and the overlay network are supposed to allow access to service IPs in all the nodes they're setup in, via iptables rules.
[10:35:43] which means I'm now confused as to why the iptables rules just seem to forward packets into ports that the kube-proxy is listening on
[10:36:39] (CR) Alexandros Kosiaris: Add definitions for LVSes in codfw (1 comment) [puppet] - https://gerrit.wikimedia.org/r/236519 (owner: Muehlenhoff)
[10:38:31] (PS4) Gilles: Send image varnish frontend data from logs to statsd [puppet] - https://gerrit.wikimedia.org/r/234157 (https://phabricator.wikimedia.org/T105681)
[10:39:16] (CR) jenkins-bot: [V: -1] Send image varnish frontend data from logs to statsd [puppet] - https://gerrit.wikimedia.org/r/234157 (https://phabricator.wikimedia.org/T105681) (owner: Gilles)
[10:39:52] (PS2) Filippo Giunchedi: WIP: certificate/keystore generation script [puppet] - https://gerrit.wikimedia.org/r/236389 (https://phabricator.wikimedia.org/T108953) (owner: Eevans)
[10:41:08] (PS5) Gilles: Send image varnish frontend data from logs to statsd [puppet] - https://gerrit.wikimedia.org/r/234157 (https://phabricator.wikimedia.org/T105681)
[10:41:52] (CR) jenkins-bot: [V: -1] Send image varnish frontend data from logs to statsd [puppet] - https://gerrit.wikimedia.org/r/234157 (https://phabricator.wikimedia.org/T105681) (owner: Gilles)
[10:44:05] (PS6) Gilles: Send image varnish frontend data from logs to statsd [puppet] - https://gerrit.wikimedia.org/r/234157 (https://phabricator.wikimedia.org/T105681)
[10:45:01] (CR) Zhuyifei1999: [C: 1] Fix URL to interwiki cache on noc.wikimedia.org [dumps] (ariel) - https://gerrit.wikimedia.org/r/220075 (owner: Hydriz)
[10:45:09] (PS7) Gilles: Send image varnish frontend data from logs to statsd [puppet] - https://gerrit.wikimedia.org/r/234157 (https://phabricator.wikimedia.org/T105681)
[10:45:29] Ops-Access-Requests,
operations, Services, Icinga, Monitoring: give services team permissions to send commands in icinga - https://phabricator.wikimedia.org/T105228#1613491 (jcrespo)
[10:46:31] (CR) Muehlenhoff: Add definitions for LVSes in codfw (1 comment) [puppet] - https://gerrit.wikimedia.org/r/236519 (owner: Muehlenhoff)
[10:46:49] (PS3) Filippo Giunchedi: WIP: certificate/keystore generation script [puppet] - https://gerrit.wikimedia.org/r/236389 (https://phabricator.wikimedia.org/T108953) (owner: Eevans)
[10:58:14] (PS2) Muehlenhoff: Add definitions for LVSes in codfw [puppet] - https://gerrit.wikimedia.org/r/236519
[10:59:01] (PS4) Filippo Giunchedi: WIP: certificate/keystore generation script [puppet] - https://gerrit.wikimedia.org/r/236389 (https://phabricator.wikimedia.org/T108953) (owner: Eevans)
[11:04:40] (PS5) Filippo Giunchedi: WIP: certificate/keystore generation script [puppet] - https://gerrit.wikimedia.org/r/236389 (https://phabricator.wikimedia.org/T108953) (owner: Eevans)
[11:14:41] Ops-Access-Requests, operations: Requesting access to elasticsearch-roots - https://phabricator.wikimedia.org/T111473#1613568 (jcrespo) p:Triage>Normal
[11:16:04] Ops-Access-Requests, operations, Continuous-Integration-Scaling, Patch-For-Review: contint-admins can't start/stop nodepool (lack sudo) - https://phabricator.wikimedia.org/T111374#1613571 (jcrespo) p:Triage>Normal
[11:16:51] Ops-Access-Requests, operations, Patch-For-Review: Request to access apertium-apy service restart - https://phabricator.wikimedia.org/T111360#1613575 (jcrespo) p:Triage>Normal
[11:24:23] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[11:34:41] http://www.cisco.com/c/en/us/support/docs/field-notices/636/fn63697.html (via Unhammer)
[12:25:23] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly
detected: 10 data above and 0 below the confidence bounds
[12:30:13] paravoid: I've re-commented on https://github.com/kubernetes/kubernetes/issues/2996
[12:33:34] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds
[12:34:32] operations, Wikimedia-Language-setup, Wikimedia-Site-Requests, Patch-For-Review: Rename "be-x-old" to "be-tarask" - https://phabricator.wikimedia.org/T11823#1613671 (Krenair) >>! In T11823#1613434, @Elitre wrote: > If this is fixed, I think there should be user notice of the change. I assume stuff...
[13:08:05] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds
[13:12:13] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds
[13:12:35] (CR) Mobrovac: [C: 1] Disallow indexing for /api/ [mediawiki-config] - https://gerrit.wikimedia.org/r/236200 (https://phabricator.wikimedia.org/T109023) (owner: GWicke)
[13:12:48] (PS1) Yuvipanda: labs: Allow NFS to be turned off on a per-instance level [puppet] - https://gerrit.wikimedia.org/r/236543
[13:16:11] (PS2) Yuvipanda: labs: Allow NFS to be turned off on a per-instance level [puppet] - https://gerrit.wikimedia.org/r/236543
[13:24:15] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds
[13:34:35] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds
[13:35:44] (CR) Alexandros Kosiaris: "Since my previous comments, there has been a meeting where a lot of things were clarified. This is a first round of comments.
Apart from t" (36 comments) [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: Milimetric)
[13:37:24] (CR) Alexandros Kosiaris: "Now that the goals of this are clear, I removed and readded Giuseppe to remove the -2 given before due to misunderstanding." [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: Milimetric)
[13:44:44] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds
[13:44:59] operations, Performance-Team, Graphite, Patch-For-Review: "sum" aggregation broken in Graphite - https://phabricator.wikimedia.org/T111170#1613822 (fgiunchedi) since the above seems to be working I've ran the script to all `mw.js` hierarchy, will apply everywhere tomorrow if no issues arise
[13:54:53] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 7 below the confidence bounds
[14:01:05] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[14:04:01] operations, RESTBase-Cassandra, hardware-requests: codfw 3x spares for cassandra encryption testing - https://phabricator.wikimedia.org/T111382#1613868 (fgiunchedi) >>! In T111382#1605591, @fgiunchedi wrote: >>>! In T111382#1602899, @GWicke wrote: >>> The recent restbase expansion order was entirely f...
[14:04:41] operations: icinga (neon) is out of CPU headroom - https://phabricator.wikimedia.org/T110822#1613879 (jcrespo) p:Triage>Normal @BBlack, in your opinion, is this something that should be handled in hardware or in software, having into account cost effectiveness, or did you not investigate enough to have...
[14:11:33] kart_, could you update the ticket about the configuration change to assign it to me and explicitly say that you are waiting on me?
Otherwise, I may miss it
[14:12:23] (sorry, too many things to keep track of otherwise :-))
[14:14:44] (PS3) Filippo Giunchedi: elasticsearch partman and autoinstall [puppet] - https://gerrit.wikimedia.org/r/235893 (owner: Rush)
[14:16:23] (PS4) Filippo Giunchedi: elasticsearch partman and autoinstall [puppet] - https://gerrit.wikimedia.org/r/235893 (https://phabricator.wikimedia.org/T111080) (owner: Rush)
[14:18:31] (CR) Filippo Giunchedi: [C: 2 V: 2] elasticsearch partman and autoinstall [puppet] - https://gerrit.wikimedia.org/r/235893 (https://phabricator.wikimedia.org/T111080) (owner: Rush)
[14:23:39] (PS5) Filippo Giunchedi: New addition elasticsearch20[0-2][0-9] [dns] - https://gerrit.wikimedia.org/r/235906 (owner: Rush)
[14:24:09] operations: Change Google Webmaster password for noc@ - https://phabricator.wikimedia.org/T110951#1613933 (jcrespo) Hello, @Jalexander, This is probably my fault, but I do not fully understand completely the request here. If Jalexender is leaving, his email should be redirected to his manager, and from there...
[14:24:11] (CR) Filippo Giunchedi: "good catch Papaul, I've fixed the forward entries" [dns] - https://gerrit.wikimedia.org/r/235906 (owner: Rush)
[14:25:26] jynus, that ticket is because Philippe is leaving, not Jalexander
[14:25:48] their email is not connected to the noc@ login AFAIK
[14:26:57] (CR) Filippo Giunchedi: [C: 2 V: 2] New addition elasticsearch20[0-2][0-9] [dns] - https://gerrit.wikimedia.org/r/235906 (owner: Rush)
[14:29:28] Krenair, let them answer.
The "Jalexander is leaving" was a typo, already corrected
[14:30:09] operations, ops-codfw, Patch-For-Review: rack & initial setup of elastic2001-2024 - https://phabricator.wikimedia.org/T111080#1613936 (fgiunchedi)
[14:30:41] (PS1) Muehlenhoff: Add a Salt returner for local logging [debs/debdeploy] - https://gerrit.wikimedia.org/r/236549
[14:32:44] (PS1) Muehlenhoff: Add TODO file [debs/debdeploy] - https://gerrit.wikimedia.org/r/236550
[14:33:02] (CR) Muehlenhoff: [C: 2 V: 2] Add a Salt returner for local logging [debs/debdeploy] - https://gerrit.wikimedia.org/r/236549 (owner: Muehlenhoff)
[14:33:20] (CR) Muehlenhoff: [C: 2 V: 2] Add TODO file [debs/debdeploy] - https://gerrit.wikimedia.org/r/236550 (owner: Muehlenhoff)
[14:42:03] operations: Change Google Webmaster password for noc@ - https://phabricator.wikimedia.org/T110951#1613969 (jcrespo) p:Triage>Low
[14:55:42] operations, RESTBase-Cassandra, hardware-requests: codfw 3x spares for cassandra encryption testing - https://phabricator.wikimedia.org/T111382#1614009 (Papaul) @fgiunchedi wmf5846 and wmf5848 were never used since we moved them from Tampa. If we need to use them we need to go ahead and set and configu...
[14:57:40] operations, Phabricator, Database, Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1614010 (jcrespo) Since 2-3 September, the connections spikes are gone. So I had to ask: Who touched something? :-P Or was it an externally-created problem? My...
[15:01:03] 6operations, 10ops-codfw: provision wmf5846 and wmf5848 - https://phabricator.wikimedia.org/T111697#1614014 (10fgiunchedi) 3NEW a:3Papaul [15:01:31] (03PS1) 10Jcrespo: Repool es1004 after maintenance; pool es1018 for the first time [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236554 [15:01:37] 6operations, 10RESTBase-Cassandra, 10hardware-requests: codfw 3x spares for cassandra encryption testing - https://phabricator.wikimedia.org/T111382#1602749 (10fgiunchedi) @papaul thanks! I've filed {T111697} [15:02:23] 6operations, 10ops-codfw: provision wmf5846 and wmf5848 - https://phabricator.wikimedia.org/T111697#1614014 (10fgiunchedi) to clarify, this includes mgmt setup on the machine, mgmt dns entries, etc [15:03:06] (03CR) 10Jcrespo: [C: 032] Repool es1004 after maintenance; pool es1018 for the first time [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236554 (owner: 10Jcrespo) [15:05:39] Krenair, I see 2 undeployed commits from you on tin, were you going to deploy them? [15:05:55] (03PS3) 10Yuvipanda: labs: Allow NFS to be turned off on a per-instance level [puppet] - 10https://gerrit.wikimedia.org/r/236543 [15:05:55] 6operations, 10ops-codfw: provision wmf5846 and wmf5848 - https://phabricator.wikimedia.org/T111697#1614034 (10Papaul) Okay will work on this once at the DC tomorrow. [15:06:02] (03CR) 10Yuvipanda: [C: 032 V: 032] labs: Allow NFS to be turned off on a per-instance level [puppet] - 10https://gerrit.wikimedia.org/r/236543 (owner: 10Yuvipanda) [15:06:09] I wanted to sync an unrelated config change [15:06:22] jynus, where? [15:06:41] Change ukwikivoyage logo, take 2" [15:06:57] that was undeployed?? 
[15:07:12] Pretty sure I just did that one instantly [15:07:20] oh, maybe it was not merged [15:07:22] let me check [15:07:30] it looks merged to me [15:07:42] doesn't show up on "git log HEAD..origin/master" [15:07:50] 6operations: recommended ssh ciphers/kexalgorithms combination doesn't work for ilo - https://phabricator.wikimedia.org/T111698#1614039 (10fgiunchedi) 3NEW [15:07:51] yes it was, just it wasnt updated [15:07:58] wasn't updated? [15:07:59] so do not worry [15:08:31] the master branch on tin staging [15:08:35] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [15:08:46] it was on there [15:09:07] yes, there was some ordering issue, it was just that [15:09:18] some other change was undeployed [15:09:21] but mine was done [15:09:23] * Krenair goes afk [15:10:30] (03PS1) 10Alex Monk: beta apache config: Move wikipedia and wikibooks out of main.conf into their own files [puppet] - 10https://gerrit.wikimedia.org/r/236555 [15:10:32] (03PS1) 10Alex Monk: beta apache config: make wikipedia.conf more consistent with the other files [puppet] - 10https://gerrit.wikimedia.org/r/236556 [15:11:18] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool es1004, pool es1018 (duration: 00m 10s) [15:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:22:51] ori, re. ircecho/tcpircbot - I found that the way those are actually used in production relies on reading in from a file, rather than stdin [15:23:26] should still be easy [15:23:45] iit's unix, it's files all the way down [15:24:27] http://i.imgur.com/366vkX9.jpg [15:24:29] yeah, I'm expecting it to be, just annoying since my initial assumption was that it'd be stdin. and I had written code relying on that [15:24:53] not a great assumption. but still [15:26:31] (03CR) 10Alex Monk: "Please merge such changes on tin, even if you don't strictly need to deploy them because they only get run in labs." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/236206 (https://phabricator.wikimedia.org/T111267) (owner: 10Mattflaschen) [15:41:50] !log krenair@tin Synchronized php-1.26wmf21/extensions/MobileFrontend/includes/MobileFrontend.hooks.php: https://gerrit.wikimedia.org/r/#/c/236558/ (duration: 00m 12s) [15:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:44:01] phuedx, hey [15:45:07] phuedx, can you please make MF log to those two broken schemas? [15:45:59] (03CR) 10Hashar: "Thanks for the reviews! Commented inline, I will attempt to remove the nodepool-graceful-stop script by relying exclusively on systemd." (032 comments) [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/224390 (https://phabricator.wikimedia.org/T96867) (owner: 10Hashar) [16:02:05] phuedx, quite a number of entries have suddenly come in since that sync, so I think it's safe to assume it's solved the issue [16:02:54] phuedx, minding sending the appropriate emails to notify of the data loss? [16:02:55] 6operations, 10ops-codfw, 5Patch-For-Review: rack & initial setup of elastic2001-2024 - https://phabricator.wikimedia.org/T111080#1614175 (10fgiunchedi) still missing: * BIOS legacy boot setting (defaults to UEFI) * mac address entries in puppet `modules/install_server/files/dhcpd/linux-host-entries.ttyS1-11... [16:04:00] jynus: assigned. 
[16:04:06] thanks [16:04:40] 6operations, 10ContentTranslation-Deployments, 10MediaWiki-extensions-ContentTranslation, 5ContentTranslation-Release6, and 4 others: Review and create table for Content Translation - https://phabricator.wikimedia.org/T111317#1614182 (10KartikMistry) a:3jcrespo [16:10:04] (03PS1) 10Muehlenhoff: Use compound matching for minion targeting [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/236562 [16:11:27] Krenair: feel free to throw up even incomplete code on gerrit if you want feedback [16:18:20] 6operations, 7Monitoring: Refactor RAID checks (check-raid) - https://phabricator.wikimedia.org/T84050#1614206 (10jcrespo) [16:19:40] (03PS1) 10Alex Monk: beta apache config: fix instances of 'wikibooks' that were copy+pasted everywhere [puppet] - 10https://gerrit.wikimedia.org/r/236563 [16:21:13] (03PS2) 10Alex Monk: beta apache config: make wikipedia.conf more consistent with the other files [puppet] - 10https://gerrit.wikimedia.org/r/236556 [16:23:39] (03PS2) 10Alex Monk: beta apache config: Move wikipedia and wikibooks out of main.conf into their own files [puppet] - 10https://gerrit.wikimedia.org/r/236555 [16:24:22] (03PS3) 10Alex Monk: beta apache config: make wikipedia.conf more consistent with the other files [puppet] - 10https://gerrit.wikimedia.org/r/236556 [16:24:32] (03PS2) 10Alex Monk: beta apache config: fix instances of 'wikibooks' that were copy+pasted everywhere [puppet] - 10https://gerrit.wikimedia.org/r/236563 [16:33:24] PROBLEM - Host mr1-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [16:33:37] (don't worry about that) [16:37:24] RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 74.00 ms [16:38:13] (03CR) 10Muehlenhoff: [C: 032 V: 032] Use compound matching for minion targeting [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/236562 (owner: 10Muehlenhoff) [16:42:04] PROBLEM - Host mr1-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [16:44:29] gah [16:44:32] sorry krenair [16:44:38] just had a ~45 
minute long power cut [16:45:23] RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 74.11 ms [16:47:44] (03PS1) 10Alex Monk: beta apache config: more consistency for wiktionary and wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/236566 [16:47:46] (03PS1) 10Alex Monk: beta apache config: remove nonsensical rewrites to http://en.wikivoyage.org/wiki/Main_Page [puppet] - 10https://gerrit.wikimedia.org/r/236567 [16:47:50] phuedx: Is it really that bad, South of the river? [16:47:58] lol [16:48:17] haha [16:48:25] * James_F stays away from the lurgie. [16:48:29] Is it really that bad, South? [16:48:52] Deskana: Says the person living 10º South of London. ;-P [16:50:24] badness* is proportional to the square of distance south of the river [16:50:41] * how much like medieval england it is [17:04:14] PROBLEM - Host mr1-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [17:05:43] RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 75.03 ms [17:17:52] 6operations, 7Graphite: grafana access control - https://phabricator.wikimedia.org/T108546#1614298 (10jcrespo) [17:18:49] 6operations, 7Graphite: grafana access control - https://phabricator.wikimedia.org/T108546#1614300 (10jcrespo) p:5Triage>3Normal [17:27:56] 6operations, 7Availability, 7Technical-Debt: Make cookie blacklist in varnish text configuration less fragile - https://phabricator.wikimedia.org/T101857#1614316 (10jcrespo) [17:31:12] 6operations: sysctl::parameters don't take effect until next reboot (on Trusty at least) - https://phabricator.wikimedia.org/T109711#1614327 (10jcrespo) p:5Triage>3Low Low until @Andrew could add more information, confirming the issue. [17:34:28] 6operations, 6Performance-Team, 10Traffic: Split stats/metrics by cache cluster - https://phabricator.wikimedia.org/T109378#1614340 (10jcrespo) p:5Triage>3Normal Setting it to normal, but feel free to override me; this will probably will be a long term task. 
[17:38:44] 6operations, 6Labs, 5Patch-For-Review: labs salt master on jessie fails to install salt-master - https://phabricator.wikimedia.org/T110032#1614354 (10jcrespo) @Andrew did that last related patch fix the issue or does it require more work? [17:39:02] (03PS1) 10Yuvipanda: k8s: Use debian package for flannel [puppet] - 10https://gerrit.wikimedia.org/r/236572 [17:39:11] 6operations, 6Labs, 5Patch-For-Review: labs salt master on jessie fails to install salt-master - https://phabricator.wikimedia.org/T110032#1614355 (10jcrespo) p:5Triage>3Normal [17:39:14] (03PS2) 10Yuvipanda: k8s: Use debian package for flannel [puppet] - 10https://gerrit.wikimedia.org/r/236572 [17:39:57] (03PS1) 10Faidon Liambotis: Split cr1/cr2/mr1-ulsfo shared subnet into two [dns] - 10https://gerrit.wikimedia.org/r/236573 [17:39:59] (03PS1) 10Faidon Liambotis: Add public loopback IP for mr1-ulsfo [dns] - 10https://gerrit.wikimedia.org/r/236574 [17:40:01] (03PS1) 10Faidon Liambotis: Add IPv6 to mr1-ulsfo (and neighboring subnets) [dns] - 10https://gerrit.wikimedia.org/r/236575 [17:40:16] (03CR) 10Yuvipanda: [C: 032] k8s: Use debian package for flannel [puppet] - 10https://gerrit.wikimedia.org/r/236572 (owner: 10Yuvipanda) [17:40:33] (03CR) 10Faidon Liambotis: [C: 032] Split cr1/cr2/mr1-ulsfo shared subnet into two [dns] - 10https://gerrit.wikimedia.org/r/236573 (owner: 10Faidon Liambotis) [17:41:02] (03CR) 10Faidon Liambotis: [C: 032] Add public loopback IP for mr1-ulsfo [dns] - 10https://gerrit.wikimedia.org/r/236574 (owner: 10Faidon Liambotis) [17:42:00] (03PS2) 10Faidon Liambotis: Add IPv6 to mr1-ulsfo (and neighboring subnets) [dns] - 10https://gerrit.wikimedia.org/r/236575 [17:42:23] (03PS1) 10Yuvipanda: k8s: Don't make the master node a worker node as well [puppet] - 10https://gerrit.wikimedia.org/r/236577 [17:42:25] (03PS1) 10Yuvipanda: k8s: Drop outdate File require [puppet] - 10https://gerrit.wikimedia.org/r/236578 [17:42:29] (03CR) 10Faidon Liambotis: [C: 032] Add IPv6 to 
mr1-ulsfo (and neighboring subnets) [dns] - 10https://gerrit.wikimedia.org/r/236575 (owner: 10Faidon Liambotis) [17:42:50] 6operations, 7Monitoring, 5Patch-For-Review: Fix up icinga puppetization - https://phabricator.wikimedia.org/T110893#1614363 (10jcrespo) @Dzahn @Bblack Only the tmpfs is left here, right? I am happy to help! [17:43:17] (03CR) 10Yuvipanda: [C: 032] k8s: Don't make the master node a worker node as well [puppet] - 10https://gerrit.wikimedia.org/r/236577 (owner: 10Yuvipanda) [17:43:28] (03CR) 10Yuvipanda: [C: 032] k8s: Drop outdate File require [puppet] - 10https://gerrit.wikimedia.org/r/236578 (owner: 10Yuvipanda) [17:44:05] 6operations, 7Monitoring, 5Patch-For-Review: Fix up icinga puppetization - https://phabricator.wikimedia.org/T110893#1614364 (10JohnLewis) @jcrespo it seems and the manual swap increase form the default 1GB set in the lvm partman recipe to 8GB. [17:45:18] 6operations, 7Monitoring: Fix up icinga puppetization - https://phabricator.wikimedia.org/T110893#1614375 (10jcrespo) p:5Triage>3Normal [17:48:59] (03CR) 10Krinkle: [C: 031] "Let's get this out before the branch cut on Tuesday so we don't miss any data." 
[puppet] - 10https://gerrit.wikimedia.org/r/236024 (https://phabricator.wikimedia.org/T109756) (owner: 10Phedenskog) [17:51:13] 6operations, 10MediaWiki-extensions-ZeroPortal, 10Traffic, 6Zero, 5Patch-For-Review: zerofetcher in production is getting throttled for API logins - https://phabricator.wikimedia.org/T111045#1614392 (10jcrespo) p:5Triage>3Normal [17:51:30] 6operations, 10MediaWiki-extensions-ZeroPortal, 10Traffic, 6Zero: zerofetcher in production is getting throttled for API logins - https://phabricator.wikimedia.org/T111045#1592889 (10jcrespo) [18:12:33] (03PS1) 10Yuvipanda: tools: Add etcd role [puppet] - 10https://gerrit.wikimedia.org/r/236582 [18:14:10] 6operations, 10MediaWiki-extensions-ZeroPortal, 10Traffic, 6Zero: zerofetcher in production is getting throttled for API logins - https://phabricator.wikimedia.org/T111045#1614425 (10Krenair) We can add entries to wgRateLimitsExcludedIPs in mediawiki-config if there is a list of internal hosts which need i... [18:14:56] (03PS2) 10Yuvipanda: tools: Add etcd role [puppet] - 10https://gerrit.wikimedia.org/r/236582 [18:15:06] !log graceful’d apache, restarted keystone on labcontrol1001 [18:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:16:39] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Add etcd role [puppet] - 10https://gerrit.wikimedia.org/r/236582 (owner: 10Yuvipanda) [18:29:18] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972#1614454 (10yuvipanda) So I think we should bring this back up, for kubernetes in toollabs is going to be using etcd, and it should probably not be accessible to anyo... 
[18:33:46] 6operations, 7HHVM, 7Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#1614473 (10jcrespo) [18:33:48] 6operations, 6Labs, 10wikitech.wikimedia.org, 7HHVM: Move wikitech (silver) to HHVM - https://phabricator.wikimedia.org/T98813#1614471 (10jcrespo) 5Open>3stalled As there is some disagreement here, trying to be neutral here and just reflecting the current state of this task. [18:34:13] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 4 below the confidence bounds [18:34:20] (03PS1) 10Yuvipanda: tools: Do not poke holes in firewall for etcd [puppet] - 10https://gerrit.wikimedia.org/r/236590 [18:34:42] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Do not poke holes in firewall for etcd [puppet] - 10https://gerrit.wikimedia.org/r/236590 (owner: 10Yuvipanda) [18:36:56] 6operations, 10Traffic, 7RESTBase-API: ru.wikinews.org Parsoid backend is down - https://phabricator.wikimedia.org/T111715#1614481 (10Krenair) [18:37:26] (03PS1) 10Alex Monk: Remove seemingly-obsolete apache redirect for ruwikinews [puppet] - 10https://gerrit.wikimedia.org/r/236591 (https://phabricator.wikimedia.org/T111715) [18:38:47] (03CR) 10GWicke: [C: 031] Remove seemingly-obsolete apache redirect for ruwikinews [puppet] - 10https://gerrit.wikimedia.org/r/236591 (https://phabricator.wikimedia.org/T111715) (owner: 10Alex Monk) [18:39:45] 6operations: Do not apply spam headers on email assessed NOT to be spam - https://phabricator.wikimedia.org/T111595#1614492 (10jcrespo) [18:40:54] 6operations: Do not apply spam headers on email assessed NOT to be spam - https://phabricator.wikimedia.org/T111595#1614494 (10jcrespo) p:5Triage>3Normal From: T110761: > Thanks for the clarification. > > This matters because there are a few cases where Google's Spam filtering is not working very well. (... 
[18:41:05] (03CR) 10Kelson: [C: 031] Remove seemingly-obsolete apache redirect for ruwikinews [puppet] - 10https://gerrit.wikimedia.org/r/236591 (https://phabricator.wikimedia.org/T111715) (owner: 10Alex Monk) [18:43:33] (03PS3) 10Alex Monk: Redirect be-x-old.wikipedia.org to be-tarask.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/235943 (https://phabricator.wikimedia.org/T11823) [18:44:04] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds [18:47:17] 6operations, 10Traffic, 5Patch-For-Review, 7RESTBase-API: ru.wikinews.org Parsoid backend is down - https://phabricator.wikimedia.org/T111715#1614504 (10Krenair) a:3Krenair [18:50:04] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds [18:53:05] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [18:53:06] (03PS6) 10Ori.livneh: Collect missing Navigation Timing metrics [puppet] - 10https://gerrit.wikimedia.org/r/236024 (https://phabricator.wikimedia.org/T109756) (owner: 10Phedenskog) [18:53:13] (03CR) 10Ori.livneh: [C: 032 V: 032] Collect missing Navigation Timing metrics [puppet] - 10https://gerrit.wikimedia.org/r/236024 (https://phabricator.wikimedia.org/T109756) (owner: 10Phedenskog) [18:55:04] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [18:58:32] (03CR) 10GWicke: [C: 031] "The RESTBase change looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/235943 (https://phabricator.wikimedia.org/T11823) (owner: 10Alex Monk) [19:02:11] (03PS2) 10Alex Monk: beta apache config: remove nonsensical rewrites [puppet] - 10https://gerrit.wikimedia.org/r/236567 [19:12:20] YuviPanda: do you know where I can find apt.wm.o's GPG key? 
[19:12:27] https://github.com/wikimedia/operations-debs-wikimedia-keyring/tree/master/debian doesn't seem to contain it [19:12:28] valhallasw`cloud: no... [19:12:58] valhallasw`cloud: I suppose you could pick it out of a labs instance? [19:13:20] it's not in /usr/share/keyrings O_o [19:15:54] PROBLEM - puppet last run on mw2161 is CRITICAL: CRITICAL: puppet fail [19:16:56] "ssh tools-login apt-key export 09DBD9F93F6CD44A | sudo apt-key add -" [19:16:56] gah. [19:17:04] RECOVERY - Host cr1-eqord is UP: PING OK - Packet loss = 0%, RTA = 125.50 ms [19:18:04] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [19:20:55] 6operations, 6Phabricator, 7Database: phabricator metrics script should use slave, not master - https://phabricator.wikimedia.org/T111547#1614571 (10jcrespo) p:5Triage>3Low Low until it affects production. [19:21:24] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 28, down: 5, dormant: 0, excluded: 2, unused: 0BRfxp0: down - BRxe-0/0/0: down - Core: cr2-codfw:xe-5/2/1 Telia (IC-307236) {#?} [10Gbps DWDM]BRxe-1/0/0: down - Core: cr2-eqiad:xe-4/2/0 Telia (IC-307236) {#?} [10Gbps DWDM]BRxe-0/0/2: down - BRxe-0/0/3: down - BR [19:21:47] YuviPanda: can you add the output of apt-key export 09DBD9F93F6CD44A to some secure page on wikitech? [19:21:55] 6operations, 7Database, 5WMF-NDA: Upgrade db1022, which has an older kernel - https://phabricator.wikimedia.org/T101516#1614575 (10jcrespo) [19:22:08] 6operations, 7Database, 5WMF-NDA: Upgrade db1022, which has an older kernel - https://phabricator.wikimedia.org/T101516#1614577 (10jcrespo) 5Open>3stalled [19:22:40] valhallasw`cloud: is that the apt.wm.org key? 
[19:24:02] it made adminbot install locally :-p but it's not listed in apt-key lits which I don't understand [19:24:08] (03PS3) 10Alex Monk: tcpircbot: Also take input from files [puppet] - 10https://gerrit.wikimedia.org/r/236500 [19:24:26] 6operations, 6Labs, 7Database, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1614590 (10jcrespo) [19:24:29] valhallasw`cloud: https://wikitech.wikimedia.org/wiki/APT_repository/Key [19:24:59] (03CR) 10jenkins-bot: [V: 04-1] tcpircbot: Also take input from files [puppet] - 10https://gerrit.wikimedia.org/r/236500 (owner: 10Alex Monk) [19:26:28] 6operations, 10hardware-requests, 7Database, 5Patch-For-Review: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1614596 (10jcrespo) After several clones and failovers, the new nodes are working as masters, alongside the old nodes. I will now slowly start the slowly depooling of... [19:31:10] 6operations, 10MediaWiki-Database: Compress data at external storage - https://phabricator.wikimedia.org/T106386#1614610 (10jcrespo) I am soon going to resolve T105843#1614596. Please be prepared soon to recontinue this task! Please keep me in the loop, so that I can help, and provide a rollback plan in case... 
[19:31:44] PROBLEM - BGP status on cr1-eqord is CRITICAL: CRITICAL: No response from remote host 208.80.154.198 [19:33:54] PROBLEM - Host cr1-eqord is DOWN: CRITICAL - Network Unreachable (208.80.154.198) [19:35:30] (03CR) 10jenkins-bot: [V: 04-1] tcpircbot: Also take input from files [puppet] - 10https://gerrit.wikimedia.org/r/236500 (owner: 10Alex Monk) [19:36:28] (03CR) 10Gilles: [C: 031] Remove all files except README [apache-config] - 10https://gerrit.wikimedia.org/r/235222 (owner: 10Krinkle) [19:38:05] 6operations, 7Database: Upgrade db1022, which has an older kernel - https://phabricator.wikimedia.org/T101516#1614626 (10Krenair) [19:40:03] RECOVERY - Host cr1-eqord is UP: PING OK - Packet loss = 0%, RTA = 125.70 ms [19:40:51] (03PS5) 10Alex Monk: tcpircbot: Also take input from files [puppet] - 10https://gerrit.wikimedia.org/r/236500 [19:42:24] Krenair: are you merging tcpircbot and ircecho? [19:42:44] something like that [19:42:47] how'd you guess? [19:43:37] Krenair: well, you're touching both of those and the fact that they both exist has been bothering me for the longest time [19:43:51] Krenair: I wrote the 'ircyall' module / package to 'unify' them and of course that ended up fragmenting them instead [19:43:54] RECOVERY - puppet last run on mw2161 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [19:46:35] 6operations, 6Performance-Team: New URL scheme for service-generated thumbnails - https://phabricator.wikimedia.org/T111048#1614645 (10Gilles) [19:49:25] (03PS5) 10Nemo bis: Install Extension:Translate on labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: 10Ladsgroup) [19:49:33] (03CR) 10jenkins-bot: [V: 04-1] Install Extension:Translate on labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: 10Ladsgroup) [19:51:20] (03PS6) 10Nemo bis: Install Extension:Translate on labswiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: 10Ladsgroup) [19:51:26] (03CR) 10jenkins-bot: [V: 04-1] Install Extension:Translate on labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: 10Ladsgroup) [19:52:37] (03PS7) 10Nemo bis: Install Extension:Translate on labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: 10Ladsgroup) [19:52:41] (03CR) 10Nemo bis: "Rebased. (LOL.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: 10Ladsgroup) [19:56:54] 6operations, 6Performance-Team: New URL scheme for service-generated thumbnails - https://phabricator.wikimedia.org/T111048#1592958 (10Gilles) [19:59:05] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [19:59:10] (03PS2) 10Ori.livneh: Disable fss.so on all HHVM servers [puppet] - 10https://gerrit.wikimedia.org/r/236046 (https://phabricator.wikimedia.org/T101418) [20:00:22] (03CR) 10Ori.livneh: [C: 032 V: 032] Disable fss.so on all HHVM servers [puppet] - 10https://gerrit.wikimedia.org/r/236046 (https://phabricator.wikimedia.org/T101418) (owner: 10Ori.livneh) [20:00:25] * YuviPanda is pushing all of kubernetes onto gerrit [20:00:27] I hope it doesn't die [20:03:30] YuviPanda, is ircyall actually used anywhere? [20:03:55] Krenair: yes, by valhallasw`cloud's fabric scripts [20:03:58] unfortunately, nowhere else [20:04:24] you know there's a relevant xkcd for this [20:04:37] :) [20:05:06] I do [20:05:14] I don't mind it getting killed :) [20:12:34] * valhallasw`cloud eyes YuviPanda [20:12:53] I'll hack it to wm-bot at some point then :-p [20:13:45] you should! 
[20:13:50] oh wait [20:13:51] oooh [20:15:50] (03PS1) 10Merlijn van Deen: [WIP DO NOT MERGE] toollabs: replace package{} by require_package() [puppet] - 10https://gerrit.wikimedia.org/r/236616 [20:16:46] (03CR) 10jenkins-bot: [V: 04-1] [WIP DO NOT MERGE] toollabs: replace package{} by require_package() [puppet] - 10https://gerrit.wikimedia.org/r/236616 (owner: 10Merlijn van Deen) [20:21:59] (03CR) 10Ori.livneh: [C: 032 V: 032] Remove all files except README [apache-config] - 10https://gerrit.wikimedia.org/r/235222 (owner: 10Krinkle) [20:27:43] (03CR) 10Ori.livneh: tcpircbot: Also take input from files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/236500 (owner: 10Alex Monk) [20:31:48] (03CR) 10Ori.livneh: [C: 031] "Looks good; haven't tested." [puppet] - 10https://gerrit.wikimedia.org/r/229299 (https://phabricator.wikimedia.org/T107702) (owner: 10Dzahn) [20:47:23] 6operations, 10Beta-Cluster, 7Graphite, 7Shinken: Delete more specific deployment-prep graphite datapoints - https://phabricator.wikimedia.org/T111540#1614837 (10hashar) [20:48:37] 6operations, 10Beta-Cluster, 7Graphite, 7Shinken: Delete more specific deployment-prep graphite datapoints - https://phabricator.wikimedia.org/T111540#1606894 (10hashar) Seems the broad issue is to have a garbage collector on the labs graphite. A low hanging fruit would be to delete all metrics for instan... 
[20:53:10] (03PS6) 10Alex Monk: tcpircbot: Also take input from files [puppet] - 10https://gerrit.wikimedia.org/r/236500 [21:03:04] (03PS1) 10Merlijn van Deen: package_builder: require_packages and make ubuntu-friendly [puppet] - 10https://gerrit.wikimedia.org/r/236680 (https://phabricator.wikimedia.org/T111739) [21:03:52] (03CR) 10jenkins-bot: [V: 04-1] package_builder: require_packages and make ubuntu-friendly [puppet] - 10https://gerrit.wikimedia.org/r/236680 (https://phabricator.wikimedia.org/T111739) (owner: 10Merlijn van Deen) [21:10:08] FYI I am planning to deploy several cherry-picks to move to using a newer version of the Edit schema in a few minutes [21:12:09] (the EventLogging schema used by VE, WE, and soon MF) [21:26:48] (03PS2) 10Merlijn van Deen: package_builder: require_packages and make ubuntu-friendly [puppet] - 10https://gerrit.wikimedia.org/r/236680 (https://phabricator.wikimedia.org/T111739) [21:42:45] !log krenair@tin Synchronized php-1.26wmf21/extensions/WikiEditor: https://gerrit.wikimedia.org/r/#/c/236197/1 and https://gerrit.wikimedia.org/r/#/c/236679/ (duration: 00m 12s) [21:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:44:58] !log krenair@tin Synchronized php-1.26wmf21/extensions/WikimediaEvents/WikimediaEvents.php: https://gerrit.wikimedia.org/r/#/c/236196/1 (duration: 00m 12s) [21:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:45:45] !log krenair@tin Synchronized php-1.26wmf21/extensions/VisualEditor: https://gerrit.wikimedia.org/r/#/c/236682/ (duration: 00m 12s) [21:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:45:54] Whee. [21:46:09] James_F, do you know how long it should take for events to read the log database? [21:46:19] Krenair: A minute or so, I think. [21:46:20] Krenair: Can you confirm new data is going into the new table? 
[21:46:31] Oh, yup [21:46:38] It's just created the table [21:46:59] VE and WE both going in? [21:47:27] only wikitext editor so far... that was first though [21:47:33] * James_F nods. [21:47:45] And JS, not PHP. [21:47:47] So cached… [21:47:56] true [21:48:21] there we go, just saw a few VE entries [21:48:31] seems to be doing them in batches of 400 [21:49:39] Oh, nice. [21:53:08] I hope it's just waiting for each set of 400 to insert in one go rather than us going over some sort of limit :/ [22:37:55] (03PS1) 10Ori.livneh: Remove unused file `jobrunner.hhvm.hdf`. [puppet] - 10https://gerrit.wikimedia.org/r/236686 [22:38:14] (03CR) 10Ori.livneh: [C: 032 V: 032] Remove unused file `jobrunner.hhvm.hdf`. [puppet] - 10https://gerrit.wikimedia.org/r/236686 (owner: 10Ori.livneh) [22:43:52] (03PS1) 10Alex Monk: Fix restbase on test.wikimedia.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/236687 [22:49:05] PROBLEM - puppet last run on mw1142 is CRITICAL: CRITICAL: Puppet has 1 failures [22:51:04] RECOVERY - puppet last run on mw1142 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [22:52:23] PROBLEM - puppet last run on mw2172 is CRITICAL: CRITICAL: puppet fail [22:54:53] RECOVERY - BGP status on cr1-eqord is OK: OK: host 208.80.154.198, sessions up: 1, down: 0, shutdown: 0 [22:56:14] RECOVERY - puppet last run on mw2172 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [23:01:15] Jamesofur: i don't know who to poke, so using you as a proxy: typo at network plus double with at https://boards.greenhouse.io/wikimedia/jobs/84710?t=n6of1e [23:13:24] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (100430s 100000s) [23:34:43] http://bots.wmflabs.org/dump/%23wikimedia-operations.htm [23:34:43] @info 10.64.16.13 [23:34:43] Krinkle: [10.64.16.13: s2] db1024 [23:43:14] matanya: I'll let HR know :) [23:43:31] thanks