[00:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151110T0000). [00:01:59] bd808, you appear to be the only one... [00:02:19] oh look at the time [00:02:28] want to just do it? [00:02:39] yeah :/ [00:02:40] sure [00:04:03] (03PS3) 10BryanDavis: Monolog: Disable microsecond timestamps on all loggers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252017 (https://phabricator.wikimedia.org/T116550) [00:04:11] (03CR) 10BryanDavis: [C: 032] Monolog: Disable microsecond timestamps on all loggers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252017 (https://phabricator.wikimedia.org/T116550) (owner: 10BryanDavis) [00:04:34] (03Merged) 10jenkins-bot: Monolog: Disable microsecond timestamps on all loggers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252017 (https://phabricator.wikimedia.org/T116550) (owner: 10BryanDavis) [00:05:17] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:05:17] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 1.00% above the threshold [1000000.0] [00:06:56] !log bd808@tin Synchronized wmf-config/logging.php: Disable microsecond timestamps on all loggers (bf61bfc) (duration: 00m 34s) [00:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:08:15] looks good to me [00:08:31] bd808: I like how you managed to shred milliseconds here and there and even come up with a patch for upstream https://github.com/Seldaek/monolog/pull/658 :-} [00:09:17] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 5 below the confidence bounds [00:09:21] most of the credit goes to ori for finding the problem in the first place [00:09:46] team work!
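The SWAT patch deployed above (and the upstream monolog pull request it mentions) drops microsecond precision from log timestamps. The performance point is that once timestamps are only second-granular, a formatter can cache the rendered string and reuse it for every record logged within the same second instead of re-running strftime per record. A minimal Python sketch of that idea (the class and method names here are illustrative, not Monolog's actual API):

```python
from datetime import datetime

class TimestampFormatter:
    """Render log timestamps; without microseconds the rendered string
    can be cached and reused for every record in the same second."""

    def __init__(self, microseconds=True):
        self.microseconds = microseconds
        self._cached_second = None
        self._cached_text = None

    def format(self, dt):
        if self.microseconds:
            # Must re-render for every record: the %f part always changes.
            return dt.strftime('%Y-%m-%dT%H:%M:%S.%f')
        second = dt.replace(microsecond=0)
        if second != self._cached_second:
            self._cached_second = second
            self._cached_text = second.strftime('%Y-%m-%dT%H:%M:%S')
        return self._cached_text

f = TimestampFormatter(microseconds=False)
a = f.format(datetime(2015, 11, 10, 0, 6, 56, 123))
b = f.format(datetime(2015, 11, 10, 0, 6, 56, 999))
print(a, a is b)  # same second -> the cached render is reused
```

On a hot logging path this turns thousands of strftime calls per second into one per second, which is the kind of shaving the conversation is celebrating.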
[00:09:48] Thanks to him I've been trying to make logging less of a hotspot in the profiling graphs [00:10:02] also didn't I send you to bed hours ago? [00:10:29] packet loss [00:10:38] no, you ack'ed [00:10:41] another thing I wondered is if hhvm had some preprocessing macro [00:10:53] to get rid of wfDebug() calls for example [00:11:14] or maybe have hhvm rewrite that as a nil / nop function [00:11:21] Not that I've seen, and if it did it wouldn't be portable [00:11:29] MW is not a hack app [00:11:50] although it would be nice to kill off PHP 5.3 and 5.4 support [00:12:58] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 15 data above and 6 below the confidence bounds [00:13:38] !log Attached Richardelainechambers@commonswiki to the global account of the same name [00:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:14:11] * bd808 declares SWAT done [00:14:50] bd808: Thankfully, ops are still moving towards that happening! [00:15:21] yeah! Glad to see that moving forward again [00:15:55] Next stop, PHP7 [00:16:26] having helped with hhvm I'm not excited to start the migration soon [00:16:45] lots of extensions to rewrite yet again [00:16:57] and this time backwards compat is hard [00:17:01] (03CR) 10Dzahn: "the only diff is the motd file:" [puppet] - 10https://gerrit.wikimedia.org/r/251554 (owner: 10Dzahn) [00:17:57] anyway time to process with the sleep() queue. 
*wave* [00:18:12] (03PS2) 10Dzahn: smokeping: create a role class, use role keyword [puppet] - 10https://gerrit.wikimedia.org/r/251554 [00:18:45] Reedy, I demand an intermediary PHP6 support [00:18:53] MaxSem: glhf [00:21:13] (03CR) 10Dzahn: [C: 032] smokeping: create a role class, use role keyword [puppet] - 10https://gerrit.wikimedia.org/r/251554 (owner: 10Dzahn) [00:21:53] Just need tin/terbium stuff done completely, then we can bump to PHP 5.5 I guess [00:25:09] (03PS2) 10Dzahn: smokeping: add ferm rules for http to role [puppet] - 10https://gerrit.wikimedia.org/r/251557 (https://phabricator.wikimedia.org/T105410) [00:26:22] (03PS3) 10Dzahn: smokeping: add ferm rules for http to role [puppet] - 10https://gerrit.wikimedia.org/r/251557 (https://phabricator.wikimedia.org/T105410) [00:26:42] (03CR) 10Dzahn: [C: 032] "amended, only http here, no https" [puppet] - 10https://gerrit.wikimedia.org/r/251557 (https://phabricator.wikimedia.org/T105410) (owner: 10Dzahn) [00:26:44] Reedy: I *think* that's right. Dumps are still zend I think but on 5.6 or whatever we got in 14.04 [00:26:59] Isn't it 5.5 like silver/wikitech? [00:27:40] probably [00:27:57] dataset1001 has no php packages, if it's that [00:29:50] mutante: the snapshot servers are what I'm thinking about. I think they all get reimaged but there is an hhvm + bzip2 streams bug still that we haven't gotten fixed afaik [00:29:55] (03Abandoned) 10Dzahn: elasticsearch: move base::firewall to role [puppet] - 10https://gerrit.wikimedia.org/r/250078 (owner: 10Dzahn) [00:30:05] ja [00:30:35] there's a umstream patch but it needs tweaking for our hhvm build due to missing changes in between [00:30:40] *upstream [00:30:47] bd808: gotcha.. so in that case.. confirmed. 
it's 5.5.9 [00:56:02] (03PS2) 10Ori.livneh: Add perf-roots to webperf role (as part of I583d9a571) [puppet] - 10https://gerrit.wikimedia.org/r/252114 [00:56:24] (03CR) 10Ori.livneh: [C: 032] "I sincerely believe that this is logically entailed by approving T117256, so I'm going to go ahead and merge this." [puppet] - 10https://gerrit.wikimedia.org/r/252114 (owner: 10Ori.livneh) [00:56:41] (03CR) 10Ori.livneh: [V: 032] "I sincerely believe that this is logically entailed by approving T117256, so I'm going to go ahead and merge this." [puppet] - 10https://gerrit.wikimedia.org/r/252114 (owner: 10Ori.livneh) [01:13:15] (03CR) 10TTO: Enable import from any Beta Cluster project to another (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO) [01:13:17] (03CR) 10Krinkle: Make mysql-multiwrite use getInstance() factory spec (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249457 (owner: 10Aaron Schulz) [01:15:00] (03PS12) 10TTO: Enable cluster-wide import setup on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) [01:16:22] (03CR) 10MZMcBride: [C: 04-1] "As the person who's probably worked the most on www.wikimedia.org, I find it really distasteful that the "Discovery" team has come in and " [puppet] - 10https://gerrit.wikimedia.org/r/249009 (https://phabricator.wikimedia.org/T115964) (owner: 10MaxSem) [01:19:55] (03CR) 10Aaron Schulz: Make mysql-multiwrite use getInstance() factory spec (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249457 (owner: 10Aaron Schulz) [01:24:25] (03CR) 10Krinkle: [C: 031] $wmfUdp2logDest: replace IPs with hostnames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251647 (owner: 10Ori.livneh) [01:27:28] (03CR) 10Krinkle: "Do all consumers support omission of port? E.g. EventLogging and Monolog legacy handler. 
It seems Monolog LegacyHandler in mediawiki/core " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251647 (owner: 10Ori.livneh) [01:33:32] 6operations, 10OTRS: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1795002 (10Rjd0060) And more specifically, I (and others) report slow logins and delays when loading certain queues. [01:37:50] (03CR) 10Krinkle: Make mysql-multiwrite use getInstance() factory spec (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249457 (owner: 10Aaron Schulz) [02:10:57] (03PS1) 10Dzahn: peopleweb: add motd message about public_html [puppet] - 10https://gerrit.wikimedia.org/r/252150 [02:13:06] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [100000000.0] [02:15:58] (03PS1) 10Dzahn: peopleweb: backup home directories [puppet] - 10https://gerrit.wikimedia.org/r/252152 (https://phabricator.wikimedia.org/T116992) [02:16:13] (03CR) 10Dzahn: [C: 032] peopleweb: add motd message about public_html [puppet] - 10https://gerrit.wikimedia.org/r/252150 (owner: 10Dzahn) [02:16:57] (03CR) 10Dzahn: [C: 032] peopleweb: backup home directories [puppet] - 10https://gerrit.wikimedia.org/r/252152 (https://phabricator.wikimedia.org/T116992) (owner: 10Dzahn) [02:19:44] (03PS1) 10Dzahn: peopleweb: remove role from terbium [puppet] - 10https://gerrit.wikimedia.org/r/252153 (https://phabricator.wikimedia.org/T116992) [02:21:08] ACKNOWLEDGEMENT - puppet last run on terbium is CRITICAL: CRITICAL: puppet fail daniel_zahn migration of peopleweb [02:22:30] !log l10nupdate@tin Synchronized php-1.27.0-wmf.5/cache/l10n: l10nupdate for 1.27.0-wmf.5 (duration: 06m 49s) [02:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:24:57] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [02:25:19] (03PS2) 10Dzahn: misc: add rutherfordium and point people.wm.o to it 
[puppet] - 10https://gerrit.wikimedia.org/r/251115 (https://phabricator.wikimedia.org/T116992) (owner: 10John F. Lewis) [02:25:51] (03CR) 10Dzahn: [C: 032] misc: add rutherfordium and point people.wm.o to it [puppet] - 10https://gerrit.wikimedia.org/r/251115 (https://phabricator.wikimedia.org/T116992) (owner: 10John F. Lewis) [02:31:12] !log switched people.wm.org to new backend. please use "rutherfordium.eqiad.wmnet" instead of terbium now. all files have been rsynced already [02:32:17] (03PS2) 10Dzahn: peopleweb: remove role from terbium [puppet] - 10https://gerrit.wikimedia.org/r/252153 (https://phabricator.wikimedia.org/T116992) [02:33:01] (03CR) 10Dzahn: [C: 032] peopleweb: remove role from terbium [puppet] - 10https://gerrit.wikimedia.org/r/252153 (https://phabricator.wikimedia.org/T116992) (owner: 10Dzahn) [02:48:48] (03PS1) 10Dzahn: peopleweb: update Apache config to 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/252154 (https://phabricator.wikimedia.org/T116992) [02:49:14] (03PS2) 10Dzahn: peopleweb: update Apache config to 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/252154 (https://phabricator.wikimedia.org/T116992) [02:49:54] (03CR) 10Dzahn: [C: 032] peopleweb: update Apache config to 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/252154 (https://phabricator.wikimedia.org/T116992) (owner: 10Dzahn) [02:53:48] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [10.0] [02:53:48] PROBLEM - Kafka Broker Server on kafka1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [02:53:53] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [10.0] [02:54:16] got paged because of the kafka thing [02:54:17] looks [02:54:27] hmm [02:54:39] I'm not sure what to do here [02:55:12] !log tried to start kafka on kafka1018 [02:55:16] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:55:20] hi [02:55:20] ottomata: hi [02:55:43] got pinged! am looking, what's up? [02:55:44] root@kafka1018:~# service kafka status [02:56:05] ottomata: afka Broker Server on kafka1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java [02:56:15] yeah, it just died? [02:56:15] but: [02:56:16] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [10.0] [02:56:17] Active: active (running) [02:56:28] ugh, there's another one now [02:56:36] did you start yet? [02:56:51] that one just started, might have been puppet if it wasn't you [02:56:56] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 9.09% of data above the critical threshold [500.0] [02:57:04] i did not touch kafka1022 yet [02:57:12] and on 1018..it's like the status was already running [02:57:17] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [10.0] [02:57:57] they are all restarting? [02:57:58] 1022? [02:58:00] oh that's fine [02:58:03] the status is "running" [02:58:04] no that's not restarting [02:58:06] without doing anything [02:58:25] Ok good :) popped in to see if otto was around, kafka paged fyi [02:58:28] that's another alert, that will always happen if a broker goes down, some of the partitions will be 'under replicated' [02:58:38] so why is it "0 processes" then [02:58:47] aye, thanks guys, not sure what's happening yet, looks ok though [02:58:47] hm [02:58:48] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [10.0] [02:58:50] mutante: yeah, on 1013? [02:58:51] hm [02:58:55] 0 processes is not ok [02:59:08] wait, mutante did you restart 1013 or 1018? 
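The page that fired here ("PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka ...") comes from a check_procs-style probe that counts processes by command line. A rough Python analog of that probe, assuming a Linux /proc filesystem (the function name is my own, not the Icinga plugin's):

```python
import os

def count_procs(substring):
    """Count running processes whose command line contains `substring`,
    roughly what the check_procs-based 'Kafka Broker Server' check does."""
    n = 0
    for pid in os.listdir('/proc'):
        if not pid.isdigit():
            continue
        try:
            # /proc/<pid>/cmdline is NUL-separated argv.
            with open(f'/proc/{pid}/cmdline', 'rb') as fh:
                cmdline = fh.read().replace(b'\0', b' ')
        except OSError:
            continue  # process exited while we were scanning
        if substring.encode() in cmdline:
            n += 1
    return n

count = count_procs('kafka.Kafka /etc/kafka/server.properties')
print('PROCS CRITICAL: 0 processes' if count == 0
      else f'PROCS OK: {count} process(es)')
```

Note the pattern lives only in this process's memory, not its command line, so the probe does not count itself — a classic pitfall with `pgrep -f` style matching.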
[02:59:11] i got pinged about 1013 [02:59:19] 1013 sent an SMS [02:59:31] ok yeah, not running on 1013 [02:59:31] but 1018 and 1022 and 1014 are reporting as well on IRC [02:59:33] i saw you typed 1018 [02:59:34] just not paging [02:59:38] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] [02:59:40] and it's all at the same time [02:59:42] oh ok, yeah, it's going to do that^ [02:59:49] under replicated is a consequence of a broker being down [02:59:59] the other brokers are saying 'hey! this partition isn't fully replicated at the moment' [03:00:07] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] [03:00:10] just 1013 pinged as 0 processes, right? [03:00:28] yes, and it crashed there [03:00:33] starting the right one [03:00:40] Nov 10 02:49:59 kafka1013 systemd[1]: kafka.service: main process exited, code=killed, status=6/ABRT [03:00:43] Nov 10 02:49:59 kafka1013 systemd[1]: Unit kafka.service entered failed state. [03:00:54] !log started kafka on kafka1013 [03:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:01:22] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] [03:01:23] ottomata: yes, only that one paged. i happened to see the icinga-wm output and that it was the same one [03:01:29] thought [03:01:48] ok [03:02:11] RECOVERY - Kafka Broker Server on kafka1013 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [03:02:24] there we go [03:02:33] hm, yeah, but, why? :) [03:03:22] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [100000000.0] [03:03:45] hm, ok mutante, did you happen to capture what service kafka status said?
i can't seem to find anything useful in logs [03:04:08] ottomata: the thing i pasted above with "entered failed state" , status=6/ABRT [03:04:23] nothing else? I thought i saw something about a pid log file or something [03:04:47] Main PID: 29576 (code=killed, signal=ABRT) [03:04:54] hm, i thought i saw something about some captured output [03:04:57] ok [03:05:10] i got nuthin! no clues as to why that happened [03:05:12] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [10.0] [03:05:41] Nov 10 02:49:59 kafka1013 systemd[1]: kafka.service: main process exited, code=killed, status=6/ABRT [03:05:57] ottomata: https://phabricator.wikimedia.org/P2296 [03:05:59] hm hm hm [03:06:12] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:06:31] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [10.0] [03:06:45] ottomata: it says /tmp/hs_err_pid29576.log [03:06:57] cool! [03:06:57] # A fatal error has been detected by the Java Runtime Environment: [03:06:57] # [03:06:57] # SIGSEGV (0xb) at pc=0x00007fa2ba7f0167, pid=29576, tid=140336724678400 [03:06:58] haha [03:06:59] yea, there's a bunch in that file [03:07:21] fatal error ..nice [03:07:42] Failed to write core dump. Core dumps have been disabled.
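ottomata's explanation above — one dead broker makes the *surviving* brokers report under-replicated partitions for every partition that had a replica on it — can be sketched as a small calculation over replica assignments. The topic name and assignments below are made up for illustration; the broker ids match the hosts in the log:

```python
def under_replicated(assignments, live_brokers):
    """Return partitions whose replica set includes a dead broker,
    mapped to the replicas that are still alive. This is why a single
    broker crash trips the 'Under Replicated Partitions' alert on
    several healthy brokers at once."""
    return {
        partition: [b for b in replicas if b in live_brokers]
        for partition, replicas in assignments.items()
        if any(b not in live_brokers for b in replicas)
    }

# Hypothetical 3-way replicated topic spread over the eqiad brokers.
assignments = {
    ('webrequest', 0): [1013, 1018, 1022],
    ('webrequest', 1): [1014, 1020, 1012],
    ('webrequest', 2): [1022, 1013, 1014],
}
live = {1012, 1014, 1018, 1020, 1022}  # kafka1013 just crashed

print(under_replicated(assignments, live))
```

Partition 1 has no replica on 1013, so it stays fully replicated; partitions 0 and 2 are reported as under-replicated by their surviving replicas (1018, 1022, 1014), which matches the flood of alerts from brokers that were themselves healthy.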
[03:08:41] yeah, hm, ok [03:09:33] looks like some crazy segfault during garbage collections [03:09:36] collection [03:09:37] hm [03:10:56] !log kafka preferred-replica-election to bring kafka1013 back as leader for its partitions [03:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:11:31] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1012 is OK: OK: Less than 1.00% above the threshold [1.0] [03:11:32] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 1.00% above the threshold [1.0] [03:12:11] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK: OK: Less than 1.00% above the threshold [1.0] [03:12:21] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1014 is OK: OK: Less than 1.00% above the threshold [1.0] [03:12:42] !log restarted statsv on hafnium just in case it stopped after kafka1013 broker crash and restart (ping ori :) ) [03:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:12:52] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1018 is OK: OK: Less than 1.00% above the threshold [1.0] [03:18:31] PROBLEM - salt-minion processes on labtestcontrol2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [03:22:16] mutante: thanks for responding [03:22:31] gonna close compy, [03:22:44] ottomata: welcome. same here. ttyl [03:24:03] nighters! [03:24:12] RECOVERY - salt-minion processes on labtestcontrol2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:27:14] 6operations, 5Patch-For-Review: Move people.wikimedia.org off terbium, to somewhere open to all prod shell accounts? - https://phabricator.wikimedia.org/T116992#1795075 (10Dzahn) see changes above. it's been switched to rutherfordium. all files have been rsynced, i mailed ops and wikitech, logged to SAL and... 
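The preferred-replica-election logged above hands each partition's leadership back to its "preferred" leader — in Kafka, simply the first broker in the partition's replica assignment list — once that broker is healthy again. A toy model of that rule (assignments are made up; real elections go through the controller and also require the preferred replica to be in sync):

```python
def elect_preferred_leaders(assignments, live_brokers):
    """Pick each partition's leader: the first replica in the assignment
    list that is currently alive. With all brokers up, this is exactly
    the preferred leader."""
    leaders = {}
    for partition, replicas in assignments.items():
        alive = [b for b in replicas if b in live_brokers]
        leaders[partition] = alive[0] if alive else None
    return leaders

assignments = {
    ('webrequest', 0): [1013, 1018, 1022],  # 1013 is the preferred leader
    ('webrequest', 1): [1014, 1020, 1012],
}

# While 1013 was down, leadership for partition 0 fell to 1018; after the
# restart, a preferred-replica-election moves it back to 1013.
print(elect_preferred_leaders(assignments, {1012, 1014, 1018, 1020, 1022}))
print(elect_preferred_leaders(assignments, {1012, 1013, 1014, 1018, 1020, 1022}))
```

Without the election, the restarted broker would hold no leaderships and the cluster's load would stay skewed onto the brokers that picked up its partitions.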
[03:27:39] 6operations: Move people.wikimedia.org off terbium, to somewhere open to all prod shell accounts? - https://phabricator.wikimedia.org/T116992#1795076 (10Dzahn) [03:28:10] 6operations: Move people.wikimedia.org off terbium, to somewhere open to all prod shell accounts? - https://phabricator.wikimedia.org/T116992#1795077 (10Dzahn) 5Open>3Resolved [03:30:18] (03PS1) 10Dzahn: peopleweb: remove migration role from rutherfordium [puppet] - 10https://gerrit.wikimedia.org/r/252157 [03:31:11] (03PS2) 10Dzahn: peopleweb: remove migration role from rutherfordium [puppet] - 10https://gerrit.wikimedia.org/r/252157 [03:31:22] (03CR) 10Dzahn: [C: 032] peopleweb: remove migration role from rutherfordium [puppet] - 10https://gerrit.wikimedia.org/r/252157 (owner: 10Dzahn) [04:28:52] RECOVERY - Outgoing network saturation on labstore1001 is OK: OK: Less than 10.00% above the threshold [75000000.0] [04:38:48] (03CR) 10Krinkle: [C: 04-1] $wmfUdp2logDest: replace IPs with hostnames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251647 (owner: 10Ori.livneh) [05:03:22] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [5000000.0] [05:09:42] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0] [05:18:23] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0] [05:19:11] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [05:29:41] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL: CRITICAL: 13.79% of data above the critical threshold [100000000.0] [06:16:41] 6operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#1795140 (10Peter) @ori yes I emailed Pat, @BBlack on cc, see if we could find the right people to talk to. 
[06:20:23] PROBLEM - salt-minion processes on labtestcontrol2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [06:28:03] RECOVERY - salt-minion processes on labtestcontrol2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [06:31:22] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:51] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:02] PROBLEM - puppet last run on db1055 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:33] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:52] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:53] PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:01] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:01] PROBLEM - puppet last run on es2002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:02] PROBLEM - puppet last run on elastic1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:22] PROBLEM - puppet last run on protactinium is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:22] PROBLEM - puppet last run on ms-be2013 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:01] PROBLEM - puppet last run on db2034 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:02] PROBLEM - puppet last run on db2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:12] PROBLEM - puppet last run on wtp1010 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:51] PROBLEM - puppet last run on mw1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:52] PROBLEM - puppet last run on mw1260 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:32] PROBLEM - puppet last run on mw2020 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:12] PROBLEM - puppet last run on mw2057 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:22] 
PROBLEM - puppet last run on mw1062 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:33] PROBLEM - puppet last run on mw2046 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:33] PROBLEM - puppet last run on mw1136 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:11] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:48:02] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [06:52:31] RECOVERY - puppet last run on protactinium is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:54:22] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:55:33] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0] [06:55:42] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:56:33] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:56:52] RECOVERY - puppet last run on wtp1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:23] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:59:11] RECOVERY - Outgoing network saturation on labstore1001 is OK: OK: Less than 10.00% above the threshold [75000000.0] [07:04:32] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 3 below the confidence bounds [07:08:21] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 4 below the confidence bounds [07:21:12] RECOVERY - puppet last run on db1055 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [07:23:41] RECOVERY - 
puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:23:53] RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [07:24:31] RECOVERY - puppet last run on ms-be2013 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [07:25:12] RECOVERY - puppet last run on mw2057 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:25:32] RECOVERY - puppet last run on mw2046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:25:33] RECOVERY - puppet last run on mw1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:26:02] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [07:26:18] (03PS5) 10KartikMistry: WIP: service-runner migration for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657) [07:27:12] RECOVERY - puppet last run on mw1062 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [07:28:01] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [07:29:23] RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [07:33:01] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds [07:42:23] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds [07:55:53] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [08:07:02] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: 
Anomaly detected: 10 data above and 7 below the confidence bounds [08:07:23] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0] [08:25:10] 6operations, 10OTRS: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1795203 (10akosiaris) **TL;DR** It is to be expected, has nothing to do with the software or the VM but with the "test" nature of that install. Such problems will not be present in the production inst... [08:26:41] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: puppet fail [08:29:48] 6operations, 10OTRS: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1795219 (10Krd) Understood so far, but having that we cannot test functions that use expensive database queries when they run into timeouts, besides that it's not a lot of fun to do tests at all on slow... [08:33:05] 6operations, 10OTRS: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1795224 (10akosiaris) >>! In T74109#1795219, @Krd wrote: > Understood so far, but having that we cannot test functions that use expensive database queries when they run into timeouts, besides that it's... [08:34:26] (03CR) 10Florianschmidtwelzow: [C: 04-1] Enable RelatedArticles on beta labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252022 (https://phabricator.wikimedia.org/T116707) (owner: 10Bmansurov) [08:35:03] RECOVERY - puppet last run on db2034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:39:43] 6operations, 10OTRS: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1795228 (10Krd) Some of the admin interface views did, sadly not reproduceable at the moment. 
[08:44:52] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [08:53:11] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [08:53:52] RECOVERY - puppet last run on db2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:54:41] RECOVERY - puppet last run on elastic1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:54:51] RECOVERY - puppet last run on es2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:55:22] RECOVERY - puppet last run on mw2020 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [09:23:03] 6operations, 7Database: Bug on MariaDB use_stat_tables - https://phabricator.wikimedia.org/T118079#1795251 (10jcrespo) As per the bug, hot-patch the current installation updating tables to binary format or upgrade to 10.0.23. [09:28:18] (03PS1) 10Nemo bis: [Planet Wikimedia] Add Kunal Meta/legoktm to English Planet [puppet] - 10https://gerrit.wikimedia.org/r/252170 [09:37:39] (03PS1) 10Jcrespo: Depool db1035 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252171 [09:38:25] (03CR) 10Jcrespo: [C: 032] Depool db1035 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252171 (owner: 10Jcrespo) [09:39:54] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1035 for maintenance (duration: 00m 36s) [09:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:46:07] !log swift codfw-prod: ms-be2017 / ms-be2019 / ms-be2021 weight 2000 [09:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:54:19] (03PS1) 10Muehlenhoff: Add patch for CVE-2015-5307 (not merged upstream yet) [debs/linux] - 10https://gerrit.wikimedia.org/r/252174 [09:58:36] (03PS3) 10Bmansurov: Use CirrusSearch API in RelatedArticles on beta 
labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252022 (https://phabricator.wikimedia.org/T116707) [10:00:45] !log performing mysql maintenance of db1035 (depooled from production). Replication delay and reboots are expected. [10:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:11:43] (03PS1) 10Giuseppe Lavagetto: maintenance: run at most one wikidata dispatchChanges instance at a time [puppet] - 10https://gerrit.wikimedia.org/r/252179 (https://phabricator.wikimedia.org/T118162) [10:11:58] <_joe_> jynus, godog, akosiaris, please take a look ^^ [10:12:50] <_joe_> and whoever else is around, I don't think we ever used run-once before [10:13:52] aude ^ [10:14:16] it might be problematic :/ [10:14:36] <_joe_> aude: it's problematic now [10:14:43] i know [10:14:52] we have one dba and it's keeping him busy, so it's a bit of a crisis [10:14:52] <_joe_> aude: let me show you how many db connection the thing is making [10:14:54] this really should use the job queue [10:15:01] <_joe_> aude: yes [10:15:05] <_joe_> this really should [10:16:10] <_joe_> root@mw1152:~# netstat -tuap | grep :mysql | grep php | wc -l [10:16:10] <_joe_> 4115 [10:16:18] <_joe_> this is worse than unacceptable [10:16:22] :( [10:16:24] <_joe_> even if we had 5 dbas [10:16:35] it is not a crisis [10:16:41] I am concerned [10:16:47] but it is not causing problems [10:17:08] however, it may be one of the reasons of connection problems on db1035 [10:17:11] i suppose we could try to run these a little more infrequently and monitor dispatching [10:17:13] <_joe_> jynus: it's just unacceptable and wrong, and we have 8 instances of this always running at the same time [10:17:32] <_joe_> aude: no we should move this to the jobqueue, period [10:17:34] I agree, I am clarifying that it is not an outage [10:17:55] <_joe_> oh if it was an outage I would've simply yanked the cron :) [10:18:00] <_joe_> err deleted [10:18:12] but I believe it is creating issues 
both in disk space and connections [10:18:36] (open transactions make the UNDO space grow) [10:18:48] <_joe_> aude: also, who is owning this cronjob? is anyone? [10:18:49] _joe_: it's not something i can do instantly but can raise to highest priority [10:19:03] <_joe_> how can I get the attention of someone on this on phabricator? [10:19:20] <_joe_> I might not have set all the right labels [10:19:24] i think Lydia_WMDE already knows [10:19:40] thanks aude [10:19:44] hey [10:19:45] It was creating 1700 connections to db1035, that is why I was alarmed [10:19:48] <_joe_> yes, thanks a lot :) [10:19:57] <_joe_> jynus: it's creating more than 4k connections [10:20:20] (03CR) 10Filippo Giunchedi: [C: 04-1] maintenance: run at most one wikidata dispatchChanges instance at a time (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/252179 (https://phabricator.wikimedia.org/T118162) (owner: 10Giuseppe Lavagetto) [10:20:22] yeah, but over 300 hosts is not as alarming as making one host fail to connect [10:20:27] <_joe_> Lydia_WMDE: the dispatchChanges scripts for wikidata are creating very serious issues as they are [10:20:31] i would try making the jobs less frequent as we can probably cope [10:20:34] yes we'll put it in the next sprint. until then we need to keep it running at least in some form because otherwise wikipedians will be extremely upset because they don't get notified about changes and pages don't get purged for new data [10:20:54] let's make them less frequent as a next step [10:20:57] maybe */4 or */5 [10:21:20] <_joe_> godog: root@mw1152:~# which run-one [10:21:20] <_joe_> /usr/bin/run-one [10:21:55] oh! my mistake [10:22:26] aude: will you make the change? [10:22:29] <_joe_> aude: that would reduce the number of instances of it by one [10:22:34] i'll look for the ticket for switching to job queue [10:22:35] <_joe_> Lydia_WMDE: I'm on it [10:22:51] <_joe_> Lydia_WMDE: I opened an UBN!
ticket for this [10:23:04] <_joe_> https://phabricator.wikimedia.org/T118162 [10:23:08] _joe_: 1 will probably not be enough [10:23:26] <_joe_> Lydia_WMDE: the ones we have now are not enough apparently [10:23:47] <_joe_> since they all run until their programmed timeout [10:23:59] we could shorten --max-time [10:24:05] https://phabricator.wikimedia.org/T48643 [10:24:09] <_joe_> aude: that's my alternative, yes [10:24:33] let's try that and then this definitely needs to be high priority to move to job queue [10:26:05] (03CR) 10Filippo Giunchedi: maintenance: run at most one wikidata dispatchChanges instance at a time (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/252179 (https://phabricator.wikimedia.org/T118162) (owner: 10Giuseppe Lavagetto) [10:26:16] _joe_: ah yeah, ubuntu-only thing [10:26:33] <_joe_> godog: I am going to change the logic anyways [10:26:45] <_joe_> aude: what does dispatchChanges do exactly? [10:26:53] <_joe_> because in the puppet manifest I see [10:26:55] <_joe_> # This handles inserting jobs into client job queue, which then process the changes [10:27:11] <_joe_> so we are already inserting the jobs in the jobqueue? [10:27:21] <_joe_> why with a script and not a hook on edits? [10:28:25] _joe_: the script predates having delayed jobs etc. [10:28:45] <_joe_> aude: so that comment is wrong? [10:28:46] the changes get coalesced and other stuff [10:29:07] jobqueue = client jobqueue (so correct) [10:29:40] <_joe_> what is the client jobqueue? [10:29:48] <_joe_> I fear I miss some terms here [10:29:49] e.g. jobqueue of wikipedia [10:29:53] <_joe_> ok [10:29:59] <_joe_> thanks :) [10:30:08] or other wikis that reference wikidata items (eg in an infobox) [10:30:21] so the infobox gets updated when the item changes [10:30:53] obviously a cronjob is not a good solution now [10:30:56] <_joe_> and now we enqueue one job per wiki, is that correct?
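The run-one discussion above boils down to a mutual-exclusion guard around the cron command. A minimal sketch with flock(1) from util-linux, which is portable where Ubuntu's run-one is not (the lock path is illustrative, not the real cron entry):

```shell
# Assumption: flock(1) is available; the lock path is a demo path.
LOCK=/tmp/dispatch.demo.lock

# First "instance" grabs the lock and holds it briefly.
flock -n "$LOCK" sh -c 'sleep 1' &

sleep 0.2  # give the first instance time to acquire the lock

# Second "instance" uses -n (non-blocking): instead of piling up
# behind the lock like stray dispatchers, it just gives up.
if flock -n "$LOCK" true; then
  second="ran"
else
  second="skipped"
fi
echo "second instance: $second"
wait
```

For the "at most three" variant merged later in the log, the same idea extends to three numbered lock files tried in turn, taking the first one that is free.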
[10:31:21] we made some steps to refactor the code and i think moving to the job queue might be more possible now (but non-trivial) [10:31:41] it's one job per wiki (for coalesced list of changes) [10:32:01] we know on wikidata, which wikis are subscribed to an item [10:32:28] so can decide which wikis get notified about a particular change (to the item) [10:32:57] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, 5Patch-For-Review: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1795305 (10jcrespo) Probably related to T107072, too. [10:33:22] <_joe_> oh [10:33:39] <_joe_> so yeah instead of this maintenance script, that would be better [10:33:50] yeah [10:34:20] jobs in the wikidata jobqueue (post edit) can do what the script does now [10:34:27] "it's one job per wiki" the problem is that there are 900 wikis on s3 [10:34:36] i see [10:34:50] that doesn't work [10:35:13] i'd have to look at the code again (or daniel look), but think maybe this can be reworked some [10:35:15] plus, connections are held while idling [10:35:30] they are not executing queries, they are in Sleep mode [10:35:34] that is a no-go [10:35:35] :( [10:36:02] (03CR) 10Hoo man: [C: 04-1] "I doubt one instance is enough during peak times. Also I don't want to get paged about this more often..." [puppet] - 10https://gerrit.wikimedia.org/r/252179 (https://phabricator.wikimedia.org/T118162) (owner: 10Giuseppe Lavagetto) [10:36:09] a connection should not be open for more than seconds- otherwise a database may be depooled or lagged [10:36:15] probably fixing this is more critical than simply moving to the jobqueue [10:36:34] <_joe_> where is hoo? 
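The "*/4 or */5" shorthand earlier is cron's step syntax: run on every fourth or fifth minute instead of every minute. An illustrative system crontab line (the user, wrapper path, and the 240-second value are hypothetical; only the --max-time flag itself comes from the discussion):

```
# m  h  dom mon dow  user      command
*/5  *  *   *   *    www-data  /usr/local/bin/dispatch-changes --max-time 240
```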
[10:36:46] <_joe_> he doesn't want to get paged, that's a shared feeling [10:37:48] (03PS2) 10Giuseppe Lavagetto: maintenance: run at most three wikidata dispatchChanges instance at a time [puppet] - 10https://gerrit.wikimedia.org/r/252179 (https://phabricator.wikimedia.org/T118162) [10:37:51] so in order for him not to get paged, we have to bring down 900 wikis, so we all do? [10:39:00] (03CR) 10Giuseppe Lavagetto: "@hoo: I was already asked to let at least 3 run in parallel. I would say that something that makes 4K constant connections to databases (t" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/252179 (https://phabricator.wikimedia.org/T118162) (owner: 10Giuseppe Lavagetto) [10:39:26] <_joe_> jynus: I was a bit less explicit, but said the same :P [10:41:44] <_joe_> aude: so now I reduce the workers to 3, I hope that's enough [10:41:49] <_joe_> let's see what happens [10:42:32] PROBLEM - puppet last run on db2043 is CRITICAL: CRITICAL: puppet fail [10:42:39] i don't know if it is enough but think worth a try [10:43:40] <_joe_> ok let's do it, we can raise the number afterwards if strictly needed [10:43:46] ok [10:43:54] <_joe_> and if I get a clear commitment on a date in which this will be fixed [10:44:04] <_joe_> else, we'll have to live with the lag for now [10:44:23] * aude just got back from ~18 hours on airplane and daniel is not around this week [10:44:31] <_joe_> aude: ah, sorry :( [10:44:31] RECOVERY - puppet last run on db2043 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:44:56] <_joe_> aude: and thanks for working on this after a plane ride; please rest :) [10:44:58] so can't promise to work on this today but maybe later this week or next sprint (next week) as lydia says [10:45:04] heh [10:45:04] yeah [10:45:14] <_joe_> aude: I was searching for a commitment from Lydia_WMDE actually :D [10:45:28] i can promise to put it in the next sprint which starts in 1 week [10:46:14] <_joe_> well, I consider
this an UBN! priority issue, but since it's been around for so long, we can wait another week; and maybe deal with some dispatch lag [10:46:35] if katie gets around to it earlier that is fine with me [10:46:42] but daniel is gone until monday [10:46:49] so he can't help until then probably [10:47:03] <_joe_> heh, understaffed we are :) [10:47:11] indeed [10:47:28] <_joe_> ok lemme merge this [10:49:29] (03PS2) 10ArielGlenn: dumps: clean up many comments of methods for dumps jobs [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/252133 [10:49:31] (03PS2) 10ArielGlenn: dumps: clean up docstrings for recompress jobs [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/252134 [10:49:39] <_joe_> ok, I am going to merge the change now [10:49:43] (03CR) 10Giuseppe Lavagetto: [C: 032] maintenance: run at most three wikidata dispatchChanges instance at a time [puppet] - 10https://gerrit.wikimedia.org/r/252179 (https://phabricator.wikimedia.org/T118162) (owner: 10Giuseppe Lavagetto) [10:50:05] (03CR) 10Giuseppe Lavagetto: [C: 032] maintenance: run at most three wikidata dispatchChanges instance at a time [puppet] - 10https://gerrit.wikimedia.org/r/252179 (https://phabricator.wikimedia.org/T118162) (owner: 10Giuseppe Lavagetto) [10:50:15] <_joe_> stupid gerrit [10:51:32] (03CR) 10Jcrespo: "+1" [puppet] - 10https://gerrit.wikimedia.org/r/252179 (https://phabricator.wikimedia.org/T118162) (owner: 10Giuseppe Lavagetto) [10:54:07] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] mediawiki: rename jobqueue alert 'job-pop' in 'pops' [puppet] - 10https://gerrit.wikimedia.org/r/252180 (owner: 10Filippo Giunchedi) [10:59:51] 6operations, 10ops-codfw: es2010 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T117848#1795359 (10mark) If we do have spare disks that are not needed elsewhere then sure, let's send them. But make sure eqiad keeps sufficient spares itself. 
:) [10:59:59] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 2 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1795360 (10Lydia_Pintscher) [11:03:39] <_joe_> thanks Lydia_WMDE [11:03:42] <_joe_> :) [11:03:48] np [11:04:08] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 2 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1795369 (10jcrespo) To update the latest issues identified: * As this creates on... [11:22:54] 6operations, 10Traffic: Planning for phasing out non-Forward-Secret TLS ciphers - https://phabricator.wikimedia.org/T118181#1795404 (10Chmarkine) > We could start with one-off services that are more technical in nature, which normal users would rarely connect to and aren't critical to them, such as icinga.wiki... [11:23:42] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 2 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1795406 (10daniel) This script should open one connection per db cluster, not one p...
[11:28:58] (03PS1) 10Faidon Liambotis: apt: enable backports on Debian systems [puppet] - 10https://gerrit.wikimedia.org/r/252202 (https://phabricator.wikimedia.org/T107507) [11:29:05] moritzm: ^ [11:29:08] (03PS1) 10Giuseppe Lavagetto: maintenance: reduce the lightprocess count from cli [puppet] - 10https://gerrit.wikimedia.org/r/252203 [11:29:14] and whoever else is interested, obviously [11:32:32] (03PS1) 10Aude: Switch dispatcher subscriptionLookupMode to subscriptions only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252204 (https://phabricator.wikimedia.org/T112245) [11:33:45] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 2 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1795441 (10jcrespo) > the connections should be pooled We only pool connections... [11:36:28] (03CR) 10Muehlenhoff: [C: 04-1] apt: enable backports on Debian systems (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/252202 (https://phabricator.wikimedia.org/T107507) (owner: 10Faidon Liambotis) [11:36:51] (03PS2) 10Giuseppe Lavagetto: maintenance: reduce the lightprocess count from cli [puppet] - 10https://gerrit.wikimedia.org/r/252203 [11:37:01] (03CR) 10Giuseppe Lavagetto: [C: 032] maintenance: reduce the lightprocess count from cli [puppet] - 10https://gerrit.wikimedia.org/r/252203 (owner: 10Giuseppe Lavagetto) [11:39:02] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 2 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1795459 (10aaron) See "+channel:DBPerformance +message:"*connections made*"" at l... 
[11:39:09] (03PS2) 10Faidon Liambotis: apt: enable backports on Debian systems [puppet] - 10https://gerrit.wikimedia.org/r/252202 (https://phabricator.wikimedia.org/T107507) [11:39:10] moritzm: duh [11:39:16] sorry, mixing my hats [11:39:23] mirrors.wikimedia.org, I meant [11:39:28] (see PS2) [11:40:03] heh :-) [11:40:52] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 2 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1795463 (10aaron) I'll try to look into whether the problem is in core or not, th... [11:42:09] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/252202 (https://phabricator.wikimedia.org/T107507) (owner: 10Faidon Liambotis) [11:42:30] anyone else want to comment on backports before I merge? [11:43:55] <_joe_> paravoid: just let me take a look :) [11:44:20] alright :) [11:45:46] (03CR) 10Giuseppe Lavagetto: [C: 031] apt: enable backports on Debian systems [puppet] - 10https://gerrit.wikimedia.org/r/252202 (https://phabricator.wikimedia.org/T107507) (owner: 10Faidon Liambotis) [11:46:02] (03CR) 10Filippo Giunchedi: [C: 031] apt: enable backports on Debian systems [puppet] - 10https://gerrit.wikimedia.org/r/252202 (https://phabricator.wikimedia.org/T107507) (owner: 10Faidon Liambotis) [11:46:34] \o/ [11:46:39] (03CR) 10Faidon Liambotis: [C: 032] apt: enable backports on Debian systems [puppet] - 10https://gerrit.wikimedia.org/r/252202 (https://phabricator.wikimedia.org/T107507) (owner: 10Faidon Liambotis) [11:47:01] (03PS1) 10Muehlenhoff: Assign grain for impala master [puppet] - 10https://gerrit.wikimedia.org/r/252205 [11:47:06] <_joe_> paravoid: what comment_old does is kinda scary [11:47:20] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 2 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - 
https://phabricator.wikimedia.org/T118162#1795473 (10daniel) >>! In T118162#1795441, @jcrespo wrote: > We only pool connect... [11:47:33] _joe_: and also not terribly useful :-) [11:48:50] it's useful to comment what d-i sets up [11:49:09] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [11:49:15] indeed, didn't think of that [11:50:40] (03Abandoned) 10Muehlenhoff: impala: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/250038 (owner: 10Muehlenhoff) [11:55:00] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [11:56:52] <_joe_> dcausse: ping [12:05:00] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [12:05:50] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [12:06:59] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [12:15:06] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 2 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1795504 (10jcrespo) We can go into details of terminology, but reusing connection... [12:17:58] <_joe_> paravoid: ^^ your change needs merging :) [12:18:06] gah, sorry [12:18:14] done [12:18:39] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [12:19:29] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[12:20:36] updated ubuntu yesterday [12:20:54] on 15.10 now, with openssh 6.9p1 [12:21:38] I have my ssh config set up so I can just type certain server names, without needing the full .eqiad.wmnet/.codfw.wmnet etc. [12:21:55] or did... because "ssh tin" gives "ssh: Could not resolve hostname tin.eqiad.wmnet: Name or service not known" [12:22:00] while "ssh tin.eqiad.wmnet" works... [12:24:00] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [12:25:00] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 118, down: 0, dormant: 0, excluded: 1, unused: 0 [12:25:09] (03PS1) 10Filippo Giunchedi: restbase: raise cassandra pending compaction alerts [puppet] - 10https://gerrit.wikimedia.org/r/252207 [12:25:39] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 1, unused: 0 [12:26:38] 6operations, 6Labs, 10Labs-Infrastructure, 10netops, and 3 others: Allocate labs subnet in dallas - https://phabricator.wikimedia.org/T115491#1795538 (10faidon) @chasemp, is this done? 
[12:26:39] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [12:29:05] hmm, this looks related: https://bugzilla.mindrot.org/show_bug.cgi?id=2267 [12:29:31] (03CR) 10Phuedx: Use CirrusSearch API in RelatedArticles on beta labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252022 (https://phabricator.wikimedia.org/T116707) (owner: 10Bmansurov) [12:30:38] 6operations, 6Labs, 3Labs-Sprint-102, 3Labs-Sprint-103, and 5 others: Reinstall labstore1001 and make sure everything is puppet-ready - https://phabricator.wikimedia.org/T107574#1795543 (10faidon) [12:30:40] 6operations, 6Labs, 5Patch-For-Review: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1795540 (10faidon) 5Open>3Resolved a:3faidon The patch above got three +1s and was therefore merged. [12:32:54] (03PS1) 10Jcrespo: [WIP] Adding tendril::maintenance class [puppet] - 10https://gerrit.wikimedia.org/r/252208 [12:33:49] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Adding tendril::maintenance class [puppet] - 10https://gerrit.wikimedia.org/r/252208 (owner: 10Jcrespo) [12:33:59] (03PS2) 10Muehlenhoff: Assign grain for impala master [puppet] - 10https://gerrit.wikimedia.org/r/252205 [12:35:50] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign grain for impala master [puppet] - 10https://gerrit.wikimedia.org/r/252205 (owner: 10Muehlenhoff) [12:36:44] (03PS2) 10Jcrespo: [WIP] Adding tendril::maintenance class [puppet] - 10https://gerrit.wikimedia.org/r/252208 [12:37:32] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Adding tendril::maintenance class [puppet] - 10https://gerrit.wikimedia.org/r/252208 (owner: 10Jcrespo) [12:37:48] So adding "CanonicalizeHostname yes" to the top of my config fixes it [12:40:30] PROBLEM - puppet last run on rdb2001 is CRITICAL: CRITICAL: Puppet has 2 failures [12:47:42] (03PS3) 10Jcrespo: [WIP] Adding tendril::maintenance class [puppet] - 
10https://gerrit.wikimedia.org/r/252208 [12:48:54] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Adding tendril::maintenance class [puppet] - 10https://gerrit.wikimedia.org/r/252208 (owner: 10Jcrespo) [12:50:45] vim is trolling me and converting my spaces into tabs [12:50:54] will fix later [12:52:34] (03PS1) 10Muehlenhoff: Assign salt grain for heze [puppet] - 10https://gerrit.wikimedia.org/r/252211 [12:55:30] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1795557 (10fgiunchedi) I've started `nodetool decommission` yesterday morning on restbase2001, still going ``` restbase2001:/var/log/cassandra$ n... [12:58:56] (03PS1) 10Muehlenhoff: Move holmium and labservices1001 to the labsdns role [puppet] - 10https://gerrit.wikimedia.org/r/252212 [13:03:03] (03CR) 10Muehlenhoff: "All fine in puppet compiler:" [puppet] - 10https://gerrit.wikimedia.org/r/252212 (owner: 10Muehlenhoff) [13:03:09] PROBLEM - puppet last run on mw1076 is CRITICAL: CRITICAL: Puppet has 1 failures [13:03:28] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grain for heze [puppet] - 10https://gerrit.wikimedia.org/r/252211 (owner: 10Muehlenhoff) [13:05:59] RECOVERY - puppet last run on rdb2001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [13:06:00] PROBLEM - puppet last run on mw1036 is CRITICAL: CRITICAL: Puppet has 1 failures [13:08:38] 6operations: diamond doesn't gracefully handled elasticsearch failure - https://phabricator.wikimedia.org/T117461#1795590 (10RobH) a:3chasemp I've assigned this to Chase since he was going to take a look at it. If this ends up being incorrect, please place this back up for grabs (just triaging unassigned requ... 
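The ssh fix noted a bit earlier ("CanonicalizeHostname yes") works because recent OpenSSH releases only append search domains during canonicalization when it is explicitly enabled, which appears to be what the linked bugzilla 2267 is about. A sketch of the relevant ~/.ssh/config stanza (the domain list mirrors the .eqiad.wmnet/.codfw.wmnet suffixes mentioned in the log; adjust to taste):

```
# Resolve bare names like "tin" by trying these suffixes before
# matching Host blocks (options available since OpenSSH 6.5).
CanonicalizeHostname yes
CanonicalDomains eqiad.wmnet codfw.wmnet
CanonicalizeMaxDots 0
CanonicalizeFallbackLocal yes
```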
[13:15:30] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [13:16:51] 6operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#1795605 (10BBlack) [13:17:07] 6operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#1227458 (10BBlack) oops, my nginx version numbers were all off-by-one in the task description. fixed! [13:20:41] (03PS1) 10Giuseppe Lavagetto: instrumentation: add alerting endpoint [WiP] [debs/pybal] - 10https://gerrit.wikimedia.org/r/252214 [13:21:09] <_joe_> bblack: ^^ this is a simple implementation of an alerting endpoint [13:21:53] 6operations, 10hardware-requests, 5Patch-For-Review: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1795608 (10RobH) Please note that the two sub-tasks have now gotten quotes in, and they are in both the sub-task task descriptions, as well as a comparison on the... [13:24:18] (03CR) 10BBlack: [C: 031] "+1 for looks conceptually correct, didn't look deeply at the code itself really." [debs/pybal] - 10https://gerrit.wikimedia.org/r/252214 (owner: 10Giuseppe Lavagetto) [13:27:14] 6operations, 10hardware-requests, 5Patch-For-Review: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1795615 (10RobH) This was requested by a few folks, so ideally I'd like some folks to review these quotes on the sub-tasks and on the gsheet. I've reviewed them,... 
[13:30:20] RECOVERY - puppet last run on mw1076 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:31:20] RECOVERY - puppet last run on mw1036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:40:10] (03PS1) 10BBlack: Turn on instrumentation for pybal 1.12+ [puppet] - 10https://gerrit.wikimedia.org/r/252216 [13:40:41] (03PS12) 10BBlack: varnish: misspass limiter [puppet] - 10https://gerrit.wikimedia.org/r/241643 [13:42:18] (03CR) 10BBlack: [C: 032 V: 032] varnish: misspass limiter [puppet] - 10https://gerrit.wikimedia.org/r/241643 (owner: 10BBlack) [13:44:28] (03CR) 10RobH: [C: 031] "Looks good, this is slated for puppetswat today." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250471 (https://phabricator.wikimedia.org/T117410) (owner: 10Jforrester) [13:45:40] kart_: your proposed puppet swat patch (for today) failed code review. https://gerrit.wikimedia.org/r/#/c/252172/ Just FYI if it isn't fixed in time for puppetswat (2+ hours from now) it'll need to get shoved to Thursday's swat. [13:46:39] robh: which patch? [13:46:57] https://gerrit.wikimedia.org/r/#/c/252172/ [13:47:09] I have not added any patch in puppet swat. [13:47:13] only: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151110T1600 [13:47:17] oh wait [13:47:21] sorry, i read the wrong damned thing [13:47:25] that's normal swat. [13:47:28] that's for morning swat [13:47:28] :) [13:47:52] (03CR) 10RobH: "Mistaken, this wasnt puppet swat, but normal swat. Disregard my review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250471 (https://phabricator.wikimedia.org/T117410) (owner: 10Jforrester) [13:47:58] it has some magic jenkins failure.
[13:48:01] yea my bad [13:48:09] i have no puppet swat patches [13:48:17] I must really want them to start reading normal swat ;] [13:48:44] it probably needs a bit more publicity, last week was also only one patch for the entire week [13:53:01] (03CR) 10Mobrovac: "I think it's ok as a temp patch, but we should find an acceptable normal-day-prod threshold once recompactions induced by the strategy cha" [puppet] - 10https://gerrit.wikimedia.org/r/252207 (owner: 10Filippo Giunchedi) [14:02:46] <_joe_> bblack: about instrumentation, I already have a patch, just didn't submit it [14:02:48] !log swift codfw-prod: ms-be2017 / ms-be2019 / ms-be2021 weight 3000 [14:02:50] <_joe_> meh [14:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:02:52] <_joe_> sorry [14:03:52] moritzm: yea, i think for puppet swat i shall email ops list about how folks can add things to puppet swat [14:03:59] or engineering list... or both. [14:04:53] <_joe_> robh: not like we didn't do it already [14:05:39] <_joe_> robh: it seems that almost no one has patches for us atm [14:05:46] apergos: still waiting for those snapshot reviews [14:05:51] it's getting kinda ridiculous [14:05:54] that just means its successful early on =] [14:06:02] it took me less time to made them than ping you for reviewing them [14:06:12] make* [14:19:40] _joe_: robh I would not mind adding patches to puppet swat but the time slot is inconvenient :-\ [14:19:59] 0 unhandled critical hosts and services, that is good! [14:20:16] and lot of patches are reviewed / landed as part of the ops clinic I believe [14:21:37] <_joe_> hashar: no it's not [14:21:55] <_joe_> if that happens, that's why puppetswat doesn't get populated [14:22:49] 6operations, 10ops-esams, 10netops: Set up cr2-esams - https://phabricator.wikimedia.org/T118256#1795741 (10faidon) 3NEW a:3faidon [14:22:54] _joe_: feel free to use yours then. 
I just wanted to go ahead and get it running while I've got 1.12 on all the backup LVS, so we can make sure it doesn't cause issues. [14:23:17] 6operations, 10ops-esams: Power cr2-esams PEM 2/PEM 3 - https://phabricator.wikimedia.org/T118166#1795750 (10faidon) [14:23:18] 6operations, 10ops-esams, 10netops: Set up cr2-esams - https://phabricator.wikimedia.org/T118256#1795749 (10faidon) [14:23:22] <_joe_> bblack: yes, I'll merge that soon [14:23:27] <_joe_> err, submit [14:23:34] <_joe_> in ~ 1 hour tops [14:23:42] 6operations, 10ops-esams, 10netops: Set up cr2-esams - https://phabricator.wikimedia.org/T118256#1795741 (10faidon) [14:27:55] 6operations, 10netops: Figure out the source of QSFP+ errors with DAC + MX480 - https://phabricator.wikimedia.org/T118259#1795771 (10faidon) 3NEW a:3faidon [14:35:02] 6operations, 10netops: Figure out the source of QSFP+ errors with DAC + MX480 - https://phabricator.wikimedia.org/T118259#1795786 (10faidon) This is now Juniper Case ID [[ https://casemanager.juniper.net/casemanager/#/cmdetails/2015-1110-0298 | 2015-1110-0298 ]]. [14:35:38] hashar: if the timeslot is overall bad for folks, we could look at shifting one of them to a different time [14:35:46] ie: tuesday stays same, thursday shifts, or whatever [14:36:11] hashar: too late or too early for ya? [14:36:25] (i dont assume anything based on where folks live, we all have odd sleep schedules ;) [14:36:35] iirc that is 6pm - 7pm for me [14:36:47] so something a bit early in the EU hours would be preferred? [14:36:51] which is also the family busy hours (preparing dinners, taking care of kids etc) [14:37:01] maybe we can get one during east coast morning [14:37:11] which would be middle afternoon in europe. 
[14:37:18] I think that would be possible overall, we tend to have a lot of coverage then [14:37:18] might be useful for the east coast / europe cabal [14:37:43] and we tend to pair opsen on puppetswat so we would likely be able to cover that, i think, all just my viewpoint mind you [14:37:48] but that also mean it is in the middle of the afternoon work activity for european ops [14:38:08] (03PS1) 10Filippo Giunchedi: fix stdout/stderr shell redirection syntax [puppet] - 10https://gerrit.wikimedia.org/r/252222 [14:38:12] true, but when you sign up for puppetswat you kind of assume you'll lose personal productivity during that time [14:38:14] imo [14:38:22] in the end, the patch that are rather urgent tends to be reviewed / merged before I even ask about it [14:38:31] grrrit-wm: you ded? [14:38:38] nevermind [14:38:48] rest of my open patches are merely backlog items that are already cherry picked on the labs puppetmaster. So it is not much of a concern [14:47:21] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1795810 (10Qgil) Great news! I'm very happy to see that our investment in Spaces keeps paying off. So... does this mean that RT is now completely migrated, and not in active use anymore? [14:49:08] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1795812 (10Aklapper) >>! In T93760#1795810, @Qgil wrote: > So... does this mean that RT is now completely migrated, and not in active use anymore? 
Nope, see {T118176} [14:52:50] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add patch for CVE-2015-5307 (not merged upstream yet) [debs/linux] - 10https://gerrit.wikimedia.org/r/252174 (owner: 10Muehlenhoff) [14:56:13] (03Abandoned) 10Muehlenhoff: labservices1001: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247209 (owner: 10Muehlenhoff) [14:56:27] (03Abandoned) 10Muehlenhoff: holmium: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247198 (owner: 10Muehlenhoff) [15:00:31] (03CR) 10Matthias Mullie: "Withdrawing -2. This should be ok to merge." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249402 (https://phabricator.wikimedia.org/T94029) (owner: 10Matthias Mullie) [15:05:07] (03PS1) 10Muehlenhoff: Enable ferm on two additional kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/252224 [15:12:16] (03Abandoned) 10Filippo Giunchedi: restbase: raise cassandra pending compaction alerts [puppet] - 10https://gerrit.wikimedia.org/r/252207 (owner: 10Filippo Giunchedi) [15:12:33] (03PS1) 10Rush: dhcp: add labs-hosts1-b-codfw subnet definition [puppet] - 10https://gerrit.wikimedia.org/r/252226 (https://phabricator.wikimedia.org/T117097) [15:13:28] 6operations, 10ops-eqiad: reclaim lawrencium to spares - https://phabricator.wikimedia.org/T117477#1795863 (10Cmjohnson) Swap raid controllers from h710 to original h310 [15:13:49] PROBLEM - Host labcontrol1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:13:59] PROBLEM - Host labs-ns0.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [15:14:25] andrewbogott, chasemp ^ ? [15:14:39] yikes! 
I'm not doing anything there so that's unexpected afaik [15:14:44] * andrewbogott looks [15:15:01] (03CR) 10RobH: [C: 031] dhcp: add labs-hosts1-b-codfw subnet definition [puppet] - 10https://gerrit.wikimedia.org/r/252226 (https://phabricator.wikimedia.org/T117097) (owner: 10Rush) [15:15:40] RECOVERY - Host labcontrol1001 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [15:16:08] paravoid: and now it's back [15:16:14] wtf [15:16:19] RECOVERY - Host labs-ns0.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [15:17:13] andrewbogott: I think that may have been me [15:17:54] cmjohnson1: ah, ok. No harm done, in that case [15:17:56] (03CR) 10Rush: [C: 032] dhcp: add labs-hosts1-b-codfw subnet definition [puppet] - 10https://gerrit.wikimedia.org/r/252226 (https://phabricator.wikimedia.org/T117097) (owner: 10Rush) [15:26:13] ugh [15:26:17] * YuviPanda got paged [15:26:25] kind of [15:26:28] andrewbogott: chasemp all good? [15:26:31] PROBLEM - grafana-admin.wikimedia.org on krypton is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:26:39] * YuviPanda should go give his talk soon [15:27:00] (03PS4) 10Jcrespo: [WIP] Adding tendril::maintenance class [puppet] - 10https://gerrit.wikimedia.org/r/252208 [15:27:00] PROBLEM - salt-minion processes on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:27:00] PROBLEM - grafana.wikimedia.org on krypton is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:27:06] YuviPanda: caused a lot of failures from the dns blip but hopefully that is all shaking out here, we can ping you if issues persist :) [15:27:09] PROBLEM - puppet last run on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:27:19] PROBLEM - DPKG on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:27:20] PROBLEM - Check size of conntrack table on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:27:29] PROBLEM - Disk space on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:27:40] PROBLEM - configured eth on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:27:40] PROBLEM - HTTP on krypton is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:28:00] PROBLEM - RAID on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:28:11] PROBLEM - dhclient process on krypton is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:28:35] (03PS5) 10Jcrespo: [WIP] Adding tendril::maintenance class [puppet] - 10https://gerrit.wikimedia.org/r/252208 [15:32:54] !log upgrade rsync on ms-be1* per T93587 [15:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:13] (03PS6) 10Jcrespo: [WIP] Adding tendril::maintenance class [puppet] - 10https://gerrit.wikimedia.org/r/252208 [15:39:29] 6operations, 5Patch-For-Review, 7Swift: incompatible rsync transfers between rsync 3.0.9 and 3.1 (precise vs trusty) - https://phabricator.wikimedia.org/T93587#1795950 (10fgiunchedi) 5Open>3Resolved rsync has been upgraded in eqiad to 3.1.0, so far so good [15:45:15] (03PS6) 10Filippo Giunchedi: restbase: move to systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/244647 (https://phabricator.wikimedia.org/T103134) [15:45:18] (03PS2) 10Aude: Switch dispatcher subscriptionLookupMode to subscriptions only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252204 (https://phabricator.wikimedia.org/T112245) [15:49:52] (03PS7) 10Jcrespo: Adding tendril::maintenance class [puppet] - 10https://gerrit.wikimedia.org/r/252208 [15:51:29] RECOVERY - RAID on krypton is OK: OK: no RAID installed [15:51:39] RECOVERY - dhclient process on krypton is OK: PROCS OK: 0 processes with command name dhclient [15:51:41] !log restarted krypton [15:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:51:51] ^ it went completely down with all services.. 
[15:51:51] RECOVERY - grafana-admin.wikimedia.org on krypton is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 534 bytes in 0.005 second response time [15:52:09] 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1796007 (10chasemp) [15:52:29] RECOVERY - grafana.wikimedia.org on krypton is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 0.009 second response time [15:52:29] RECOVERY - salt-minion processes on krypton is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:52:29] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 30 minutes ago with 0 failures [15:52:40] RECOVERY - DPKG on krypton is OK: All packages OK [15:52:42] RECOVERY - Check size of conntrack table on krypton is OK: OK: nf_conntrack is 0 % full [15:52:44] let's not get used to ignoring these [15:52:50] RECOVERY - Disk space on krypton is OK: DISK OK [15:52:59] RECOVERY - HTTP on krypton is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 575 bytes in 0.007 second response time [15:52:59] RECOVERY - configured eth on krypton is OK: OK - interfaces up [15:56:20] 6operations, 10Analytics, 6Analytics-Kanban, 6Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1796023 (10Ottomata) Sounds good. Shall I just find a time and set one up? [15:57:04] (03PS8) 10Jcrespo: Adding tendril::maintenance class [puppet] - 10https://gerrit.wikimedia.org/r/252208 [15:58:08] and this is why scheduling downtimes for maintenance work is important - if we usually don't then everybody assumes a bunch of icinga-wm stuff just means "somebody is working on it" [15:58:29] (03PS1) 10Cmjohnson: Removing all polonium entries: site.pp entry role is spare but removing all together, updated smtp hosts to mx1001.wikimedia.org for hierdata and analytics.
[puppet] - 10https://gerrit.wikimedia.org/r/252229 [15:59:18] so [15:59:36] jzerebecki: kart_ I have made the ContentTranslation patch at https://gerrit.wikimedia.org/r/#/c/252172/ to depend on the Wikidata patch at https://gerrit.wikimedia.org/r/#/c/252227/ [15:59:48] I believe we can +2 the content translation one right now [15:59:53] Zuul is going to wait for the wikidata one to merge in [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151110T1600). Please do the needful. [16:00:05] James_F: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:10] and most probably magically trigger the gate-and-submit for the cx change whenever the wikidata one lands [16:00:36] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/1225/" [puppet] - 10https://gerrit.wikimedia.org/r/252208 (owner: 10Jcrespo) [16:01:48] James_F, hey [16:01:55] 6operations, 10ops-eqiad: neodymium stalling on checking hardware / after reinstall still pxe boots - https://phabricator.wikimedia.org/T118272#1796037 (10RobH) 3NEW a:3Cmjohnson [16:02:06] Krenair: SWAT'ng? [16:02:12] guess so [16:02:22] * aude waves [16:02:25] see hashar's comments above for CX change. [16:02:40] can we do wikidata - cx patch first?
[16:03:19] I would like to try to get the ContentTranslation patch https://gerrit.wikimedia.org/r/#/c/252172/ +2ed first [16:03:22] and see what happens ci wise [16:03:31] 6operations, 6Discovery, 5codfw-rollout: [EPIC] Set up a CirrusSearch cluster in codfw (Dallas, Texas) - https://phabricator.wikimedia.org/T105703#1796044 (10chasemp) 5Resolved>3Open reopening for the remaining subtask :) [16:03:48] then +2 the Wikidata one https://gerrit.wikimedia.org/r/#/c/252227/ [16:03:48] and in theory they should both land [16:04:30] https://gerrit.wikimedia.org/r/#/c/252172/ is +2'd [16:04:57] (03PS2) 10Cmjohnson: Removing all lead polonium entries: site.pp entry role is spare but removing all together, updated smtp hosts to mx1001.wikimedia.org for hierdata and analytics. [puppet] - 10https://gerrit.wikimedia.org/r/252229 [16:05:32] https://gerrit.wikimedia.org/r/#/c/252227/ is +2'd [16:06:10] hashar, ^ [16:06:18] too fast :-} but heck [16:06:37] from the test pipeline we know they are going to work just fine [16:06:42] ok [16:06:48] (at least the tests) [16:07:05] Zuul learned about cross-repository dependencies, I got it upgraded last week but have yet to announce / document it [16:07:56] how do you make a cross-repo dependency? [16:08:13] the depends-on line? [16:08:56] kart_: qunit passed :-) [16:09:00] Krenair: yeah [16:09:09] hashar: \o/ [16:09:26] Krenair: causes Zuul to search in Gerrit for changes matching the change-id. If a change is still open it will refuse to merge [16:09:45] great [16:09:48] so if you have A Depends-On: B, Zuul would not submit A until B got merged [16:10:09] Wikidata merged!
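hashar's description of Zuul's cross-repo dependency gating above (search Gerrit for the Change-Ids named in `Depends-On:` footers; refuse to submit A while B is still open) can be sketched roughly as follows. This is a toy model, not Zuul's actual data structures, and the change-ids are made up to echo the gerrit change numbers in this log:

```python
# Toy model of Depends-On gating as described above: a change is
# submittable only once none of its declared dependencies remain open.
def can_submit(change, open_change_ids):
    """Return True once no Depends-On target is still open in Gerrit."""
    return not any(dep in open_change_ids for dep in change["depends_on"])

cx_patch = {"id": "I252172", "depends_on": ["I252227"]}  # ContentTranslation
open_change_ids = {"I252227"}                            # Wikidata patch, open

print(can_submit(cx_patch, open_change_ids))  # False: dependency still open
open_change_ids.discard("I252227")            # the Wikidata patch merges
print(can_submit(cx_patch, open_change_ids))  # True: gate-and-submit proceeds
```

This matches the sequence in the log: the CX patch sat +2'd until the Wikidata change merged, at which point gate-and-submit fired for it.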
[16:11:21] (03CR) 10Bmansurov: Use CirrusSearch API in RelatedArticles on beta labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252022 (https://phabricator.wikimedia.org/T116707) (owner: 10Bmansurov) [16:11:35] so at least CI is happy [16:11:48] kart_: aude: I guess you can do the cherry pick dance now :-/ [16:13:14] (03PS4) 10Bmansurov: Use CirrusSearch API in RelatedArticles on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252022 (https://phabricator.wikimedia.org/T116707) [16:13:46] hashar: :) [16:13:52] CX patch is merged too [16:15:02] (03CR) 10RobH: [C: 04-1] "It looks good, except when returning to spares, completely remove the entry from dhcp. Since when it is put back in use, it may not use th" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/252229 (owner: 10Cmjohnson) [16:15:05] and for what it is worth, releng is discussing getting rid of those cherry picking steps [16:16:52] Krenair: Hey. [16:17:01] Oh, damn, didn't see the ping. [16:17:13] (03CR) 10RobH: "also neither lead nor polonium are listed on spares page, is there a task to reclaim them? (if so, please list in commit message.)" [puppet] - 10https://gerrit.wikimedia.org/r/252229 (owner: 10Cmjohnson) [16:17:52] (Yes, good to go.) [16:19:44] (03Abandoned) 10Faidon Liambotis: Add backports repository, but with a low priority [puppet] - 10https://gerrit.wikimedia.org/r/251042 (owner: 10Ori.livneh) [16:19:44] Krenair: ping me for test :) [16:21:31] 10Ops-Access-Requests, 6operations: Requesting access to rest base and cassandra nodes - https://phabricator.wikimedia.org/T117473#1796120 (10Nuria) I think I am a member of all the analytics groups that I need to be. Given that it seems that cassandra doesn't have a read only group I guess i need to be added...
[16:21:44] 10Ops-Access-Requests, 6operations: Requesting access to rest base and cassandra nodes - https://phabricator.wikimedia.org/T117473#1796121 (10Nuria) a:5Nuria>3kevinator [16:22:50] !log krenair@tin Synchronized php-1.27.0-wmf.5/extensions/Wikidata/Wikidata.php: https://gerrit.wikimedia.org/r/#/c/252227/ (duration: 00m 45s) [16:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:22:55] jzerebecki, kart_: ^ [16:24:03] \O/ [16:24:05] thx [16:24:10] kart_: thank you for your patience [16:24:18] jzerebecki: thanks a ton for the quick patch [16:24:19] cool. Testing now. [16:25:03] going out for some more meeting then dinner [16:25:14] Shall I sync the CX one now kart_? [16:26:28] Krenair: yes, please. [16:26:54] eh. I thought you sync CX :D [16:27:46] no... as the log shows, I synchronized the wikidata change... [16:27:54] yes :) [16:28:04] !log krenair@tin Synchronized php-1.27.0-wmf.5/extensions/ContentTranslation/api/ApiQueryContentTranslationSuggestions.php: https://gerrit.wikimedia.org/r/#/c/252172/ (duration: 00m 35s) [16:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:28:30] Krenair: tested. Working :) [16:28:49] Thanks jzerebecki, hashar and Krenair. [16:28:51] np [16:29:00] yw [16:29:27] (03PS2) 10Alex Monk: Enable VisualEditor for 25% of new accounts on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250471 (https://phabricator.wikimedia.org/T117410) (owner: 10Jforrester) [16:29:31] (03CR) 10Alex Monk: [C: 032] Enable VisualEditor for 25% of new accounts on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250471 (https://phabricator.wikimedia.org/T117410) (owner: 10Jforrester) [16:30:06] Thanks Krenair. [16:30:53] Krenair: I added https://gerrit.wikimedia.org/r/252231 as well if that's OK. [16:31:07] yep [16:31:12] Thank you. [16:31:49] * Krenair kicks jenkins [16:31:56] why is it not merging? 
[16:33:04] (03CR) 10Alex Monk: [C: 032] Enable VisualEditor for 25% of new accounts on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250471 (https://phabricator.wikimedia.org/T117410) (owner: 10Jforrester) [16:33:24] Krenair: Stuck in the queue? [16:33:35] Running now. [16:33:50] (03Merged) 10jenkins-bot: Enable VisualEditor for 25% of new accounts on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250471 (https://phabricator.wikimedia.org/T117410) (owner: 10Jforrester) [16:34:24] (03PS3) 10Cmjohnson: Removing all lead polonium entries: site.pp entry role is spare but removing all together, updated smtp hosts to mx1001.wikimedia.org for hierdata and analytics. bug: task T113962 Change-Id: I5f75d11723ecdf00734a1f876f82eb12ae7d8b34 [puppet] - 10https://gerrit.wikimedia.org/r/252229 [16:34:55] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/250471/ (duration: 00m 34s) [16:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:37:32] James_F, ^ [16:37:41] Krenair: Checked, seems OK. [16:39:55] aude, shall we do your config patches while waiting for jenkins? 
[16:40:04] sure [16:40:29] (03PS2) 10Alex Monk: Exclude LiquidThread namespaces from Special:UnconnectedPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250000 (https://phabricator.wikimedia.org/T117174) (owner: 10Aude) [16:40:32] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: puppet fail [16:40:34] (03CR) 10Alex Monk: [C: 032] Exclude LiquidThread namespaces from Special:UnconnectedPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250000 (https://phabricator.wikimedia.org/T117174) (owner: 10Aude) [16:41:00] (03Merged) 10jenkins-bot: Exclude LiquidThread namespaces from Special:UnconnectedPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250000 (https://phabricator.wikimedia.org/T117174) (owner: 10Aude) [16:41:26] (03PS1) 10Cmjohnson: Removing production dns entries for lead and polonium bug: task T113962 [dns] - 10https://gerrit.wikimedia.org/r/252232 [16:41:29] 250k changes :) [16:41:37] :D [16:41:57] !log krenair@tin Synchronized wmf-config/Wikibase.php: https://gerrit.wikimedia.org/r/#/c/250000/ (duration: 00m 34s) [16:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:42:02] aude, ^ please test [16:42:02] Krenair: Yup. And only half of them are i18n-bot. 
;-) [16:42:09] looks good [16:42:09] haha, true [16:42:26] (03CR) 10Cmjohnson: [C: 032] Removing all lead polonium entries: site.pp entry role is spare but removing all together, updated smtp hosts to mx1001.wikimedia.org for hi [puppet] - 10https://gerrit.wikimedia.org/r/252229 (owner: 10Cmjohnson) [16:42:39] (03PS3) 10Alex Monk: Switch dispatcher subscriptionLookupMode to subscriptions only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252204 (https://phabricator.wikimedia.org/T112245) (owner: 10Aude) [16:43:00] (03CR) 10Alex Monk: [C: 032] Switch dispatcher subscriptionLookupMode to subscriptions only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252204 (https://phabricator.wikimedia.org/T112245) (owner: 10Aude) [16:43:10] 6operations, 6Labs: labs precise instance not accessible after provisioning - https://phabricator.wikimedia.org/T117673#1796190 (10fgiunchedi) @andrew thoughts on what component/issue it might be causing this? [16:43:27] (03Merged) 10jenkins-bot: Switch dispatcher subscriptionLookupMode to subscriptions only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252204 (https://phabricator.wikimedia.org/T112245) (owner: 10Aude) [16:44:21] !log krenair@tin Synchronized wmf-config/Wikibase.php: https://gerrit.wikimedia.org/r/#/c/252204/ (duration: 00m 34s) [16:44:22] aude, please check ^ [16:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:44:38] looks ok [16:45:18] thanks :) [16:45:18] !log krenair@tin Synchronized wmf-config/Wikibase-labs.php: (no message) (duration: 00m 34s) [16:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:46:13] Supposedly jenkins is still going with that core commit [16:47:17] (03PS4) 10Dereckson: Tidy robots.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240065 (https://phabricator.wikimedia.org/T104251) (owner: 10Mdann52) [16:50:48] (03CR) 10Cmjohnson: [C: 032] "Removed all but mgmt entries for polonium 
and lead." [dns] - 10https://gerrit.wikimedia.org/r/252232 (owner: 10Cmjohnson) [16:51:19] !log krenair@tin Synchronized php-1.27.0-wmf.5/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.DesktopArticleTarget.init.js: https://gerrit.wikimedia.org/r/#/c/252231/ (duration: 00m 35s) [16:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:51:23] James_F, ^ please test [16:52:50] Krenair: Yup, works well. [16:52:58] ok, done [16:54:13] 6operations, 10ops-eqiad: Wipe Polonium and lead - https://phabricator.wikimedia.org/T118279#1796247 (10Cmjohnson) 3NEW a:3Cmjohnson [16:58:27] (03CR) 10Krinkle: "Fixed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250850 (owner: 10Reedy) [16:58:58] (03PS3) 10Krinkle: Update comments/hints for WMF MW version format changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244735 (owner: 10Reedy) [16:59:07] (03CR) 10Krinkle: [C: 031] Update comments/hints for WMF MW version format changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244735 (owner: 10Reedy) [17:00:04] Deploy window Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151110T1700) [17:06:52] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:13:42] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 4 below the confidence bounds [17:24:21] 6operations: install/setup/deploy neodymium as salt-master in eqiad - https://phabricator.wikimedia.org/T118210#1796345 (10Cmjohnson) [17:24:22] 6operations, 10ops-eqiad: neodymium stalling on checking hardware / after reinstall still pxe boots - https://phabricator.wikimedia.org/T118272#1796343 (10Cmjohnson) 5Open>3Resolved Pulled the 2 ssds and did a hard reset. 
[17:25:52] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-html-sections-lead/{title} is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-html-sections-lead responds with malformed body: NoneType object has no attribute __getitem__ [17:27:22] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 8 below the confidence bounds [17:32:18] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [17:33:40] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 17 data above and 9 below the confidence bounds [17:38:22] (03PS2) 10Dzahn: [Planet Wikimedia] Add Kunal Meta/legoktm to English Planet [puppet] - 10https://gerrit.wikimedia.org/r/252170 (owner: 10Nemo bis) [17:38:25] 6operations, 7Database: dbtree fails to render correctly on a new server (mw1152) both with zend php and hhvm - https://phabricator.wikimedia.org/T118159#1796366 (10Joe) 5Open>3Resolved [17:40:05] 6operations, 10ops-eqiad, 5Patch-For-Review: Return polonium/lead to spares - https://phabricator.wikimedia.org/T113962#1796371 (10Cmjohnson) Switch cfg changes cmjohnson@asw-c-eqiad# show |compare [edit interfaces interface-range vlan-public1-c-eqiad] - member ge-4/0/24; - member ge-7/0/32; [edit inte... 
[17:42:36] (03PS1) 10Giuseppe Lavagetto: pybal: don't write pool files using confd [puppet] - 10https://gerrit.wikimedia.org/r/252242 [17:42:38] (03PS1) 10Giuseppe Lavagetto: pybal: allow turning on using etcd for configuration [puppet] - 10https://gerrit.wikimedia.org/r/252243 [17:42:40] (03PS1) 10Giuseppe Lavagetto: pybal: add support for instrumentation [puppet] - 10https://gerrit.wikimedia.org/r/252244 [17:42:42] (03PS1) 10Giuseppe Lavagetto: pybal: install monitoring [puppet] - 10https://gerrit.wikimedia.org/r/252245 (https://phabricator.wikimedia.org/T102394) [17:42:53] (03CR) 10Dzahn: [C: 032] [Planet Wikimedia] Add Kunal Meta/legoktm to English Planet [puppet] - 10https://gerrit.wikimedia.org/r/252170 (owner: 10Nemo bis) [17:44:29] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-html-sections-lead/{title} is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-html-sections-lead responds with malformed body: NoneType object has no attribute __getitem__ [17:45:33] 6operations, 10ops-eqiad: neodymium stalling on checking hardware / after reinstall still pxe boots - https://phabricator.wikimedia.org/T118272#1796401 (10RobH) Chris is working on this again, as it seems stuck on "Scanning for devices. Please wait, this may take several minutes..." and its taken over 10. It... 
[17:46:29] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [17:48:54] robh [17:49:11] reseated dimm, working now [17:49:13] try again [17:49:51] will do [17:49:56] trying now [17:58:08] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-html-sections-lead/{title} is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-html-sections-lead responds with malformed body: NoneType object has no attribute __getitem__ [17:58:29] (03CR) 10Nemo bis: [Planet Wikimedia] Add Kunal Meta/legoktm to English Planet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/252170 (owner: 10Nemo bis) [17:58:47] robh: success? [17:58:53] appears that way on my end [17:59:10] yep [17:59:12] \o/ [17:59:19] sorry, i went afk to make a pb&j [17:59:23] apergos: ^ so yay [17:59:31] i'm about to dispatch to you, it has not had keys signed. [17:59:48] 6operations, 10ops-eqiad: neodymium stalling on checking hardware / after reinstall still pxe boots - https://phabricator.wikimedia.org/T118272#1796540 (10RobH) chris reseated a dimm, fixed it. [18:00:00] 6operations: install/setup/deploy neodymium as salt-master in eqiad - https://phabricator.wikimedia.org/T118210#1796543 (10RobH) [18:00:09] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [18:00:19] 6operations: install/setup/deploy neodymium as salt-master in eqiad - https://phabricator.wikimedia.org/T118210#1796554 (10RobH) a:5RobH>3ArielGlenn Jessie installed, returfing to @ArielGlenn.
have to reinstall [18:01:08] seems the os was having the issues i thought [18:01:13] jessie detected and installed on ssds [18:01:22] cmjohnson1: You may want to wipe those ssds in a spare system, sorry dude [18:01:26] (03PS1) 10Dzahn: peopleweb: add link to wikitech page [puppet] - 10https://gerrit.wikimedia.org/r/252248 [18:01:34] yep, already on that [18:01:37] so i set SAS as bootable, but jessie saw SSDs as sda and sdb and installed on them [18:01:52] so, rebooting into pxe again! [18:01:55] once more with feeling. [18:03:47] and since the ssds weren't set to bootable in the h310, they wouldn't boot the os [18:03:53] and it failed back into the endless loop of installing [18:03:56] (this has been seen before) [18:06:01] (03PS2) 10Dzahn: peopleweb: add link to wikitech page [puppet] - 10https://gerrit.wikimedia.org/r/252248 [18:06:39] 6operations: install/setup/deploy neodymium as salt-master in eqiad - https://phabricator.wikimedia.org/T118210#1796595 (10RobH) a:5ArielGlenn>3RobH I stand corrected, it booted into ubuntu (old os on disks). installing now. [18:07:01] meh, no root filesystem defined... wtf system [18:11:13] (03PS2) 10Ori.livneh: fix stdout/stderr shell redirection syntax [puppet] - 10https://gerrit.wikimedia.org/r/252222 (owner: 10Filippo Giunchedi) [18:11:35] (03CR) 10Ori.livneh: [C: 032 V: 032] "Wow, amazing. Three of those are mine. Don't tell anyone or I'll have my POSIX license revoked." [puppet] - 10https://gerrit.wikimedia.org/r/252222 (owner: 10Filippo Giunchedi) [18:12:59] (03PS3) 10Ori.livneh: Add an Icinga check for Graphite metric freshness [puppet] - 10https://gerrit.wikimedia.org/r/251675 [18:19:38] 6operations, 10ops-eqiad, 5Patch-For-Review: Return polonium/lead to spares - https://phabricator.wikimedia.org/T113962#1796713 (10Cmjohnson) 5Open>3Resolved @robh added these to server spares page.
Resolving the task [18:20:10] 6operations, 10ops-eqiad: Wipe Polonium and lead - https://phabricator.wikimedia.org/T118279#1796719 (10Cmjohnson) 5Open>3Resolved Disks are wiped. [18:21:38] 6operations, 10ops-eqiad: reclaim lawrencium to spares - https://phabricator.wikimedia.org/T117477#1796722 (10Cmjohnson) 5Open>3Resolved Disks are wiped, already listed on server spares page. [18:22:47] (03PS1) 10Southparkfan: Major overhaul of Main Page [debs/wikistats] - 10https://gerrit.wikimedia.org/r/252249 [18:23:22] 6operations, 10ops-eqiad, 10netops: test new sfp-t - https://phabricator.wikimedia.org/T118178#1796728 (10Cmjohnson) Inserted in asw-d7-eqiad:24 [18:24:02] ebernhardson: I think search updates lag for wikitechwiki. Consequence of job runner situation? [18:24:19] 6operations, 10ops-eqiad, 10netops: test new sfp-t - https://phabricator.wikimedia.org/T118178#1796738 (10Cmjohnson) 5Open>3Resolved the sfp does show up Xcvr 24 REV 01 740-013111 ADD18DF81958 SFP-T [18:24:55] bblack: you around? wanted to check with you about aiding an a/b test for reading's strategic tests stuff. basically, the thought is to *carefully* construct a redirect w/ cookie logic to send 1/1000 unauthenticated desktop users (and subsequently cookie'd users who don't then later intentionally click Desktop) to mdot in order to measure differences in [18:24:55] engagement. bd808 and i were doing the thought experiment on this last week [18:26:27] ostriches: perhaps. I've been seeing some strange things in the job queue that i can't explain [18:26:48] Hmm [18:27:01] ostriches: for example, yesterday enwiki had 177090 cirrusSearchElasticaWrite jobs. Today it has 177091 [18:27:11] this morning it still had 177090. I have no clue why they aren't being run [18:28:05] ostriches: how long are they taking to do the update? [18:28:25] I updated some pages over the last 10 mins, still no update in search results.
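The 1/1000 mdot redirect idea floated above (deterministic sampling, a sticky cookie so a user stays in one experience, and an explicit opt-out for anyone who intentionally clicks Desktop) might look roughly like the sketch below. All of the names, cookie keys, and the hashing scheme are assumptions for illustration, not WMF's actual Varnish/MediaWiki implementation:

```python
import hashlib

SAMPLE_ONE_IN = 1000  # send ~1 in 1000 unauthenticated desktop users to mdot

def in_test_bucket(token: str) -> bool:
    """Deterministically sample a client token at a 1-in-1000 rate."""
    digest = int(hashlib.sha256(token.encode()).hexdigest(), 16)
    return digest % SAMPLE_ONE_IN == 0

def choose_site(cookies: dict, token: str) -> tuple[str, dict]:
    """Pick desktop vs mobile: honor an explicit Desktop click first,
    otherwise sample once and make the decision sticky via a cookie."""
    if cookies.get("forceDesktop") == "1":  # user clicked "Desktop": opt out
        return "desktop", cookies
    if "mdotTest" not in cookies:           # first visit: sample once
        cookies = {**cookies, "mdotTest": "1" if in_test_bucket(token) else "0"}
    site = "mobile" if cookies["mdotTest"] == "1" else "desktop"
    return site, cookies

site, cookies = choose_site({}, "reader-123")
print(site, cookies)  # repeat visits with these cookies get the same site
```

The sticky cookie is what makes the engagement comparison meaningful: without it, each pageview would be re-sampled and no user would have a consistent experience to measure.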
[18:28:56] ebernhardson: Also, enwiki: [18:28:58] cirrusSearchElasticaWrite: 0 queued; 0 claimed (0 active, 0 abandoned); 177091 delayed [18:29:09] ostriches: that's what i just said 4 lines above ;) [18:29:30] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [18:29:30] Delayed. [18:29:35] I'm not familiar with that status. [18:29:39] * ostriches is behind on jq. [18:29:47] ostriches: i have to do a standup, back in 10 minutes [18:29:53] k [18:29:55] ostriches: delayed means they have a 'release after' timestamp [18:30:05] but i'm not sure we are properly releasing them... [18:30:59] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [18:32:59] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [18:33:39] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [18:34:08] PROBLEM - YARN NodeManager Node-State on analytics1031 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:34:09] PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:35:58] RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: OK: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: RUNNING [18:36:29] PROBLEM - puppet last run on mw2169 is CRITICAL: CRITICAL: Puppet has 1 failures [18:36:40] PROBLEM - SSH on analytics1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:36:48] PROBLEM - RAID on analytics1031 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:36:48] PROBLEM - Disk space on Hadoop worker on analytics1031 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:36:59] PROBLEM - Check size of conntrack table on analytics1031 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
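ebernhardson's explanation of the "delayed" status above — a job carries a release-after timestamp and sits in the queue unclaimed until that moment passes (and stays stuck if nothing releases it) — can be modeled with a small sketch. This is a toy, not MediaWiki's actual JobQueue code:

```python
import heapq
import time

class DelayedJobQueue:
    """Toy queue where a delayed job carries a release-after timestamp
    and cannot be claimed before that moment."""

    def __init__(self):
        self._heap = []  # (release_after, job) pairs, soonest first

    def push(self, job, delay=0.0, now=None):
        """Enqueue a job that becomes claimable after `delay` seconds."""
        now = time.time() if now is None else now
        heapq.heappush(self._heap, (now + delay, job))

    def pop_ready(self, now=None):
        """Claim the next released job, or None if everything is delayed."""
        now = time.time() if now is None else now
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[1]
        return None

q = DelayedJobQueue()
q.push("cirrusSearchElasticaWrite", delay=60, now=1000.0)
print(q.pop_ready(now=1030.0))  # None: still delayed (releases at t=1060)
print(q.pop_ready(now=1061.0))  # the job: its release-after has passed
```

The enwiki symptom in the log (177k jobs counted as delayed, 0 queued, 0 claimed, and the number barely moving) is consistent with the release step never running, which is exactly the "not sure we are properly releasing them" suspicion.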
[18:36:59] PROBLEM - salt-minion processes on analytics1031 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:37:19] PROBLEM - puppet last run on analytics1031 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:37:19] PROBLEM - Disk space on analytics1031 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:37:29] PROBLEM - Hadoop DataNode on analytics1031 is CRITICAL: Timeout while attempting connection [18:37:29] PROBLEM - configured eth on analytics1031 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:37:48] PROBLEM - DPKG on analytics1031 is CRITICAL: Timeout while attempting connection [18:37:48] PROBLEM - Hadoop NodeManager on analytics1031 is CRITICAL: Timeout while attempting connection [18:38:08] PROBLEM - dhclient process on analytics1031 is CRITICAL: Timeout while attempting connection [18:38:17] hnmm k [18:39:04] hmm, this is just because the node is busy and icinga has trouble reaching it [18:39:46] i think... [18:39:55] the 1030 one for sure is [18:41:29] hm, 1031 different [18:41:30] [15042236.643855] BUG: soft lockup - CPU#23 stuck for 22s!
[du:19704] [18:41:30] hm [18:41:45] !log powercycling analytics1031 [18:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:42:08] PROBLEM - Host analytics1031 is DOWN: PING CRITICAL - Packet loss = 100% [18:44:59] RECOVERY - Disk space on analytics1031 is OK: DISK OK [18:44:59] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 28 minutes ago with 0 failures [18:45:08] RECOVERY - Host analytics1031 is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [18:45:10] RECOVERY - Hadoop DataNode on analytics1031 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [18:45:10] RECOVERY - configured eth on analytics1031 is OK: OK - interfaces up [18:45:19] RECOVERY - DPKG on analytics1031 is OK: All packages OK [18:45:28] RECOVERY - Hadoop NodeManager on analytics1031 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:45:38] RECOVERY - YARN NodeManager Node-State on analytics1031 is OK: OK: YARN NodeManager analytics1031.eqiad.wmnet:8041 Node-State: RUNNING [18:45:49] RECOVERY - dhclient process on analytics1031 is OK: PROCS OK: 0 processes with command name dhclient [18:46:09] RECOVERY - SSH on analytics1031 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [18:46:19] RECOVERY - RAID on analytics1031 is OK: OK: optimal, 13 logical, 14 physical [18:46:19] RECOVERY - Disk space on Hadoop worker on analytics1031 is OK: DISK OK [18:46:29] RECOVERY - salt-minion processes on analytics1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:46:30] RECOVERY - Check size of conntrack table on analytics1031 is OK: OK: nf_conntrack is 0 % full [18:48:30] (03PS3) 10Dzahn: peopleweb: add link to wikitech page [puppet] - 10https://gerrit.wikimedia.org/r/252248 [18:50:46] 6operations: install/setup/deploy neodymium as salt-master in eqiad - 
https://phabricator.wikimedia.org/T118210#1796844 (10RobH) a:5RobH>3ArielGlenn Ok, confirmed now it has jessie, and it's rebooted into it properly multiple times. [18:53:19] PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:54:58] PROBLEM - cassandra CQL 10.192.16.152:9042 on restbase2001 is CRITICAL: Connection refused [18:55:09] RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: OK: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: RUNNING [18:58:07] (03CR) 10Dzahn: [C: 032] peopleweb: add link to wikitech page [puppet] - 10https://gerrit.wikimedia.org/r/252248 (owner: 10Dzahn) [18:59:49] (03CR) 10Dzahn: [C: 031] Move holmium and labservices1001 to the labsdns role [puppet] - 10https://gerrit.wikimedia.org/r/252212 (owner: 10Muehlenhoff) [19:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151110T1900).
[19:00:24] (03CR) 1020after4: [C: 032] Prepare to enable QuickSurveys in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) (owner: 10Jdlrobson) [19:01:07] (03Merged) 10jenkins-bot: Prepare to enable QuickSurveys in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) (owner: 10Jdlrobson) [19:02:54] 6operations: install/setup/deploy neodymium as salt-master in eqiad - https://phabricator.wikimedia.org/T118210#1796902 (10ArielGlenn) [19:03:16] fyi https://www.mediawiki.org/wiki/Developers/Maintainers#Operations.2Fsystems_administration has quite a few interesting links for people who are responsible for things -- I'm updating it slightly but still, it's fun to read :) [19:04:11] RECOVERY - puppet last run on mw2169 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:09:20] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [19:12:12] (03PS1) 10ArielGlenn: set up neodymium to be secondary salt master for testing [puppet] - 10https://gerrit.wikimedia.org/r/252256 [19:14:58] (03PS4) 10Ori.livneh: Add an Icinga check for Graphite metric freshness [puppet] - 10https://gerrit.wikimedia.org/r/251675 [19:18:44] (03CR) 10ArielGlenn: [C: 032] set up neodymium to be secondary salt master for testing [puppet] - 10https://gerrit.wikimedia.org/r/252256 (owner: 10ArielGlenn) [19:20:11] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [19:20:51] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). 
[19:21:08] (03CR) 10Gilles: [C: 031] swift: monitor mediawiki originals upload rate [puppet] - 10https://gerrit.wikimedia.org/r/251526 (https://phabricator.wikimedia.org/T92322) (owner: 10Filippo Giunchedi) [19:22:00] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:18] (03PS1) 10Dzahn: tungsten: re-add to puppet,install as testsystem [puppet] - 10https://gerrit.wikimedia.org/r/252257 (https://phabricator.wikimedia.org/T117888) [19:23:41] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 1.168 second response time [19:25:16] 6operations: install/setup/deploy neodymium as salt-master in eqiad - https://phabricator.wikimedia.org/T118210#1796962 (10ArielGlenn) [19:26:33] 6operations: install/setup/deploy neodymium as salt-master in eqiad - https://phabricator.wikimedia.org/T118210#1796965 (10ArielGlenn) 5Open>3Resolved can close this now and continue the rest on T115287 [19:28:46] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1796971 (10ArielGlenn) Salt master role plus deployment role plus debdeploy role added. master key is different than old master so I need to look at that. Additionally I wonder about having tw...
[19:34:49] (03CR) 10Dzahn: "fundraising.pp seems still alive" [puppet] - 10https://gerrit.wikimedia.org/r/249347 (owner: 10Dzahn) [19:37:32] (03PS2) 10Dzahn: tungsten: re-add to puppet,install as testsystem [puppet] - 10https://gerrit.wikimedia.org/r/252257 (https://phabricator.wikimedia.org/T117888) [19:39:19] mutante: \o/ <3 [19:40:10] (03CR) 10Dzahn: [C: 032] tungsten: re-add to puppet,install as testsystem [puppet] - 10https://gerrit.wikimedia.org/r/252257 (https://phabricator.wikimedia.org/T117888) (owner: 10Dzahn) [19:40:20] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 404 (expecting: 200) [19:40:50] (03CR) 10Dzahn: "if i abandon it, i feel we might just forget about it. where/who should i ping about the refactoring you mentioned" [puppet] - 10https://gerrit.wikimedia.org/r/249347 (owner: 10Dzahn) [19:40:50] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 404 (expecting: 200) [19:41:41] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 404 (expecting: 200) [19:44:11] (03PS1) 10Merlijn van Deen: HBA: make sure UseDNS is set [puppet] - 10https://gerrit.wikimedia.org/r/252258 (https://phabricator.wikimedia.org/T116687) [19:47:09] PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:48:29] 6operations, 10ops-eqiad, 5Patch-For-Review: reclaim tungsten as spare - https://phabricator.wikimedia.org/T97274#1797067 (10Dzahn) [19:48:30] 6operations, 6Performance-Team, 5Patch-For-Review: Allocate tungsten for InfluxDB - https://phabricator.wikimedia.org/T117888#1797066 (10Dzahn) [19:52:40] PROBLEM - RAID on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:52:49] PROBLEM - puppet last run on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:53:32] (03PS1) 10Dzahn: tungsten: set installer to jessie [puppet] - 10https://gerrit.wikimedia.org/r/252259 (https://phabricator.wikimedia.org/T117888) [19:54:09] (03CR) 10Dzahn: [C: 032] tungsten: set installer to jessie [puppet] - 10https://gerrit.wikimedia.org/r/252259 (https://phabricator.wikimedia.org/T117888) (owner: 10Dzahn) [19:54:20] RECOVERY - RAID on analytics1030 is OK: OK: optimal, 13 logical, 14 physical [19:54:30] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [19:54:31] RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: OK: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: RUNNING [20:02:29] PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:05:35] (03PS2) 10Muehlenhoff: Enable ferm on two additional kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/252224 [20:06:19] RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: OK: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: RUNNING [20:07:12] (03CR) 10Ottomata: [C: 031] Enable ferm on two additional kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/252224 (owner: 10Muehlenhoff) [20:08:52] <_joe_> !log restarting jobchron across the cluster [20:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:09:49] !log disabled puppet on kafk1012/1013 in preparation of ferm activation [20:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:10:22] (03PS1) 10Dzahn: tungsten: use db.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/252265 [20:11:00] (03CR) 10Dzahn: [C: 032] tungsten: use db.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/252265 (owner: 10Dzahn) [20:11:12] (03CR) 1020after4: [C: 032] Update comments/hints for WMF MW version format changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244735 (owner: 10Reedy) [20:11:44] (03Merged) 10jenkins-bot: Update comments/hints for WMF MW version format changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244735 (owner: 10Reedy) [20:12:00] PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:12:53] (03CR) 1020after4: [C: 031] Allow viewing php files in raw format in browsers [puppet] - 10https://gerrit.wikimedia.org/r/251034 (https://phabricator.wikimedia.org/T117621) (owner: 10Paladox) [20:13:49] RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: OK: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: RUNNING [20:14:21] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. 
[20:14:23] !log stopping kafka broker on kafka1012 to apply base ferm [20:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:18:22] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on two additional kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/252224 (owner: 10Muehlenhoff) [20:18:30] (03PS3) 10Muehlenhoff: Enable ferm on two additional kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/252224 [20:19:02] (03CR) 10Muehlenhoff: [V: 032] Enable ferm on two additional kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/252224 (owner: 10Muehlenhoff) [20:19:26] mutante: I merged your tungsten patch along [20:20:41] moritzm: thanks, got distracted [20:26:13] !log starting kafka broker on kafka1012 after applying base ferm [20:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:28:02] (03PS6) 10Jhobs: First QuickSurvey for reader segmentation research - external survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251133 (https://phabricator.wikimedia.org/T113443) (owner: 10Jdlrobson) [20:28:20] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [24.0] [20:29:14] (03PS2) 10Muehlenhoff: Move holmium and labservices1001 to the labsdns role [puppet] - 10https://gerrit.wikimedia.org/r/252212 [20:29:48] (03CR) 10Muehlenhoff: [C: 032 V: 032] Move holmium and labservices1001 to the labsdns role [puppet] - 10https://gerrit.wikimedia.org/r/252212 (owner: 10Muehlenhoff) [20:34:07] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [20:34:22] 6operations, 6Labs: labs precise instance not accessible after provisioning - https://phabricator.wikimedia.org/T117673#1797373 (10Andrew) It is probably the result of a leaked host record and/or a leaked dns entry. And, indeed, I was mucking around with the dns backend a couple of days ago so you might have... 
[20:34:46] uhm... did the user sticky bit get set recently on /srv/mediawiki-staging? (tin) [20:41:07] (03PS1) 10Halfak: Adds de, he, it, nl and vi *spell packages to ores/base.pp [puppet] - 10https://gerrit.wikimedia.org/r/252277 [20:41:58] !log stopping kafka broker on kafka1013 to activate ferm [20:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:46:45] (03PS7) 10Jhobs: First QuickSurvey for reader segmentation research - external survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251133 (https://phabricator.wikimedia.org/T113443) (owner: 10Jdlrobson) [20:47:50] !kafka re-enabled on 1013 [20:48:00] !log kafka re-enabled on kafka1013 [20:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:48:53] (03PS1) 10Muehlenhoff: Assign salt grains for labsdns role [puppet] - 10https://gerrit.wikimedia.org/r/252280 [20:51:09] anybody know? deployment train is blocked because I don't have permissions to create the branch checkout [20:51:43] drwxrwxr-x 16 twentyafterfour wikidev 4096 Oct 29 19:38 php-1.27.0-wmf.4 [20:51:45] drwxrwxr-x 16 twentyafterfour wikidev 4096 Nov 5 22:30 php-1.27.0-wmf.5 [20:51:47] drwxr-sr-x 16 mwdeploy wikidev 4096 Nov 10 20:22 php-1.27.0-wmf.6 [20:51:55] php-1.27.0-wmf.6 is owned by mwdeploy :-/ [20:52:09] twentyafterfour: there was a patch about it I think! /me finds [20:52:16] twentyafterfour: Why not? [20:52:35] can't you just sudo as mwdeploy? [20:52:55] I can just sudo -u mwdeploy -i [20:53:22] 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1797412 (10Andrew) [20:53:44] Krenair: maybe? I never had to before [20:54:22] (I know it's a workaround for something you should be able to do as your own user, but does this really block the train?) [20:55:21] halfak: where is the ores module used? is this labs-only? 
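For context on the sticky-bit question above: a setgid directory (the bit that later shows up in the channel as the 's' in drwxrwsr-x) only makes new files inherit the directory's *group*; it never changes the *user* owner, so it cannot by itself explain files ending up owned by mwdeploy. A throwaway sketch, using a temp directory rather than tin's real /srv/mediawiki-staging:

```shell
# Demo of setgid on a directory. 2775 = setgid + rwxrwxr-x, which is
# exactly the drwxrwsr-x mode seen on mediawiki-staging.
tmp=$(mktemp -d)
chmod 2775 "$tmp"
ls -ld "$tmp"              # shows drwxrwsr-x -- the 's' is setgid, not sticky

touch "$tmp/newfile"
ls -l "$tmp/newfile"       # group matches the directory's group;
                           # the user owner is still whoever ran touch
rm -r "$tmp"
```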
[20:55:33] Krenair: I'm not entirely sure, checkoutMediawiki does a lot of things. I'll try it [20:55:42] mutante, labs currently. We just finished our security review, so prod soonish. [20:55:57] halfak: gotcha, thanks. let me merge that change to add the spell packages [20:56:06] \o/ thank you! [20:56:23] (03PS2) 10Dzahn: Adds de, he, it, nl and vi *spell packages to ores/base.pp [puppet] - 10https://gerrit.wikimedia.org/r/252277 (owner: 10Halfak) [20:56:31] (03CR) 10Dzahn: [C: 032] Adds de, he, it, nl and vi *spell packages to ores/base.pp [puppet] - 10https://gerrit.wikimedia.org/r/252277 (owner: 10Halfak) [20:58:05] halfak: yw. you should see a change now. merged on master [20:58:14] * halfak runs puppet [20:58:15] Thanks [21:00:36] (03PS2) 10Muehlenhoff: Assign salt grains for labsdns role [puppet] - 10https://gerrit.wikimedia.org/r/252280 [21:00:40] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for labsdns role [puppet] - 10https://gerrit.wikimedia.org/r/252280 (owner: 10Muehlenhoff) [21:00:53] (03CR) 10Dzahn: [C: 031] Allow viewing php files in raw format in browsers [puppet] - 10https://gerrit.wikimedia.org/r/251034 (https://phabricator.wikimedia.org/T117621) (owner: 10Paladox) [21:01:16] twentyafterfour: ^ fine if we just do it? 
[21:01:31] mutante: I think so [21:02:01] RECOVERY - configured eth on labtestservices2001 is OK: OK - interfaces up [21:02:23] (03PS3) 10Dzahn: Allow viewing php files in raw format in browsers [puppet] - 10https://gerrit.wikimedia.org/r/251034 (https://phabricator.wikimedia.org/T117621) (owner: 10Paladox) [21:02:49] RECOVERY - DPKG on labtestservices2001 is OK: All packages OK [21:03:10] RECOVERY - Disk space on labtestservices2001 is OK: DISK OK [21:03:29] (03CR) 10Dzahn: [C: 032] "as Paladox says, upstream describes this as a workaround (https://secure.phabricator.com/T8170#143157)" [puppet] - 10https://gerrit.wikimedia.org/r/251034 (https://phabricator.wikimedia.org/T117621) (owner: 10Paladox) [21:03:41] RECOVERY - RAID on labtestservices2001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [21:04:00] RECOVERY - dhclient process on labtestservices2001 is OK: PROCS OK: 0 processes with command name dhclient [21:04:11] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [21:06:29] RECOVERY - salt-minion processes on labtestservices2001 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:06:45] ostriches: hey, i'm kinda new to SWAT deployments, just wanna make sure I did everything right for adding https://gerrit.wikimedia.org/r/#/c/251133 to the Evening SWAT. (I updated the Deployments calendar.) [21:06:55] (03PS1) 10Halfak: Replaces duplicate myspell-de-ch with myspell-de-de in ores/base.pp [puppet] - 10https://gerrit.wikimedia.org/r/252282 [21:06:56] twentyafterfour: actually.. it does not seem to be either "actual result" nor "expected result" as in the bug description now :p [21:08:27] 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1797463 (10Andrew) [21:09:01] mutante: oh? 
[21:09:20] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [24.0] [21:10:11] twentyafterfour: when i click the "raw" link i just get to another page and from there i get a download link [21:10:22] if i follow his bug description it would have to show in browser now [21:10:28] 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1766532 (10Andrew) [21:10:33] and yea, what he says is what upstream said to do [21:10:53] i'll let Paladox confirm [21:12:38] !log tungsten - signed new puppet certs, re-installed [21:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:13:10] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [21:14:28] !log aqs1001-1003: CRITICAL: Test Get aggregate page views [21:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:14:43] !log restbase2001: cassandra CQL refused [21:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:14:49] mutante: I think this is probably the best we're gonna get [21:15:00] (03CR) 10Muehlenhoff: "Let's treat removing the exception independent of that change (which only fixes the role use). I don't believe paramiko has caught up with" [puppet] - 10https://gerrit.wikimedia.org/r/250659 (owner: 10Muehlenhoff) [21:15:05] twentyafterfour: isn't it like before though? [21:15:35] mutante: maybe I misunderstood, but I thought it didn't actually let you see the file (or even download it?) before ...
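The "shows in browser" vs "download link" debate above can be settled without clicking around: servers force a download by sending a `Content-Disposition: attachment` header; with `inline` or no such header, the browser renders the body. A small sketch — the helper name `check_disposition` is made up here, and the canned header stands in for a live request to the actual Phabricator raw URL:

```shell
# check_disposition reads HTTP response headers on stdin and reports
# whether the body would render in the browser or download as a file.
check_disposition() {
  if grep -qi '^content-disposition:.*attachment'; then
    echo "forces a download"
  else
    echo "renders in the browser"
  fi
}

# Against a live URL this would be:  curl -sI "$RAW_URL" | check_disposition
# (RAW_URL being the raw-file link under test). Canned stand-in:
printf 'HTTP/1.1 200 OK\r\nContent-Disposition: attachment; filename=x.php\r\n' | check_disposition
```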
[21:15:37] !log mira unmerged changes in mw-staging [21:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:15:53] !log scb1001 - mobile endpoint health CRIT [21:15:56] mutante: afaik that's the host godog is converting to multi-instance [21:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:16:13] mutante: re restbase2001 [21:16:52] twentyafterfour: as far as i understood it you could always download it but this should directly show it in browser [21:17:05] (03PS1) 1020after4: 1.27.0-wmf.6 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252284 [21:17:06] gwicke: thanks,ok [21:17:46] (03CR) 1020after4: [C: 032] 1.27.0-wmf.6 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252284 (owner: 1020after4) [21:17:49] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-html-sections-lead/{title} is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-html-sections-lead responds with malformed body: NoneType object has no attribute __getitem__ [21:18:11] (03Merged) 10jenkins-bot: 1.27.0-wmf.6 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252284 (owner: 1020after4) [21:18:15] gwicke: this? "T95253: Test multiple Cassandra instances per hardware node" [21:18:39] mutante: yeah, currently looking for a SAL entry [21:18:49] just saw his latest comment. 
yeah that fits [21:18:59] 6operations, 10fundraising-tech-ops: remove fundraising banner log related cruft from production puppet - https://phabricator.wikimedia.org/T118325#1797487 (10Jgreen) 3NEW [21:19:11] ACKNOWLEDGEMENT - cassandra CQL 10.192.16.152:9042 on restbase2001 is CRITICAL: Connection refused daniel_zahn https://phabricator.wikimedia.org/T95253 [21:19:13] heard from urandom that godog was in the process of decommissioning restbase2001 in preparation for the move to multi-instance [21:19:17] now i wonder about the pending changes on mira [21:19:35] 2015-11-09 13:16 godog: nodetool decomission restbase2001 [21:19:40] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [21:19:58] 6operations, 10fundraising-tech-ops: remove fundraising banner log related cruft from production puppet - https://phabricator.wikimedia.org/T118325#1797494 (10Jgreen) [21:20:03] gwicke: *nod*, thanks "I've started nodetool decommission yesterday morning on restbase2001, still going" [21:20:59] (03CR) 10Jgreen: "This is tracked with the blocking dependency in https://phabricator.wikimedia.org/T118325" [puppet] - 10https://gerrit.wikimedia.org/r/249347 (owner: 10Dzahn) [21:21:20] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [21:24:46] 6operations, 6Performance-Team, 5Patch-For-Review: Allocate tungsten for InfluxDB - https://phabricator.wikimedia.org/T117888#1797508 (10Dzahn) re-added tungsten to DHCP/puppet. reinstalled with jessie. initial puppet run has happened, users have been created. 
[21:24:54] so yeah I can't apply security patches because mwdeploy can't read the patch files and my user can't create /srv/mediawiki-staging/php-1.27.0-wmf.6/.git/rebase-apply [21:25:01] 6operations, 6Performance-Team: Allocate tungsten for InfluxDB - https://phabricator.wikimedia.org/T117888#1797516 (10Dzahn) [21:27:00] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [21:27:12] cp the patches, and chown mwdeploy? [21:27:33] 6operations, 6Performance-Team: Allocate tungsten for InfluxDB - https://phabricator.wikimedia.org/T117888#1797523 (10Dzahn) 5Open>3Resolved [21:30:50] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 4 below the confidence bounds [21:31:04] it seems that the issue that twentyafterfour describes also causes this: [21:31:16] mira: There are 5 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). 
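Applying patches as a role user like mwdeploy trips over two separate things: filesystem permissions, and git refusing to commit without a committer identity. The identity half can be supplied per-invocation with `-c`, without copying anyone's ~/.gitconfig. A self-contained sketch of the mechanics in a scratch repo — nothing in the log confirms this exact approach was used, and the names/paths are illustrative:

```shell
# Build a scratch repo, make a patch, rewind, then re-apply it with
# git am while supplying identity via -c (instead of a ~/.gitconfig).
work=$(mktemp -d)
git init -q "$work/repo" && cd "$work/repo"
git -c user.name=demo -c user.email=demo@example.org commit -q --allow-empty -m base

echo hello > file.txt && git add file.txt
git -c user.name=demo -c user.email=demo@example.org commit -q -m 'add file'
git format-patch -q -1 -o "$work"

git reset -q --hard HEAD^
# Without the -c flags (and with no .gitconfig) this am step is what
# fails with "committer identity unknown":
HOME="$work" git -c user.name=demo -c user.email=demo@example.org am "$work"/0001-*.patch
git log --oneline -1
cd / && rm -rf "$work"
```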
[21:31:19] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [21:31:30] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [21:31:36] ah :) @ aqs [21:32:05] Reedy: yes but I shouldn't have to do that to deploy the train [21:32:16] I don't disagree [21:32:36] why was it changed in the first place is what I want to know, it doesn't seem like the extra hassle is justified and it wasn't announced anywhere that I saw [21:33:03] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [21:37:06] 6operations: Altert when used_memory gets too high for redis queues - https://phabricator.wikimedia.org/T118331#1797560 (10aaron) [21:37:20] 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1797562 (10Andrew) [21:37:24] 6operations: Alert when used_memory gets too high for redis queues - https://phabricator.wikimedia.org/T118331#1797563 (10ori) [21:39:51] also running git am as mwdeploy doesn't work because git needs to know committer information [21:39:57] so yet again blocked [21:40:38] (03PS2) 10Dzahn: Move role declaration earlier to make the role keyword work [puppet] - 10https://gerrit.wikimedia.org/r/250659 (owner: 10Muehlenhoff) [21:41:48] does someone else need to do the train deployment today? [21:42:40] who can do it? 
someone with root could fix permissions on /srv/mediawiki-staging then I could do it [21:42:59] (03PS3) 10Dzahn: labcontrol1002: make the nova controller role keyword work [puppet] - 10https://gerrit.wikimedia.org/r/250659 (owner: 10Muehlenhoff) [21:43:08] Krenair: I am able to sudo mwdeploy, but that isn't enough to prep the new branch properly [21:43:29] (03CR) 10Dzahn: [C: 032] labcontrol1002: make the nova controller role keyword work [puppet] - 10https://gerrit.wikimedia.org/r/250659 (owner: 10Muehlenhoff) [21:45:09] andrewbogott: ^ and now that actually gets the settings from hiera, which means it fixes the "paramiko needs to ssh to it for designate" [21:45:29] great! [21:47:12] twentyafterfour, so you're creating a new branch, where are you on https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Create_patches_to_update_wikiversions.json ? [21:47:35] Krenair: applying the security patches [21:47:54] I'm sure I could keep working around problems but I'd like to address the root problem [21:48:14] (03Abandoned) 10Dzahn: kafka: use role keyword [puppet] - 10https://gerrit.wikimedia.org/r/250068 (owner: 10Dzahn) [21:48:25] instead of spending more time fiddling with it when a single chmod could have already saved me an hour of hassles [21:49:02] (03CR) 10Dzahn: "thanks @JGreen" [puppet] - 10https://gerrit.wikimedia.org/r/249347 (owner: 10Dzahn) [21:49:07] (03Abandoned) 10Dzahn: kill misc/fundraising.pp, move to role logging [puppet] - 10https://gerrit.wikimedia.org/r/249347 (owner: 10Dzahn) [21:50:02] specifically "Check prior version (e.g. 1.24wmf3) for locally applied security patches and apply to new branch as needed" [21:50:15] (03Abandoned) 10Dzahn: palladium: add conftool::master to role keyword [puppet] - 10https://gerrit.wikimedia.org/r/250080 (owner: 10Dzahn) [21:50:19] yes. 
I can see you're in the middle of applying a patch [21:50:59] (03CR) 10Dzahn: [C: 04-1] Move base::firewall include in the kibana and logstash roles [puppet] - 10https://gerrit.wikimedia.org/r/251019 (owner: 10Muehlenhoff) [21:51:22] !log installed Zuul package on scandium.eqiad.wmnet . zuul-merger not running [21:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:51:35] (03CR) 10Dzahn: "can be abandoned now. tungsten has been re-added" [puppet] - 10https://gerrit.wikimedia.org/r/251129 (owner: 10Ori.livneh) [21:51:41] it needs a merge [21:51:45] If I copy my ~/.gitconfig to mwdeploy's home dir then `git am` will probably work as mwdeploy.. but I really don't understand why the deployment directory permissions were changed ... [21:51:47] (03Abandoned) 10Ori.livneh: Remove references to tungsten, which no longer exists [puppet] - 10https://gerrit.wikimedia.org/r/251129 (owner: 10Ori.livneh) [21:52:14] ori, can you fix the permissions please? [21:52:35] what's the issue? [21:52:54] (03PS2) 10Dzahn: Replaces duplicate myspell-de-ch with myspell-de-de in ores/base.pp [puppet] - 10https://gerrit.wikimedia.org/r/252282 (owner: 10Halfak) [21:53:00] (03CR) 10Dzahn: [C: 032] Replaces duplicate myspell-de-ch with myspell-de-de in ores/base.pp [puppet] - 10https://gerrit.wikimedia.org/r/252282 (owner: 10Halfak) [21:53:24] uhm... did the user sticky bit get set recently on /srv/mediawiki-staging? (tin) [21:53:35] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 8 below the confidence bounds [21:54:27] 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1797612 (10hashar) [21:55:30] Krenair: drwxrwsr-x 24 mwdeploy wikidev 4096 Nov 10 21:21 mediawiki-staging [21:55:34] where do you see the sticky bit? 
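To answer the "where do you see the sticky bit?" question from the mode string itself: in `drwxrwsr-x` the 's' sits in the group-execute slot (character 7), which is the setgid bit; a sticky bit would appear as 't' (or 'T') in character 10. A tiny decoder — the helper name `describe_mode` is illustrative:

```shell
# Decode an ls-style mode string: char 7 = setgid slot, char 10 = sticky slot.
describe_mode() {
  mode=$1
  case "$(echo "$mode" | cut -c7)"  in s|S) echo "setgid: yes" ;; *) echo "setgid: no" ;; esac
  case "$(echo "$mode" | cut -c10)" in t|T) echo "sticky: yes" ;; *) echo "sticky: no" ;; esac
}

describe_mode drwxrwsr-x   # the mediawiki-staging listing: setgid, not sticky
describe_mode drwxrwxrwt   # /tmp-style: sticky, not setgid
```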
[21:55:37] I don't [21:55:49] That's why I was confused [21:56:04] But somehow php-1.27.0-wmf.6 and everything under it has ended up owned by mwdeploy [21:56:59] (03CR) 10Dzahn: [C: 031] Add .pep8 exception for line length [puppet] - 10https://gerrit.wikimedia.org/r/251435 (owner: 10Yuvipanda) [21:57:25] mutante: that didn't actually work though :( [21:57:59] (03CR) 10Dzahn: "i don't know why they were removed or why they should be re-added" [puppet] - 10https://gerrit.wikimedia.org/r/250449 (owner: 10Paladox) [21:58:01] (03Abandoned) 10Yuvipanda: Add .pep8 exception for line length [puppet] - 10https://gerrit.wikimedia.org/r/251435 (owner: 10Yuvipanda) [21:58:07] YuviPanda: oh..ok [21:58:25] (03PS1) 10Andrew Bogott: Switch the partman recipe for labtestneutron. [puppet] - 10https://gerrit.wikimedia.org/r/252332 (https://phabricator.wikimedia.org/T117097) [21:58:47] (03PS2) 10Andrew Bogott: Switch the partman recipe for labtestneutron. [puppet] - 10https://gerrit.wikimedia.org/r/252332 (https://phabricator.wikimedia.org/T117097) [22:00:05] (03CR) 10Andrew Bogott: [C: 032] Switch the partman recipe for labtestneutron. [puppet] - 10https://gerrit.wikimedia.org/r/252332 (https://phabricator.wikimedia.org/T117097) (owner: 10Andrew Bogott) [22:00:21] (03PS1) 10Ori.livneh: Enable Redis Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/252333 [22:00:25] (03CR) 10Dzahn: "what Rush and Ori said. the idea seems right but it has been a while and the variable names .." [puppet] - 10https://gerrit.wikimedia.org/r/235778 (owner: 10Chad) [22:01:51] twentyafterfour, are you still changing things in there? 
[22:01:59] (03PS4) 10Hashar: contint: stop gerrit replication to gallium [puppet] - 10https://gerrit.wikimedia.org/r/244498 (https://phabricator.wikimedia.org/T86661) [22:02:03] (03CR) 10Dzahn: [C: 031] contint: stop gerrit replication to gallium [puppet] - 10https://gerrit.wikimedia.org/r/244498 (https://phabricator.wikimedia.org/T86661) (owner: 10Hashar) [22:02:28] Krenair: yes [22:02:32] ok [22:03:19] drwxrwsr-x <-- isn't the s in there the sticky bit? [22:04:02] (03CR) 10Dzahn: [C: 031] Allow same perms to tileratorui as tilerator [puppet] - 10https://gerrit.wikimedia.org/r/249501 (https://phabricator.wikimedia.org/T112914) (owner: 10Yurik) [22:05:11] (03PS2) 10Dzahn: contint: drop reference to pip.conf [puppet] - 10https://gerrit.wikimedia.org/r/250004 (owner: 10Hashar) [22:05:36] 6operations: Alert when used_memory gets too high for redis queues - https://phabricator.wikimedia.org/T118331#1797653 (10ori) [22:06:04] (03CR) 10Dzahn: [C: 032] contint: drop reference to pip.conf [puppet] - 10https://gerrit.wikimedia.org/r/250004 (owner: 10Hashar) [22:06:45] twentyafterfour, setgid? [22:07:40] (03PS2) 10Ori.livneh: Enable Redis Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/252333 [22:07:46] (03CR) 10Ori.livneh: [C: 032 V: 032] Enable Redis Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/252333 (owner: 10Ori.livneh) [22:10:44] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 9 below the confidence bounds [22:11:52] twentyafterfour: No, that's setgid. Sticky is t at the end [22:12:02] ok, here’s a dumb question: Given an installer shell on a server, how can I tell how many drives it has? [22:12:08] partman is baffled and the bios isn’t telling me much. 
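The "how many drives?" question can be answered from a minimal installer shell with nothing but /dev and dmesg: whole disks appear as bare sdX nodes, partitions as sdXN. A sketch — the `count_disks` helper is made up for illustration, and it only covers sd* (SCSI/SATA) naming:

```shell
# count_disks: given device names on stdin (as listed under /dev), count
# whole disks (sda, sdb, ...) while ignoring partitions (sda1, ...).
count_disks() { grep -c '^sd[a-z]$'; }

# The listing seen later in the channel -- sda sda1 sda2 sda3 -- is one disk:
printf 'sda\nsda1\nsda2\nsda3\n' | count_disks
# On the box itself:  ls /dev | count_disks
# Cross-check:        dmesg | grep -c 'Attached SCSI disk'
```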
[22:12:35] If a directory is setgid, then IIRC that means that files created inside it will be owned by the group that owns the directory, regardless of the creating user's group [22:12:44] (03PS1) 10Hashar: contint: setup zuul-merger on scandium.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/252336 (https://phabricator.wikimedia.org/T95046) [22:12:46] (03PS1) 10Hashar: contint: pool in zuul-merger on scandium [puppet] - 10https://gerrit.wikimedia.org/r/252337 (https://phabricator.wikimedia.org/T95046) [22:13:50] (03CR) 10Hashar: [C: 031] "This is a blocker to deploy zuul-merger on scandium.eqiad.wmnet (labs host) T95046" [puppet] - 10https://gerrit.wikimedia.org/r/244498 (https://phabricator.wikimedia.org/T86661) (owner: 10Hashar) [22:13:55] andrewbogott: try "dmidecode" [22:14:22] mutante: is that a standard tool? I have a minimal installer shell, not a lot of tools. [22:14:24] no vi for example [22:14:36] blkid? [22:14:44] not sure if that's in a minimal shell [22:14:54] you can also look at /dev maybe [22:15:14] dmesg [22:15:20] that too [22:15:26] blkid doesn’t say anything but returns 2 [22:15:26] (03CR) 10Hashar: [C: 04-1] "Do not merge/deploy until we have confirmed the zuul-merger process works properly." [puppet] - 10https://gerrit.wikimedia.org/r/252337 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [22:15:59] /dev # ls sda* [22:15:59] sda sda1 sda2 sda3 [22:16:05] RoanKattouw: then how did it get owned by a different user? :-/ [22:16:05] dmesg | grep "sd [22:16:11] I'm very confused [22:16:12] (03CR) 10Paladox: "I am not sure why they were removed but I use tags on metrolook." [puppet] - 10https://gerrit.wikimedia.org/r/250449 (owner: 10Paladox) [22:16:19] andrewbogott: so that's one disk + 3 partitions, I think? [22:16:31] twentyafterfour: That only applies to group ownership, not user ownership [22:16:36] andrewbogott: wait, is there nothing more for sd* [22:16:38] ? [22:16:53] (03CR) 10Paladox: "Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/251034 (https://phabricator.wikimedia.org/T117621) (owner: 10Paladox) [22:16:55] /dev # ls sd* [22:16:55] sda sda1 sda2 sda3 [22:17:02] so, nope [22:17:08] So if I'm catrope:wikidev, and there's a dir that's setgid and owned by apache:apache, the ownership of files I create in that dir will be catrope:apache [22:17:09] looks like 1 disk 3 partitions, could be wrong [22:17:12] just one disk then [22:17:29] hm, that’s what I thought. I picked a one-disk partman recipe but it still hates me [22:17:37] ah ha! someone changed the scripts to sudo, so it's not the sticky bit at all.. [22:17:39] wtf [22:17:46] oh lol [22:17:47] andrewbogott: dmesg | grep Attached [22:18:10] yep, looks like the answer is ‘one' [22:18:14] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 8 below the confidence bounds [22:18:30] 4e7f22c8 (Chad Horohoe 2015-11-05 12:14:15 -0800 220) // Do stuff as mwdeploy [22:18:33] Sticky is different from setgid BTW, but if sudo is your problem then you don't necessarily care [22:18:39] (03CR) 10Hashar: [C: 04-1] "Puppet compilation is https://puppet-compiler.wmflabs.org/1229/scandium.eqiad.wmnet/ pass but it still shows the gerrit replication bits " [puppet] - 10https://gerrit.wikimedia.org/r/252336 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [22:19:23] andrewbogott: no other server that already exists and is the same model? [22:19:25] RoanKattouw: isn't there a setuid bit? I forget the subtleties of sticky and setuid because I've resolved to never use sticky bit for anything [22:19:27] "< Krenair> But somehow php-1.27.0-wmf.6 and everything under it has ended up owned by mwdeploy" -- ostriches made that change to the createbranch scripts for the master-master scap sync [22:19:46] mutante: no idea. 
I don’t have a racktables login and I like it like that [22:19:52] bd808: it breaks lots of things [22:20:11] huh, actually the partman log looks just fine [22:20:12] twentyafterfour: Summary of the subtleties as I remember them: [22:20:20] twentyafterfour: did the group ownership end up wrong? [22:20:37] bd808: no, but some things can't be done with just group ownership [22:20:38] mutante: for what it’s worth, this is WMF3763 [22:20:45] setgid on a directory: if someone is allowed to create a file in this dir (i.e. has write rights), and they do, then make that file be owned by my group instead of that user's group [22:20:52] bd808: because the files don't have group write, for example [22:20:52] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/251327/ [22:21:38] sticky bit on a directory: anyone can create files in this dir, but people can only delete their own files, not others' files (normally, you can both create and delete files if you have write access to the dir) [22:21:39] ugh. lack of group write is a umask issue [22:21:44] sticky is typically used for /tmp [22:22:17] bd808: doesn't the git checkout force the permissions to match the repo? [22:22:23] andrewbogott: dell poweredge r410. others like that are elastic10xx, emery, hafnium, rhenium.. a whole bunch. try to use the same partman they use [22:22:35] setuid/gid on an executable: execute this with the rights of the user/group of the owner rather than the user/group of the person executing it (massively simplifying on this one but that's the idea) [22:22:43] mutante: we bought r410s with all different drive configs though [22:22:49] unless you’re already controlling for that? [22:23:05] no, i'm not. i dont think you can search for that in racktables [22:23:15] twentyafterfour: I'm not sure. 
I don't think it should generally unless the files were added with explicit permissions bits [22:23:24] mutante: ok, well, I’ll try your advice anyway — can’t hurt :) [22:23:45] andrewbogott: i'd ask on the procurement ticket for an example of another server with the same drive setup [22:24:11] oh! The problem could also be that I don’t know how to code. there’s a wildcard that matches my server /above/ the custom line that I added. I bet that won’t work! [22:24:27] but the sudo could be picking up a 0022 umask instead of 0002 [22:24:40] bd808: most likely yes [22:25:44] (03Abandoned) 10Dzahn: holmium: use role keyword for 2 classes [puppet] - 10https://gerrit.wikimedia.org/r/248099 (owner: 10Dzahn) [22:25:46] * twentyafterfour tries sudo -u mwdeploy chmod g+w /srv/mediawiki-staging/php-1.27.0-wmf.6 [22:27:06] (03PS1) 10Andrew Bogott: netboot: Explode the labtest* wildcard. [puppet] - 10https://gerrit.wikimedia.org/r/252341 (https://phabricator.wikimedia.org/T117097) [22:27:15] (03CR) 10Dzahn: "can only use the role keyword once, like _joe_ said" [puppet] - 10https://gerrit.wikimedia.org/r/246828 (owner: 10Faidon Liambotis) [22:27:23] (03PS2) 10Andrew Bogott: netboot: Explode the labtest* wildcard. [puppet] - 10https://gerrit.wikimedia.org/r/252341 (https://phabricator.wikimedia.org/T117097) [22:28:10] (03CR) 10Andrew Bogott: [C: 032] netboot: Explode the labtest* wildcard. [puppet] - 10https://gerrit.wikimedia.org/r/252341 (https://phabricator.wikimedia.org/T117097) (owner: 10Andrew Bogott) [22:28:42] (03CR) 10Dzahn: [C: 031] dataset: remove system::role from the dataset module [puppet] - 10https://gerrit.wikimedia.org/r/246827 (owner: 10Faidon Liambotis) [22:28:44] akosiaris, was it approved? 
https://gerrit.wikimedia.org/r/#/c/249501/ [22:29:25] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 8 below the confidence bounds [22:31:27] 6operations, 10Analytics, 6Analytics-Kanban, 6Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1797786 (10Ottomata) Just made a calendar event for Tuesday at 10:30 PST. Happy to move it if some other time is better. [22:32:27] (03CR) 10Dzahn: [C: 031] Support easy cloning of git repositories from Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/244715 (owner: 10Chad) [22:32:41] 6operations, 5Continuous-Integration-Scaling: Upload new Zuul packages on apt.wikimedia.org for Precise / Trusty / Jessie - https://phabricator.wikimedia.org/T118340#1797788 (10hashar) 3NEW a:3hashar [22:33:09] (03PS2) 10Dzahn: Add base::resolving::domain_search for iron [puppet] - 10https://gerrit.wikimedia.org/r/246679 (owner: 10Reedy) [22:33:43] (03CR) 10Dzahn: [C: 032] Add base::resolving::domain_search for iron [puppet] - 10https://gerrit.wikimedia.org/r/246679 (owner: 10Reedy) [22:33:55] 6operations: Alert when used_memory gets too high for redis queues - https://phabricator.wikimedia.org/T118331#1797810 (10ori) For rdb1007, redis indicates that used_memory is 33694135304 and that maxmemory is 41875931136, which means it's using up ~80% of the memory available to it. Since you asked for the thre... 
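The setgid and umask behaviour discussed above can be checked in any throwaway directory. A minimal sketch assuming GNU coreutils (`stat -c`), with nothing Wikimedia-specific; a full cross-group inheritance demo would need a second group, so this only shows the mode bits and the resulting file permissions:

```shell
# setgid on a directory shows up as a leading 2 in the octal mode.
d=$(mktemp -d)
chmod 2775 "$d"
stat -c '%a' "$d"            # prints 2775

# umask decides group write on new files:
# 0666 & ~0022 = 644 (no group write), 0666 & ~0002 = 664 (group write).
( cd "$d" && umask 0022 && touch no-gw )
( cd "$d" && umask 0002 && touch gw )
stat -c '%a %n' "$d/no-gw" "$d/gw"
```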
[22:34:37] (03CR) 10Dzahn: [C: 031] "never seen this used" [puppet] - 10https://gerrit.wikimedia.org/r/243888 (owner: 10Alexandros Kosiaris) [22:35:37] (03PS2) 10Dzahn: icinga: remove unused notify-by-epager commands [puppet] - 10https://gerrit.wikimedia.org/r/243888 (owner: 10Alexandros Kosiaris) [22:39:03] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [22:41:07] 6operations: Alert when used_memory gets too high for redis queues - https://phabricator.wikimedia.org/T118331#1797876 (10aaron) Yeah, the current state right now was page worthy, I just noticed it by accident with erik complaining about delayed jobs not being run. There should have been an SMS. [22:41:36] !log twentyafterfour@tin Started scap: testwiki to 1.27.0-wmf.6 [22:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:42:59] (03CR) 10Dzahn: [C: 032] icinga: remove unused notify-by-epager commands [puppet] - 10https://gerrit.wikimedia.org/r/243888 (owner: 10Alexandros Kosiaris) [22:43:19] !log twentyafterfour@tin scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="cawikibooks" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" --output="/tmp/tmp.sEpytaqlt1" ' returned non-zero exit status 1 (duration: 01m 42s) [22:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:44:19] twentyafterfour: if you run that command by hand you should be able to see what made it puke if its not already obvious [22:44:34] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 3 below the confidence bounds [22:44:49] (03CR) 10Dzahn: "added Ryan Lane for his opinion ..because yea, i also see this on every single puppet run on the deployment server(s)" [puppet] - 10https://gerrit.wikimedia.org/r/219372 (owner: 
10ArielGlenn) [22:47:17] (03CR) 10Dzahn: "is this waiting for something specific?" [puppet] - 10https://gerrit.wikimedia.org/r/230738 (owner: 10ArielGlenn) [22:47:40] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds [22:48:28] bd808: Extension /srv/mediawiki-staging/php-1.27.0-wmf.5/extensions/QuickSurveys/extension.json doesn't exist [22:48:44] aw crap [22:48:46] the extension doesn't exist in wmf.5 but does in wmf.6, why is this an error? [22:49:19] because we added it to the extension-list that is not versioned [22:49:27] there is something that can fix this... [22:49:40] (03CR) 10Dzahn: [C: 031] "has +1's and looks simple enough but is a few months old" [puppet] - 10https://gerrit.wikimedia.org/r/238432 (https://phabricator.wikimedia.org/T112644) (owner: 10Filippo Giunchedi) [22:49:43] why is the extension list not versioned, anyway? [22:50:09] *shrug* [22:50:15] is Reedy about? [22:50:23] I am [22:50:51] what's the trick for adding a new extension and keeping mergeMessageFileList.php from blowing up for the old version? [22:51:03] Add a version-specific extension-list [22:51:06] wmf.5 doesn't have the extension but wmf.6 does [22:51:38] Um [22:51:44] The code seems to have gone from CommonSettings.php [22:52:57] can you add a version to the older branch (even if its not used)? [22:53:02] that's not very convenient [22:53:11] (03CR) 10Dzahn: [C: 031] Tools: Unpuppetize host_aliases [puppet] - 10https://gerrit.wikimedia.org/r/241582 (https://phabricator.wikimedia.org/T109485) (owner: 10Tim Landscheidt) [22:53:12] (the code being gone) [22:53:23] p858snake: that would fix it too, yes [22:53:37] How is the deployment train going?
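The failure mode discussed above (a shared, unversioned extension-list naming a path that only exists in the newer branch) can be caught before a sync with a plain existence check. This is a hypothetical pre-flight sketch over a toy directory layout; the real staging tree lives under /srv/mediawiki-staging and mergeMessageFileList.php remains the authoritative check:

```shell
# Toy stand-in for the staging tree: an extension present in the new
# branch but missing from the old one. Branch names are illustrative.
staging=$(mktemp -d)
mkdir -p "$staging/php-1.27.0-wmf.5/extensions/OldExt" \
         "$staging/php-1.27.0-wmf.6/extensions/OldExt" \
         "$staging/php-1.27.0-wmf.6/extensions/QuickSurveys"
printf '$IP/extensions/OldExt\n$IP/extensions/QuickSurveys\n' \
  > "$staging/extension-list"

# Flag any extension-list entry missing from a deployed branch.
for branch in php-1.27.0-wmf.5 php-1.27.0-wmf.6; do
  while IFS= read -r entry; do
    path=${entry#\$IP/}                 # strip the $IP/ prefix
    [ -e "$staging/$branch/$path" ] || echo "missing in $branch: $path"
  done < "$staging/extension-list"
done
# prints: missing in php-1.27.0-wmf.5: extensions/QuickSurveys
```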
[22:53:42] Krinkle: not so well [22:53:45] https://github.com/wikimedia/operations-mediawiki-config/blob/d61a8145019a8b5d9363e7f6ad44c6cfceb03c46/wmf-config/CommonSettings-labs.php#L242-L244 [22:53:51] * Krinkle reads back [22:53:55] it was like that but extension-list-$wgVersion [22:54:11] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [22:54:16] (03CR) 10Dzahn: "@hashar happy to merge this at a time convenient for you" [puppet] - 10https://gerrit.wikimedia.org/r/244498 (https://phabricator.wikimedia.org/T86661) (owner: 10Hashar) [22:54:28] What's the betting it was removed as "unused"? lol [22:55:17] yes [22:55:17] https://github.com/wikimedia/operations-mediawiki-config/commit/affc5572f4f5e959cdded0c458766ad6aafa1331 [22:55:33] Which is somewhat fair. There's no need to check the disk on every request for something we know. [22:55:59] Mmm [22:56:06] Then we have things like this happen [22:57:01] So, without re-adding that code (temporarily or otherwise) [22:57:02] [22:52:57] can you add a version to the older branch (even if its not used)? [22:57:04] That's the fix [22:57:27] what does it mean to add a version to the older branch? [22:57:42] Add the new extension submodule to the older branch [22:57:53] add the wmf.6 submodule to the wmf.5 branch [22:57:55] that's pretty lame [22:57:58] yeah [22:57:59] yup [22:58:09] Krinkle: don't think that you can just use wmfRealms [22:58:26] getRealmSpecificFilename() provides datacenter variance :-/ [22:58:31] Noting, ori only just removed the code effectively [22:58:32] shouldn't the extension list be in the branch really? [22:59:12] Maybe [22:59:13] that would probably make more sense [22:59:17] If something populates it every time [22:59:22] hashar: I know. The code doesn't deny that. It just changes it to be more explicit. So instead of checking 5 files that don't exist for every 1 file. Just check the one directly.
So files that don't vary by DC don't need to check it. [22:59:30] then it could be generated when the initial clone is made on tin [22:59:41] hashar: there are other ways we could do this [22:59:48] by scanning skins and extensions [22:59:50] like having a per-realm config directory [22:59:53] How do we know where all the entry points are? [23:00:10] which has the full set of config files, either as symlinks to the generic file (if there is no need for realm-specific logic) [23:00:17] lets port mediawiki-config to hiera() ! [23:00:19] or an actual file [23:01:09] hashar: the biggest issue was https://github.com/wikimedia/operations-mediawiki-config/blob/master/multiversion/MWRealm.php#L15-L67 [23:01:16] from the doc block: "Files checked are: base-realm-datacenter.ext, base-realm.ext, base-datacenter.ext, base.ext " [23:01:20] twentyafterfour: are tin perms wonky now? [23:01:21] so, take mc.php -- in production that means: file_exists( 'mc-production-eqiad.php' ) -> false -> file_exists( 'mc-production.php' ) -> false -> file_exists( 'mc-eqiad.php' ) -> false -> file_exists( 'mc.php' ) -> true -> require mc.php [23:01:22] Reedy: the only place it would be hard to figure out is for an extension that has a strangely named entry point I think? [23:01:35] bd808: Yeaah, which shouldn't be the case now... It was mostly standardised [23:01:41] AaronSchulz: yes sorta [23:01:42] ori: yeah I am very well aware of that code base, worked on it with anomie ages ago. [23:01:43] we had dozens of these and they ran on every request [23:01:51] AaronSchulz: I manually fixed a bunch of stuff [23:01:52] rebase gives some perm errors [23:01:58] well aware of it sure, but maybe not well aware of the performance implications [23:01:59] There might be a few symlinks/require_once type files [23:02:02] ori: glad to see milliseconds are recovered by killing that feature :} [23:02:04] AaronSchulz: what are you rebasing?
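The fallback order ori quotes from the MWRealm.php doc block can be sketched as a small shell function. This is only an illustration of the search order (first existing candidate wins), not the actual getRealmSpecificFilename() implementation:

```shell
# Try base-realm-dc, base-realm, base-dc, then base, mirroring the
# doc-block order quoted above. Arguments: base name, realm, datacenter.
realm_file() {
  base=$1; realm=$2; dc=$3
  for f in "$base-$realm-$dc.php" "$base-$realm.php" \
           "$base-$dc.php" "$base.php"; do
    if [ -e "$f" ]; then printf '%s\n' "$f"; return 0; fi
  done
  return 1
}

cd "$(mktemp -d)"
touch mc.php
realm_file mc production eqiad   # prints mc.php after three misses
touch mc-eqiad.php
realm_file mc production eqiad   # now prints mc-eqiad.php
```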
[23:02:26] * ori -> afk [23:02:31] bd808: if extensionname/extensionname.php exist, use, else try extensionname/extension.json [23:02:33] php-1.27.0-wmf.6, after fetch, against origin, to get the change applied [23:02:33] etc [23:02:41] Hi, does anyone know why kmlexport isn't working? [23:02:43] Reedy: *nod* [23:03:04] please add me as a reviewer if you're planning on reintroducing something that checks for multiple filename patterns [23:03:22] i think there are ways to do that that would work ok [23:03:29] or to achieve the same goal, rather [23:03:36] twentyafterfour: the change in <> (redis aggregator) [23:03:47] off to sleep, have happy hacking [23:04:12] AaronSchulz: the train hasn't even been deployed yet because permissions issues slowed me down a bunch [23:04:30] ok, diff is also showing unstaged changes [23:04:31] twentyafterfour: do you want me to make the wmf.5 hack for the new extension or ??? [23:04:38] bd808: I'm just doing it [23:04:48] thx Reedy [23:05:06] I'm gonna use the .6 branch because cba branching [23:05:12] bd808: I'd like to actually fix the stupid architecture [23:05:13] It's not gonna be used [23:05:15] but oh well [23:05:26] https://gerrit.wikimedia.org/r/252346 [23:05:31] twentyafterfour: I'm for that too [23:06:26] maybe I'll just wait till this sorts out [23:07:17] AaronSchulz: probably not best to patch in the middle of the train deploy [23:07:19] twentyafterfour: the unstaged stuff involves some RequestHasSameOriginSecurity hook [23:07:38] bd808: heh, it mostly be really backed up then ;) [23:08:58] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/252346/ should get you moving again [23:10:42] AaronSchulz: can I just git rebase --abort? [23:13:15] twentyafterfour: no rebase is in progress [23:13:33] why all the unstaged stuff then? 
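The `git reset --hard` that twentyafterfour reaches for does exactly this: it discards unstaged working-tree edits, such as ones left behind by a failed rebase, and restores the committed content. A scratch-repo illustration:

```shell
# Show an unstaged modification, then wipe it with reset --hard.
repo=$(mktemp -d); cd "$repo"
git init -q .
echo original > f
git add f
git -c user.email=x@example.org -c user.name=x commit -qm 'add f'
echo stray-edit > f
git status --porcelain     # " M f": an unstaged modification
git reset --hard -q
cat f                      # back to "original"; the tree is clean
```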
(03CR) 10BryanDavis: "Makes deploying a new extension require backporting the extension code to the old branch (eg Iac881175cf4d0e9079aa81a5f6e6ac8a6cfa1b9b) an" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250170 (owner: 10Ori.livneh) [23:14:17] I guess I'm gonna git reset --hard, not sure what happened [23:14:27] yeah I already aborted, that's odd [23:14:32] mutante: partman success! Thanks for your help. [23:14:49] twentyafterfour: are you sure that won't lose the last sec patch or something? [23:15:05] (03CR) 10Ori.livneh: "@bd808: ack, makes sense." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250170 (owner: 10Ori.livneh) [23:15:56] Hi, does anyone know why kmlexport isn't working? [23:16:06] twentyafterfour: the rebase had both unlink perm errors and "unstaged changes" [23:16:07] AaronSchulz: the security patches are committed [23:16:13] bd808: .oO( some mutual exclusion mechanism is needed not just for sync-* (which uses lock files effectively) but for staging changes on tin ) [23:16:19] so if they were there to begin with, that would be a problem [23:16:37] or maybe just the perm errors created "unstaged changes" [23:16:39] ori: indeed [23:17:01] the latter is a subset of the former, so that might make sense [23:17:08] so yeah, could just be noise [23:17:16] ori: for sure. I think we have a ticket wishing for that [23:18:11] in the long term it needs to be a cross-master lock/notification too [23:18:29] so that tin and mira aren't messed with at the same time [23:18:55] anybody here familiar with trebuchet deployments? [23:19:03] SMalyshev: yeah [23:19:08] need help? [23:19:35] bd808: yes.
I have deployment broken on wdqs1001 (https://phabricator.wikimedia.org/T118148) for a while now [23:20:04] so the problem is that not only is the user mwdeploy but the group is mwdeploy, not wikidev [23:20:16] so permissions are still all messed up [23:20:22] so I need somebody to knows what's going on there to take a look [23:20:39] twentyafterfour: what ownership should i set for which path? [23:21:07] twentyafterfour: crap, yeah the sudo won't set the group we want. Probably easiest to roll back Chad's config change, nuke the clone and start over [23:21:34] RoanKattouw, ostriches, Krenair (not sure who's doing the deployment). Am I all set for SWAT deployment of this (https://gerrit.wikimedia.org/r/#q,251133,n,z) patch this evening? Haven't done one of these before so wanna make sure the calendar was the only thing I needed to update. [23:21:41] we had to add a special sudoers rule to support the group change for the cross-master rsync [23:22:08] group change? [23:22:19] I don't understand why any of this is necessary [23:22:42] SMalyshev: you need help from a root for those errors. usually that means the salt-minion service on the target hosts need to be restarted [23:22:42] mtimes suck. [23:22:48] ori: chgrp -R wikidev /srv/mediawiki-staging/php-1.27.0-wmf.6 [23:22:48] ^ that [23:23:02] bd808: do you know who could do this? [23:23:13] !log ran sudo chgrp -R wikidev /srv/mediawiki-staging/php-1.27.0-wmf.6 on tin [23:23:16] SMalyshev: looking into it [23:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:23:24] ori: thanks! [23:23:26] ori: thank you [23:23:45] bd808: why does it need to be mwdeploy ? [23:23:56] can we just make deployers members of that group? 
[23:24:12] jhobs: Yeah updating the wiki page is the only thing you need to do [23:24:20] twentyafterfour: mwdeploy because that is the user we use for the scap automation [23:24:23] cool thanks RoanKattouw [23:24:31] jhobs: That and saying "I'm here" in this channel when the bot pings you ~30 mins from now [23:24:38] ok [23:25:33] twentyafterfour: and that user is now running an rsync on mira to copy the state of /src/mediawiki-staging on each scap/sync-* [23:25:56] the rsync will try to set mtimes and that fails unless you are root or the owner of the directory [23:26:14] SMalyshev: I spent the first few minutes wondering why I couldn't SSH into these hosts until I realized I had transposed 'd' and 'q' and was trying to SSH into wqds1001 instead of wdqs1001 [23:26:17] so, progress! [23:26:27] heh :) [23:26:38] twentyafterfour: we should have kept you in the loop on this better as Chad, Alex and I worked on the feature [23:26:53] wdqs1001 seems to be alive and well, except for deployment part [23:27:22] SMalyshev: voila: [23:27:24] 2015-11-10 23:16:02,487 [salt.loaded.int.module.cmdmod][ERROR ] Command u'/usr/bin/git checkout --force --quiet tags/wdqs/wdqs-sync-20151110-061335' failed with return code: 128 [23:27:24] 2015-11-10 23:16:02,488 [salt.loaded.int.module.cmdmod][ERROR ] output: error: object file .git/objects/3d/04163672ea211cd6dc65cc0672265cb28d130e is empty [23:27:24] error: object file .git/objects/3d/04163672ea211cd6dc65cc0672265cb28d130e is empty [23:27:26] fatal: loose object 3d04163672ea211cd6dc65cc0672265cb28d130e (stored in .git/objects/3d/04163672ea211cd6dc65cc0672265cb28d130e) is corrupt [23:27:54] AaronSchulz: your patch will get deployed with the train [23:27:56] i think the idea with trebuchet is that you are supposed to intuit such conditions telepathically [23:28:05] and fix them the same way [23:28:16] heh. 
you are supposed to have root [23:28:43] ok I get that something is wrong but I have no idea what is wrong except "git is broken" [23:28:54] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: deployment broken on wdqs1001 - https://phabricator.wikimedia.org/T118148#1798026 (10ori) ``` [wdqs1001:/var/log] $ sudo cat /var/log/salt/minion 2015-11-10 23:16:02,487 [salt.loaded.int.module.cmdmod][ERROR ] Command u'/usr/bin/git checkout --f... [23:29:05] the clone got messed up somehow on that host [23:29:38] I've seen it happen a few times before (not many) [23:29:44] Good grief! [23:29:51] i'll nuke it so it gets recreated and we'll hope it won't happen again. but this ought to be a trebuchet bug, or a me-too on one of the existing patches [23:30:06] RodHullandEmu: kmlexport! [23:30:22] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: puppet fail [23:30:34] RodHullandEmu: I don't know what that is, let alone why it isn't working. But if you are not getting an answer, file a task in Phab! https://phabricator.wikimedia.org/maniphest/task/create/ [23:30:45] *one of the existing bugs [23:30:49] is it this? -- https://www.mediawiki.org/wiki/Extension:KML_Export [23:31:07] I was wondering if it was maps related [23:31:27] MaxSem: ^ [23:31:38] Ah. Probably https://tools.wmflabs.org/kmlexport/?project=en [23:31:45] somebody's tool [23:32:26] oh [23:32:46] RodHullandEmu: "If you're pretty sure this shouldn't be an error, you may wish to notify the tool's maintainers (above) about the error and how you ended up here." [23:32:56] the ext's not ours [23:32:58] in this case https://wikitech.wikimedia.org/wiki/User_talk:Para [23:33:00] It exports geo-coordinates to mapping sites but is up and down like the Assyrian empire. [23:33:18] RodHullandEmu: isn't the Assyrian empire pretty persistently down these days? [23:33:19] PROBLEM - Host labtestneutron2001 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:11] bd808: can we just ignore the mtime problems? 
or make mwdeploy be a member of wikidev group? or even make deployers use the mwdeploy group instead of wikidev? [23:34:19] yeah now I see git checkout is all messed up there... that's why it doesn't work [23:34:24] why do I keep getting logged off gerrit? [23:34:43] RodHullandEmu: You might be able to get some help in #wikimedia-labs. valhallasw`cloud, YuviPanda or andrewbogott in that channel could at least check to see if the errors make sense (bad code) or are a system problem [23:34:56] MaxSem: I keep destroying your session just to frustrate you ;-) [23:35:15] twentyafterfour: I couldn't figure out how to sort the mtime issues (not a big deal) from more important rsync failures [23:35:19] bd808: I'll try that, thanks. The tool's owner isn't around much [23:35:30] * MaxSem threatens to undeploy ostriches [23:35:50] RECOVERY - Host labtestneutron2001 is UP: PING OK - Packet loss = 0%, RTA = 35.45 ms [23:36:10] MaxSem: If you take away my deployment rights I can't be held responsible anymore ;-) [23:36:29] twentyafterfour: making the deploy user's group and the deployer's group match is not awesome for privilege separation [23:36:58] which is why what we did on the rsync sudo call was a bit of a hack [23:37:05] bd808: If we were all the same user we'd never have these sorts of problems :p [23:37:13] so true [23:37:45] united we stand, divided we fall [23:37:50] if we weren't all sharing the mediawiki-staging directory a lot would be better [23:38:02] preach brother [23:38:11] * ostriches gives twentyafterfour the staging directory [23:38:14] There, it's yours! [23:38:17] SMalyshev: i'm still on it btw [23:38:28] ori: cool, thanks! [23:39:28] ori: what's the protocol in this case? there are some non-git files there also that puppet produces. [23:39:54] !log twentyafterfour@tin Started scap: trying again: sync 1.27.0-wmf.6 [23:39:54] nuke the dir, force a puppet run, cross fingers [23:39:55] bblack, ori: yt?
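The breakage ori found on wdqs1001 (an empty loose object) is straightforward to detect: zero-byte files under `.git/objects`, plus a `git fsck` that fails. The snippet below simulates the damage in a scratch repo; the recovery used in the log (delete the clone and let puppet and the deploy tool re-create it) is the practical fix, since a truncated object cannot be rebuilt locally:

```shell
# Simulate and detect a truncated loose object in a scratch repository.
repo=$(mktemp -d); cd "$repo"
git init -q .
git -c user.email=x@example.org -c user.name=x \
    commit -q --allow-empty -m 'initial'
obj=$(find .git/objects -type f | head -n 1)
: > "$obj"                         # truncate it, as happened on wdqs1001
find .git/objects -type f -empty   # lists the damaged object
git fsck 2>&1 | head -n 2          # reports the empty/corrupt object
```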
[23:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:40:09] nuria: sorta, sup? [23:40:10] !log twentyafterfour@tin scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="cawikibooks" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" --output="/tmp/tmp.yIUW9abUwO" ' returned non-zero exit status 1 (duration: 00m 15s) [23:40:19] Again? [23:40:29] ori, bblack: wanted to share that we were looking at our last-access cookie unique calculations and the nocookie header on varnish has helped quite a bit to narrow down numbers for daily usage [23:40:33] SMalyshev: I'll update the task with the list of commands I ran [23:40:36] You did pull and checkout the submodule? [23:40:47] nuria: annnnnnd ? :) [23:40:47] ori: thanks! [23:41:03] !log twentyafterfour@tin Started scap: grr: sync 1.27.0-wmf.6 [23:41:05] ori, bblack: we need to have numbers for 1 month in order to see if it works for those just as well [23:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:41:13] !log twentyafterfour@tin scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="cawikibooks" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" --output="/tmp/tmp.7rNhS6PIRi" ' returned non-zero exit status 1 (duration: 00m 09s) [23:41:25] twentyafterfour: /srv/mediawiki-staging/php-1.27.0-wmf.5/extensions/QuickSurveys is empty [23:41:57] ori, bblack : we can probably share VERY preliminary numbers in couple weeks. [23:42:16] cool!
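Populating an empty submodule checkout like the QuickSurveys one comes down to `git submodule update --init <path>`. A self-contained reproduction with throwaway repos (the `protocol.file.allow` override is only needed on newer git releases that restrict file-protocol submodules; older git ignores the unknown key):

```shell
# A fresh clone of a superproject leaves the submodule directory empty
# until `git submodule update --init` runs.
work=$(mktemp -d); cd "$work"
export GIT_AUTHOR_NAME=x GIT_AUTHOR_EMAIL=x@example.org
export GIT_COMMITTER_NAME=x GIT_COMMITTER_EMAIL=x@example.org
git init -q lib
git -C lib commit -q --allow-empty -m 'lib init'
git init -q app
cd app
git -c protocol.file.allow=always submodule add -q "$work/lib" lib
git commit -qm 'add lib submodule'
cd "$work"
git clone -q app app-clone
cd app-clone
git -c protocol.file.allow=always submodule update --init lib
test -e lib/.git                 # the submodule is now checked out
```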
[23:42:21] git submodule --init extensions/QuickSurveys [23:42:55] 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1798052 (10Andrew) [23:42:57] !log twentyafterfour@tin Started scap: grr: sync 1.27.0-wmf.6 [23:43:11] should work now :-/ [23:43:21] 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1766532 (10Andrew) All the boxes now have an OS installed and puppet and salt signed and running. [23:43:46] twentyafterfour: sorry you are having such a long day with this [23:43:55] I remember how stressful it can be [23:44:02] bd808: thanks, it's mostly just tiresome [23:44:05] (03PS1) 10MaxSem: Switch www.wikimedia.beta.wmflabs.org to Git [puppet] - 10https://gerrit.wikimedia.org/r/252355 (https://phabricator.wikimedia.org/T118009) [23:44:06] bd808: I seem to recall an 8 hour deploy and fallout before now :( [23:44:36] Reedy: heh. or me waiting 140 minutes for a scap to finish [23:44:51] with the old wall of unreadable output [23:45:00] this is going on 5 hours so far but much of it is my fault for not following what everyone else is up to closely enough (and not realizing the source of my permission woes) [23:45:07] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: deployment broken on wdqs1001 - https://phabricator.wikimedia.org/T118148#1798058 (10ori) 5Open>3Resolved a:3ori First, I restarted `salt-minion`. This may or may not have been necessary. I also could have done it more concisely with 'restar... [23:45:11] SMalyshev: ^ [23:46:11] ori: thanks!
it looks fine now [23:47:09] ori: thank you very much for handling it [23:47:20] no problem [23:47:21] andrewbogott: great :) [23:51:52] (03PS1) 10BryanDavis: Monolog: wrap channel handlers in a WhatFailureGroupHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252359 (https://phabricator.wikimedia.org/T118057) [23:52:10] ori: what do you think about doing that ^ for the log exceptions? [23:52:16] (03CR) 10jenkins-bot: [V: 04-1] Monolog: wrap channel handlers in a WhatFailureGroupHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252359 (https://phabricator.wikimedia.org/T118057) (owner: 10BryanDavis) [23:52:21] -1! [23:52:22] :D [23:52:28] i'll look in a bit, going to the gym [23:52:30] jenkins is the worst [23:52:48] * ori has his own editor engagement trend to reverse [23:53:19] some jerk wrote logging unit tests ;) [23:53:31] I'm not sure I know how to run them locally... [23:56:39] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures