[02:03:10] (03CR) 10Ori.livneh: [C: 032] HHVM: Enable translation cache garbage-collection on canary app servers [puppet] - 10https://gerrit.wikimedia.org/r/277061 (https://phabricator.wikimedia.org/T277061) (owner: 10Ori.livneh) [02:18:45] PROBLEM - Apache HTTP on mw1118 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.489 second response time [02:20:15] PROBLEM - Apache HTTP on mw1117 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.012 second response time [02:20:25] RECOVERY - Apache HTTP on mw1118 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 0.117 second response time [02:21:56] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.056 second response time [02:24:05] PROBLEM - HHVM rendering on mw1115 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.006 second response time [02:24:14] PROBLEM - HHVM rendering on mw1118 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.006 second response time [02:25:46] RECOVERY - HHVM rendering on mw1115 is OK: HTTP OK: HTTP/1.1 200 OK - 71883 bytes in 0.279 second response time [02:27:44] RECOVERY - HHVM rendering on mw1118 is OK: HTTP OK: HTTP/1.1 200 OK - 71891 bytes in 0.514 second response time [02:28:45] PROBLEM - Apache HTTP on mw1117 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.006 second response time [02:29:04] PROBLEM - Apache HTTP on mw1118 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.008 second response time [02:29:35] PROBLEM - Apache HTTP on mw1116 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.017 second response time [02:30:13] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.16) (duration: 13m 23s) [02:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:30:35] RECOVERY - Apache HTTP on 
mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.062 second response time [02:31:26] PROBLEM - HHVM rendering on mw1116 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.026 second response time [02:33:06] RECOVERY - HHVM rendering on mw1116 is OK: HTTP OK: HTTP/1.1 200 OK - 71885 bytes in 1.180 second response time [02:33:14] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.060 second response time [02:34:05] PROBLEM - Apache HTTP on mw1023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50395 bytes in 0.170 second response time [02:34:15] RECOVERY - Apache HTTP on mw1118 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 0.650 second response time [02:35:45] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.042 second response time [02:37:54] PROBLEM - HHVM rendering on mw1119 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.367 second response time [02:38:54] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Mar 14 02:38:54 UTC 2016 (duration 8m 41s) [02:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:39:04] PROBLEM - HHVM rendering on mw1117 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.107 second response time [02:39:06] PROBLEM - Apache HTTP on mw1117 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.028 second response time [02:39:35] RECOVERY - HHVM rendering on mw1119 is OK: HTTP OK: HTTP/1.1 200 OK - 71903 bytes in 0.464 second response time [02:39:54] PROBLEM - HHVM rendering on mw1024 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.175 second response time [02:40:16] PROBLEM - HHVM rendering on mw1025 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.056 second 
response time [02:40:45] RECOVERY - HHVM rendering on mw1117 is OK: HTTP OK: HTTP/1.1 200 OK - 71903 bytes in 0.411 second response time [02:40:54] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.056 second response time [02:41:35] RECOVERY - HHVM rendering on mw1024 is OK: HTTP OK: HTTP/1.1 200 OK - 71905 bytes in 1.176 second response time [02:41:45] PROBLEM - Apache HTTP on mw1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50395 bytes in 0.856 second response time [02:42:35] PROBLEM - Apache HTTP on mw1022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50395 bytes in 0.933 second response time [02:42:44] PROBLEM - Apache HTTP on mw1118 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.004 second response time [02:42:56] PROBLEM - HHVM rendering on mw1115 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.014 second response time [02:43:05] PROBLEM - Apache HTTP on mw1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.004 second response time [02:43:24] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.043 second response time [02:43:36] RECOVERY - HHVM rendering on mw1025 is OK: HTTP OK: HTTP/1.1 200 OK - 71903 bytes in 0.325 second response time [02:44:15] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 0.664 second response time [02:44:44] RECOVERY - HHVM rendering on mw1115 is OK: HTTP OK: HTTP/1.1 200 OK - 71905 bytes in 1.171 second response time [02:44:45] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.044 second response time [02:45:24] well, so much for that. 
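One rough way to quantify the flapping above is to pair each PROBLEM notice with the next RECOVERY for the same check and host. The sketch below is only an illustration under stated assumptions: the regular expression matches the alert shape seen in this log and nothing else, `outage_durations` is a hypothetical helper (not an Icinga tool), and the sample lines are shortened copies of four notices from the start of the log.

```python
import re
from datetime import datetime, timedelta

# Matches alert lines of the shape seen in this log, e.g.
# "[02:18:45] PROBLEM - Apache HTTP on mw1118 is CRITICAL: ..."
ALERT_RE = re.compile(
    r"\[(\d{2}:\d{2}:\d{2})\] (PROBLEM|RECOVERY) - (.+?) on (\S+) is "
)


def outage_durations(lines):
    """Yield (check, host, seconds) for each PROBLEM/RECOVERY pair."""
    open_alerts = {}  # (check, host) -> time the PROBLEM was first seen
    for line in lines:
        match = ALERT_RE.search(line)
        if not match:
            continue
        stamp, kind, check, host = match.groups()
        when = datetime.strptime(stamp, "%H:%M:%S")
        key = (check, host)
        if kind == "PROBLEM":
            # Keep the earliest PROBLEM if the alert repeats before recovering.
            open_alerts.setdefault(key, when)
        elif key in open_alerts:
            delta = when - open_alerts.pop(key)
            if delta < timedelta(0):
                # The log carries no dates; assume a single midnight rollover.
                delta += timedelta(days=1)
            yield check, host, int(delta.total_seconds())


sample = [
    "[02:18:45] PROBLEM - Apache HTTP on mw1118 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable",
    "[02:20:15] PROBLEM - Apache HTTP on mw1117 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable",
    "[02:20:25] RECOVERY - Apache HTTP on mw1118 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently",
    "[02:21:56] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently",
]

results = list(outage_durations(sample))
for check, host, seconds in results:
    print(f"{check} on {host}: {seconds}s")
```

Keying on the (check, host) pair keeps "Apache HTTP" and "HHVM rendering" alerts for the same server separate, which matters here because both checks flap on the same canary hosts.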
[02:45:41] (03PS1) 10Ori.livneh: Revert "HHVM: Enable translation cache garbage-collection on canary app servers" [puppet] - 10https://gerrit.wikimedia.org/r/277186 [02:45:53] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "HHVM: Enable translation cache garbage-collection on canary app servers" [puppet] - 10https://gerrit.wikimedia.org/r/277186 (owner: 10Ori.livneh) [02:46:05] RECOVERY - Apache HTTP on mw1118 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.061 second response time [02:49:25] PROBLEM - Apache HTTP on mw1118 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.013 second response time [02:51:06] RECOVERY - Apache HTTP on mw1118 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.085 second response time [02:57:55] PROBLEM - Apache HTTP on mw1118 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.004 second response time [02:59:35] RECOVERY - Apache HTTP on mw1118 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.059 second response time [03:04:05] PROBLEM - HHVM rendering on mw1025 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50395 bytes in 0.508 second response time [03:05:54] RECOVERY - HHVM rendering on mw1025 is OK: HTTP OK: HTTP/1.1 200 OK - 71911 bytes in 0.294 second response time [03:11:54] PROBLEM - Apache HTTP on mw1025 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.007 second response time [03:13:35] RECOVERY - Apache HTTP on mw1025 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.058 second response time [03:20:52] 6Operations, 15User-mobrovac, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Create a service location / discovery system for locating local/master resources easily across all WMF applications - https://phabricator.wikimedia.org/T125069#2117051 (10GWicke) Here is what Kubernetes has to say about service disc... 
[05:19:01] !log Restarting HHVM on mw1025 with known-bad config option 'hhvm.enable_reusable_tc = true' for debugging [05:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:22:54] PROBLEM - Apache HTTP on mw1025 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.032 second response time [05:24:35] RECOVERY - Apache HTTP on mw1025 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.055 second response time [06:12:24] (03PS1) 10Madhuvishy: [WIP] ifttt: Set up Wikimedia IFTTT channel service using puppet on labs [puppet] - 10https://gerrit.wikimedia.org/r/277189 [06:30:34] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:35] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:35] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:27] <_joe_> ori: oh it's known-bad? [06:31:38] <_joe_> :( [06:32:04] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:05] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:11] <_joe_> yeah I see the history of alerts [06:33:44] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:54] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:51:10] (03PS2) 10Madhuvishy: [WIP] ifttt: Set up Wikimedia IFTTT channel service using puppet on labs [puppet] - 10https://gerrit.wikimedia.org/r/277189 [06:52:39] (03PS3) 10Madhuvishy: [WIP] ifttt: Set up Wikimedia IFTTT channel service using puppet on labs [puppet] - 10https://gerrit.wikimedia.org/r/277189 [06:53:22] 6Operations, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, 6Services, and 3 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2117214 (10Arrbee) [06:56:04] RECOVERY - puppet last run on 
einsteinium is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:57:25] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:57:26] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [06:57:34] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:57:35] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:57:55] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:55] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:15:05] _joe_: yep. jemalloc in the backtrace [07:44:57] _joe_: yes, it segfaults almost immediately [07:45:12] in different places every time, too [07:45:42] <_joe_> ori_: just not to let you get bored :) [07:47:22] I don't think I'm going to debug it -- it would take me a lot of time, and I'm not sure the payoff would be worth it [07:48:54] it enables reclamation of memory occupied by dead translations in the translation cache [07:52:11] <_joe_> yeah I remember we were pretty excited about that feature, but we can live without it [07:52:37] well, there are two benefits that i can think of [07:53:20] one is that memory usage would be stable, indefinitely in principle, even as code is updated by successive deployments [07:55:14] but it is effectively stable for us, as best as I can tell. 
I don't see evidence of OOMs on the Apaches [07:57:27] the other is that it may allow us to switch interface messages from CDB to static PHP arrays [07:57:36] (PS1) Muehlenhoff: Two additional CVE IDs fixed [debs/linux] - https://gerrit.wikimedia.org/r/277193 [07:57:50] when I last tried it, it worked until the next branch deployment [07:57:53] Operations, ops-eqiad: db1053 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T129829#2117255 (jcrespo) [07:59:04] because interface messages occupy more than two gigs [07:59:49] ACKNOWLEDGEMENT - RAID on db1053 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Jcrespo https://phabricator.wikimedia.org/T129829 [07:59:53] <_joe_> heh [07:59:54] and HHVM did not have a mechanism for garbage-collecting static strings from source files that no longer exist or have been updated [08:00:27] <_joe_> the only way to really move away from cdbs could be repoauth [08:00:41] <_joe_> but well, I have more urgent fish to fry now [08:00:55] <_joe_> we [08:01:09] yeah, it's too much work for speculative gains [08:01:25] <_joe_> re currently enqueueing more than 420K jobs/min vs 60K before this crisis [08:01:49] <_joe_> so it's getting worse, not better [08:02:05] PROBLEM - puppet last run on wtp2004 is CRITICAL: CRITICAL: puppet fail [08:02:30] I can't make anything of those numbers, because they never mean what I think they mean [08:02:39] <_joe_> yeah, exactly [08:02:44] there are recycled jobs and dupe jobs [08:03:08] <_joe_> can we say the whole thing is overcomplicated compared to what it should do? [08:03:27] <_joe_> I know that's the result of feature additions over the years [08:03:46] not really; the features have stayed the same [08:03:51] namely, there aren't any [08:04:24] it has been stretched to work well with a growing workload of greater variety [08:04:58] s2 continues to have 25% more traffic than enwiki [08:05:02] <_joe_> heh [08:05:13] <_joe_> jynus: general traffic or jobqueue-related? 
[08:05:57] but it's very hard to understand and reason about [08:07:11] <_joe_> jynus: is that going to be a blocker for the switchover? [08:07:28] <_joe_> ori: it's a bit appalling we're not able to understand the origin of all those jobs [08:07:53] (CR) Muehlenhoff: [C: 2 V: 2] Two additional CVE IDs fixed [debs/linux] - https://gerrit.wikimedia.org/r/277193 (owner: Muehlenhoff) [08:08:01] has anyone tried? [08:08:08] <_joe_> about recycled jobs, I proposed to kill jobchron for the moment but aaron said the number of recycled jobs is negligible [08:08:32] <_joe_> ori: I assumed someone did; I was at a conference thursday and friday [08:08:49] maybe, Revision::fetchFromConds 127.0.0.1, the user localhost probably means jobqueue [08:09:12] not terbium? [08:09:50] it could be, too [08:10:08] I have that and the empty string [08:10:12] _joe_: I think several of us did, but AFAIK no one went debugging in the direction of where the jobs are coming from [08:10:27] it is not regular user queries [08:10:59] if they make up the bulk of query traffic, could you figure out where they're coming from with tcpdump? [08:11:05] <_joe_> I doubt it's terbium https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=terbium.eqiad.wmnet&m=cpu_report&r=week&s=by%20name&hc=4&mc=2&st=1457943033&g=network_report&z=large&c=Miscellaneous%20eqiad [08:11:16] COMMIT /* JobRunner::executeJob */ [08:11:42] right, that's two knocks on that idea [08:12:05] <_joe_> ori: it's very late for you to investigate this [08:12:25] I'm enjoying this, in a very sick sort of way [08:13:03] I do not need it, I have enough profiling now on mysql [08:13:08] on s2 [08:14:34] mw1162.eqiad.wmnet, mw1164.eqiad.wmnet, ... 
[08:14:46] yeah, job runners [08:15:00] there are two angles to take: browse graphite and try to determine which job spiked and when [08:15:19] stick a call to log a backtrace in the code that enqueues jobs [08:15:45] most are "getting" revisions, the ones that are not are links-update related [08:16:08] o/ [08:16:20] hi elukey [08:16:42] but not to a single wiki, enwiktionary, ptwiki, plwiki, svwiki [08:17:10] <_joe_> ori: 97% of enqueued jobs now are refreshlinks [08:17:31] <_joe_> jynus: they're all on s2, right? [08:17:45] <_joe_> or is it shard-independent? [08:18:46] I do not have a direct link to the job queue, but approximately after Aaron fixed things, s2 has 2-3 times the number of reads [08:18:51] yeah and I guess Aaron was right, dedup / superseded / recycled / abandoned are negligent [08:18:58] negligible [08:19:01] not negligent :) [08:19:24] also, 20K QPS on the s2 master [08:19:27] <_joe_> ori: yup looking at graphite it's clear we're having a huge rise in refreshlinks jobs [08:20:01] <_joe_> it was 20% of the jobs last monday, now it's 97% [08:20:14] <_joe_> and it seems to be failing [08:20:15] yeah they're real jobs [08:20:22] <_joe_> at a certain rate [08:20:41] <_joe_> so either it's something with s2 causing refreshlink jobs to be spawned when they shouldn't be [08:21:04] RECOVERY - puppet last run on wtp2004 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [08:21:22] <_joe_> like some queries returning bogus results and causing any edit to spur a ton of refreshlinks jobs [08:21:48] <_joe_> or it's some bug in mediawiki and then I'd expect the rise to be evenly distributed [08:22:27] <_joe_> ori: how can I see the job queue size by wiki? there is a mweval command for it right? 
[08:23:15] This is s2-master: https://phabricator.wikimedia.org/T129830 [08:23:38] <_joe_> sorry, brb [08:23:57] _joe_: mwscript extensions/WikimediaMaintenance/getJobQueueLengths.php [08:24:27] <_joe_> ori: ok thanks [08:27:49] <_joe_> and yeah, it seems to be concentrated on a few wikis [08:28:47] <_joe_> ori: last bit of knowledge I'll ask you: where in the code is a refreshlinks job enqueued and why? [08:29:25] <_joe_> I guess the answer is "all over the place" [08:30:47] <_joe_> and no, it's not specific to s2, but a few of the affected wikis are there [08:32:01] <_joe_> including itwiki and enwiktionary [08:34:46] <_joe_> jynus: how is s3 doing? [08:35:10] chilling out [08:35:29] <_joe_> uhm, it holds another of the big offenders [08:35:35] technically the increase is proportionally as big [08:36:10] but in absolute numbers it is almost unnoticeable [08:36:58] only the master has more traffic [08:44:23] s3-master: https://phabricator.wikimedia.org/T129830#2117325 [08:44:40] Operations, MediaWiki-JobQueue, Patch-For-Review: Job queue is growing and growing - https://phabricator.wikimedia.org/T129517#2117329 (Joe) This is not resolved at all; the main issue being we're enqueueing jobs at almost 10 times the rate that we normally see. Enqueued jobs that have raised are the... [08:47:00] commons traffic increase is similar in time, but unlike s2, barely noticeable [08:47:03] Operations, MediaWiki-JobQueue, Patch-For-Review: The refreshLinks jobs enqueue rate is 10 times the normal rate - https://phabricator.wikimedia.org/T129517#2107949 (Joe) p:Normal>Unbreak! 
[08:51:47] it could be that s2 traffic only happens to be more visible because it has more "capacity" [08:52:17] <_joe_> s2 has several of the offending wiki, vs just one in s3 and s4 [08:53:19] I have to move to another semi-UBN for labs, but I will be around [08:58:40] 6Operations, 10MediaWiki-JobQueue, 13Patch-For-Review: The refreshLinks jobs enqueue rate is 10 times the normal rate - https://phabricator.wikimedia.org/T129517#2117338 (10Joe) [09:15:55] 6Operations, 10MediaWiki-JobQueue, 13Patch-For-Review: The refreshLinks jobs enqueue rate is 10 times the normal rate - https://phabricator.wikimedia.org/T129517#2117356 (10Joe) The analysis of enabled extensions is not suggesting me any immediate correlation with the behaviour we're seeing. [09:17:26] 6Operations, 10MediaWiki-JobQueue, 13Patch-For-Review: The refreshLinks jobs enqueue rate is 10 times the normal rate - https://phabricator.wikimedia.org/T129517#2117372 (10Joe) The fact most of those refreshLinks jobs are bogus is confirmed by the fact that the edge cache purge rate [[https://graphite.wikim... 
[09:28:43] (03PS6) 10Giuseppe Lavagetto: mediawiki::maintenance: add codfw host, multidc support [puppet] - 10https://gerrit.wikimedia.org/r/275837 (https://phabricator.wikimedia.org/T126987) [09:34:08] (03PS1) 10Muehlenhoff: Add ferm rules for redis access on maps cluster [puppet] - 10https://gerrit.wikimedia.org/r/277197 [09:34:10] (03PS1) 10Muehlenhoff: Add ferm rules for postgres/maps [puppet] - 10https://gerrit.wikimedia.org/r/277198 [09:38:47] (03PS1) 10Jcrespo: Add mysqlbinlog shortcut [puppet] - 10https://gerrit.wikimedia.org/r/277199 [09:46:48] !log depool restbase1001 / restbase1002 before deprovisioning [09:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:48:03] (03CR) 10Jcrespo: [C: 032] Add mysqlbinlog shortcut [puppet] - 10https://gerrit.wikimedia.org/r/277199 (owner: 10Jcrespo) [09:54:35] PROBLEM - DPKG on auth1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:56:15] RECOVERY - DPKG on auth1001 is OK: All packages OK [09:56:30] ^ auth1001 is harmless, the icinga check came in the middle of a package install [10:02:23] mobrovac: btw restbase can be deployed to restbase1010 / restbase1011 whenever [10:02:29] still not in pybal of course [10:03:30] godog: nice! i guess we can put it in pybal as soon as we do the rb deploy there? [10:03:41] mobrovac: yup [10:05:39] I'm deprovisioning restbase1001 / restbase1002 ATM [10:05:51] kk, lemme stop rb there [10:06:10] rb1002 was my favourite node, i'll have to find a new fav [10:06:47] !log restbase stopping restbase on rb100[12] before full deprovisioning [10:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:06:59] hehe thanks [10:07:10] godog: {{done}} [10:07:25] 6Operations, 10Parsoid, 10Traffic, 10VisualEditor, and 2 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#2117474 (10KartikMistry) [10:07:59] (03CR) 10Mobrovac: "{{done}}. 
restbase100[12] have also been depooled and currently in the process of being deprovisioned." [puppet] - 10https://gerrit.wikimedia.org/r/276728 (owner: 10Mobrovac) [10:08:14] (03PS8) 10Ema: Port varnishlog to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/274946 (https://phabricator.wikimedia.org/T128788) [10:09:33] (03CR) 10jenkins-bot: [V: 04-1] Port varnishlog to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/274946 (https://phabricator.wikimedia.org/T128788) (owner: 10Ema) [10:11:58] godog: fyi, https://gerrit.wikimedia.org/r/276728 is GTG [10:12:27] (03PS1) 10Filippo Giunchedi: cassandra: deprovision restbase100[12] [puppet] - 10https://gerrit.wikimedia.org/r/277206 (https://phabricator.wikimedia.org/T128107) [10:12:55] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [10:13:19] (03PS5) 10Muehlenhoff: Add systemd unit for logstash [puppet] - 10https://gerrit.wikimedia.org/r/274696 (https://phabricator.wikimedia.org/T126677) [10:13:26] !log restbase deployed 3bedb8f on restbase101[01].eqiad.wmnet [10:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:13:30] godog: also, ^^ [10:13:35] RECOVERY - Restbase root url on restbase1011 is OK: HTTP OK: HTTP/1.1 200 - 15253 bytes in 0.015 second response time [10:13:36] (03PS2) 10Filippo Giunchedi: RESTBase: Remove restbase100[12] from the lists of seeds [puppet] - 10https://gerrit.wikimedia.org/r/276728 (owner: 10Mobrovac) [10:13:45] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] RESTBase: Remove restbase100[12] from the lists of seeds [puppet] - 10https://gerrit.wikimedia.org/r/276728 (owner: 10Mobrovac) [10:13:48] nice, thanks mobrovac [10:14:06] thnx 2 u godog! 
[10:19:38] (03CR) 10Muehlenhoff: [C: 031] "Now running on deployment-logstash2" [puppet] - 10https://gerrit.wikimedia.org/r/274696 (https://phabricator.wikimedia.org/T126677) (owner: 10Muehlenhoff) [10:19:50] (03CR) 10Muehlenhoff: Add systemd unit for logstash [puppet] - 10https://gerrit.wikimedia.org/r/274696 (https://phabricator.wikimedia.org/T126677) (owner: 10Muehlenhoff) [10:21:33] (03PS1) 10Filippo Giunchedi: deprovision restbase100[12] [dns] - 10https://gerrit.wikimedia.org/r/277208 (https://phabricator.wikimedia.org/T128107) [10:21:56] (03PS2) 10Filippo Giunchedi: cassandra: deprovision restbase100[12] [puppet] - 10https://gerrit.wikimedia.org/r/277206 (https://phabricator.wikimedia.org/T128107) [10:22:06] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: deprovision restbase100[12] [puppet] - 10https://gerrit.wikimedia.org/r/277206 (https://phabricator.wikimedia.org/T128107) (owner: 10Filippo Giunchedi) [10:22:09] godog: fyi, i'm forcing puppet on rb*, will restart rb soon-ish there [10:23:22] mobrovac: ok, restbase1001 is halted fyi [10:23:28] kk thnx [10:27:48] !log restbase restarting restbase in prod to apply https://gerrit.wikimedia.org/r/276728 [10:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:29:53] mobrovac: I'll stop rb again on 1002 as it is going to error out anyway [10:30:11] godog: what do you mean "again" ? it's up? 
[10:30:45] it is [10:30:46] hmmm [10:31:10] godog: stopped it [10:31:41] should have masked or disabled it i guess [10:31:54] yeah puppet started it [10:31:56] Mar 14 10:19:32 restbase1002 puppet-agent[37320]: (/Stage[main]/Restbase/Service::Node[restbase]/Base::Service_unit[restbase]/Service[restbase]/ensure) ensure changed 'stopped' to 'running' [10:32:00] Mar 14 10:19:32 restbase1002 puppet-agent[37320]: (/Stage[main]/Restbase/Service::Node[restbase]/Base::Service_unit[restbase]/Service[restbase]) Unscheduling refresh on Service[restbase] [10:32:12] stinky puppet [10:32:13] haha [10:37:39] !log installing security updates for openssl, curl, gcrypt, libpng, jasper, expat and libxml2 on mw2* (along with HHVM restarts) [10:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:38:54] !log shut restbase100[12] [10:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:40:42] (03Abandoned) 10Filippo Giunchedi: varnish: leave only one upload backend for swift [puppet] - 10https://gerrit.wikimedia.org/r/276523 (owner: 10Filippo Giunchedi) [10:42:19] (03PS1) 10Filippo Giunchedi: conftool: swap restbase100[12] with restbase101[01] [puppet] - 10https://gerrit.wikimedia.org/r/277210 (https://phabricator.wikimedia.org/T128107) [10:43:01] If anyone is not busy with getting the job queue under control, I'd like some more eyes on https://gerrit.wikimedia.org/r/#/c/274711/. If anyone has a few minutes for a review [10:43:44] (03PS8) 10Elukey: First draft for the Varnish 4 porting. 
[software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) [10:44:19] (03PS2) 10Filippo Giunchedi: deprovision restbase100[12] [dns] - 10https://gerrit.wikimedia.org/r/277208 (https://phabricator.wikimedia.org/T128107) [10:44:44] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] deprovision restbase100[12] [dns] - 10https://gerrit.wikimedia.org/r/277208 (https://phabricator.wikimedia.org/T128107) (owner: 10Filippo Giunchedi) [10:47:21] 6Operations, 10RESTBase, 13Patch-For-Review: install restbase1010-restbase1015 - https://phabricator.wikimedia.org/T128107#2117510 (10fgiunchedi) Chris, restbase100[12] have been halted and they can be repurposed modulo SSDs which we'll need for other restbase machines in this ticket, let me know when ssds h... [10:47:51] (03PS6) 10Gehel: Expose elasticsearch through HTTP [puppet] - 10https://gerrit.wikimedia.org/r/274711 (https://phabricator.wikimedia.org/T124444) [10:47:56] (03CR) 10Elukey: "After checking "top" on mediawiki-vagrant I noticed 100% CPU utilization even when vk was not doing anything. I added 0.1s of sleep time b" [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [10:49:56] gehel: if you have ran the puppet compiler already on some hosts do you have the results? 
[10:50:41] godog: running it right now [10:51:29] (03CR) 10Gehel: "Puppet compiler results: https://integration.wikimedia.org/ci/view/operations/job/operations-puppet-catalog-compiler/2045/" [puppet] - 10https://gerrit.wikimedia.org/r/274711 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [10:55:29] (03PS7) 10Gehel: Expose elasticsearch through HTTP [puppet] - 10https://gerrit.wikimedia.org/r/274711 (https://phabricator.wikimedia.org/T124444) [10:56:19] !log installing rsync security updates [10:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:57:19] (03CR) 10Gehel: "Re-ran puppet-compiler: https://puppet-compiler.wmflabs.org/2046/" [puppet] - 10https://gerrit.wikimedia.org/r/274711 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [10:59:35] (03CR) 10Filippo Giunchedi: [C: 031] Expose elasticsearch through HTTP [puppet] - 10https://gerrit.wikimedia.org/r/274711 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [11:01:20] (03PS1) 10ArielGlenn: limit find to type file for stats log cleanup [puppet] - 10https://gerrit.wikimedia.org/r/277214 [11:04:11] 6Operations, 10MobileFrontend, 10Traffic, 3Reading-Web-Sprint-67-If, Then, Else...?, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2117541 (10matmarex) So…... [11:05:03] 6Operations, 13Patch-For-Review: Reinstall redis servers (Job queues) with Jessie (NOTE: rdb1002 is special and is excluded!) 
- https://phabricator.wikimedia.org/T123675#2117544 (10elukey) [11:05:05] 6Operations, 13Patch-For-Review: Sudden increase in NOTICE events from hhvm while trying to de-pool rdb1003 for maintenance - https://phabricator.wikimedia.org/T128730#2117543 (10elukey) 5Open>3Resolved [11:15:02] (03PS9) 10Ema: Port varnishlog to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/274946 (https://phabricator.wikimedia.org/T128788) [11:15:03] 6Operations, 6Analytics-Kanban, 10Traffic, 13Patch-For-Review: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2117559 (10elukey) Status update: * Tested the code in mediawiki-vagrant and discovered some bugs, fixed in the latest patchset. Now varnishkafk... [11:22:17] gehel: a question re: ssl and elasticsearch since I think I missed it, how did you do with the certs and the fact that the hosts are contacted via search.svc.eqiad.wmnet ? [11:28:55] godog: I'm planning on re-generating the puppet certs with a SAN=search.svc.eqiad.wmnet (https://wikitech.wikimedia.org/wiki/Puppet#SANs_for_puppet_certs). Good point, I could already add the appropriate hiera config for that. [11:29:38] godog: at the moment, mediawiki accesses elasticsearch via IP (probably a micro optimization), but I do not want to change that before we have HTTP pooling in place. 
[11:31:32] gehel: nice, thanks, yea the SAN is easy enough, I don't remember why address vs dns name tho [11:33:31] jynus: foundationwiki_p has replag [11:34:32] nope: https://tools.wmflabs.org/replag/ [11:34:44] now, if you miss records [11:35:05] you should check labs mail or topic [11:35:55] Operations, Traffic: restrict upload cache access for private wikis - https://phabricator.wikimedia.org/T129839#2117602 (fgiunchedi) [11:37:04] (PS2) Mobrovac: Mathoid: enable PNG generation [puppet] - https://gerrit.wikimedia.org/r/276734 (https://phabricator.wikimedia.org/T71702) [11:37:39] (CR) Mobrovac: Mathoid: enable PNG generation (1 comment) [puppet] - https://gerrit.wikimedia.org/r/276734 (https://phabricator.wikimedia.org/T71702) (owner: Mobrovac) [11:38:27] Operations, Commons, MediaWiki-Page-deletion, media-storage: MWException trying to delete a certain file on Commons - https://phabricator.wikimedia.org/T129637#2117620 (fgiunchedi) Open>Resolved a:fgiunchedi [11:38:42] MariaDB [foundationwiki_p]> SELECT UNIX_TIMESTAMP() - UNIX_TIMESTAMP(MAX(rc_timestamp)) FROM recentchanges; [11:38:42] +------------------------------------------------------+ [11:38:42] | UNIX_TIMESTAMP() - UNIX_TIMESTAMP(MAX(rc_timestamp)) | [11:38:42] +------------------------------------------------------+ [11:38:42] | 28832.000000 | [11:38:43] +------------------------------------------------------+ [11:38:43] 1 row in set (0.01 sec) [11:39:11] s1 has lags. ok. 
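For scale, the 28832-second figure in the query result above works out to roughly eight hours. A minimal sketch of the conversion follows; `format_lag` is a hypothetical helper name. Note that the query measures time since the newest `recentchanges` row, which is only a proxy for replication lag on a wiki that is edited rarely.

```python
def format_lag(seconds: float) -> str:
    """Render a lag given in seconds as hours, minutes and seconds."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes}m {secs}s"


# Value copied from the recentchanges query result in the log.
lag = format_lag(28832.000000)
print(lag)  # 8h 0m 32s
```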
[11:40:32] https://phabricator.wikimedia.org/P2757 [11:41:17] Steinsplitter^ [11:41:53] please stick to labs for labs issues [11:41:58] or non-issues [11:42:22] (PS1) ArielGlenn: fix up argument for syslog reload [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/277220 [11:42:25] PROBLEM - puppet last run on mw1095 is CRITICAL: CRITICAL: Puppet has 1 failures [11:43:07] pt-heartbeat has very few issues [11:43:25] that is what is being used for lag measurement both on labs and in production [11:43:38] (in fact, the setup is shared) [11:44:36] ah :) [11:45:30] as for enwiki, I can tell you exactly why it is happening [11:46:09] which is someone combining myisam tables and production tables in a subquery, blocking replication [11:46:43] the policy is to do as few of those as possible [11:47:07] !log installing security updates for openssl, curl, gcrypt, libpng, jasper, expat and libxml2 on remaining mw1* app servers (along with HHVM restarts) [11:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:59:14] (PS2) Filippo Giunchedi: conftool: swap restbase100[12] with restbase101[01] [puppet] - https://gerrit.wikimedia.org/r/277210 (https://phabricator.wikimedia.org/T128107) [11:59:16] (PS1) Filippo Giunchedi: swift: remove object POST stats from grafana dashboard [puppet] - https://gerrit.wikimedia.org/r/277223 [12:05:06] Operations, MediaWiki-JobQueue, Patch-For-Review: The refreshLinks jobs enqueue rate is 10 times the normal rate - https://phabricator.wikimedia.org/T129517#2117648 (Joe) Other findings: * Terbium ran refreshLink cron jobs between Mar 1st and Mar 8th; so this is not caused by those cronjobs * The job... 
[12:09:25] RECOVERY - puppet last run on mw1095 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:10:14] 6Operations, 10MediaWiki-JobQueue, 13Patch-For-Review: The refreshLinks jobs enqueue rate is 10 times the normal rate - https://phabricator.wikimedia.org/T129517#2117655 (10Joe) As a backpressure measure, I proposed to move the refreshLinks Jobs to a separate, dedicated queue, as it has been done for htmlCac... [12:16:15] 6Operations, 10MediaWiki-JobQueue, 13Patch-For-Review: The refreshLinks jobs enqueue rate is 10 times the normal rate - https://phabricator.wikimedia.org/T129517#2107949 (10mobrovac) >>! In T129517#2117655, @Joe wrote: > As a backpressure measure, I proposed to move the refreshLinks Jobs to a separate, dedic... [12:18:22] 6Operations: Many minions fail to connect to salt master since 10:39 - https://phabricator.wikimedia.org/T129841#2117668 (10MoritzMuehlenhoff) [12:20:40] 6Operations: Many minions fail to connect to salt master since 10:39 - https://phabricator.wikimedia.org/T129841#2117686 (10MoritzMuehlenhoff) Restarting the salt minion allows the minion to reconnect [12:26:32] 6Operations: Many minions fail to connect to salt master since 10:39 - https://phabricator.wikimedia.org/T129841#2117727 (10ArielGlenn) p:5Triage>3Normal a:3ArielGlenn [12:34:43] 6Operations: Many minions fail to connect to salt master since 10:39 - https://phabricator.wikimedia.org/T129841#2117738 (10ArielGlenn) For right now I'm doing a restart on all the broken ones except the snapshot hosts (leaving a few so I can poke at them later). 
[12:37:00] (PS1) Reedy: Disable OAI extension [mediawiki-config] - https://gerrit.wikimedia.org/r/277229 (https://phabricator.wikimedia.org/T70867) [12:37:22] (PS2) Reedy: Disable OAI extension [mediawiki-config] - https://gerrit.wikimedia.org/r/277229 (https://phabricator.wikimedia.org/T70867) [12:39:15] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [5000000.0] [12:41:32] (CR) Mobrovac: [C: 1] conftool: swap restbase100[12] with restbase101[01] [puppet] - https://gerrit.wikimedia.org/r/277210 (https://phabricator.wikimedia.org/T128107) (owner: Filippo Giunchedi) [12:53:36] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [12:57:58] !log removed kafka1001 from eventbus' pool via confd [12:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:07:31] (PS3) Filippo Giunchedi: conftool: swap restbase100[12] with restbase101[01] [puppet] - https://gerrit.wikimedia.org/r/277210 (https://phabricator.wikimedia.org/T128107) [13:07:33] (PS2) Filippo Giunchedi: swift: remove object POST stats from grafana dashboard [puppet] - https://gerrit.wikimedia.org/r/277223 [13:08:52] (PS8) Gehel: Expose elasticsearch through HTTP [puppet] - https://gerrit.wikimedia.org/r/274711 (https://phabricator.wikimedia.org/T124444) [13:10:05] (CR) Filippo Giunchedi: [C: 2 V: 2] swift: remove object POST stats from grafana dashboard [puppet] - https://gerrit.wikimedia.org/r/277223 (owner: Filippo Giunchedi) [13:10:57] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: puppet fail [13:13:13] !log re-added kafka1001 to the eventbus confd pool after maintenance [13:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:14:27] !log removed kafka1002.eqiad.wmnet from eventbus' pool via confd for maintenance [13:14:31] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:16:41] !log re-added kafka1002 to the eventbus confd pool after maintenance [13:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:19:52] 6Operations, 10MediaWiki-JobQueue, 13Patch-For-Review: The refreshLinks jobs enqueue rate is 10 times the normal rate - https://phabricator.wikimedia.org/T129517#2117805 (10hashar) I have added a graph in Grafana dashboard for the job queue rate that shows up the per hour addition/removal of jobs https://gra... [13:21:11] (03PS9) 10Gehel: Expose elasticsearch through HTTP [puppet] - 10https://gerrit.wikimedia.org/r/274711 (https://phabricator.wikimedia.org/T124444) [13:21:44] 6Operations, 10MediaWiki-JobQueue, 13Patch-For-Review: The refreshLinks jobs enqueue rate is 10 times the normal rate - https://phabricator.wikimedia.org/T129517#2117806 (10PeterBowman) I launched a `forcerecursivelinkupdate` purge API action on 03/10 at 02:40 UTC affecting all pages on plwiktionary, a small... [13:24:23] 6Operations, 7LDAP: Review list of LDAP groups and document exactly what kind of access they can be allowed to provide - https://phabricator.wikimedia.org/T129788#2117810 (10Ottomata) There aren’t any special LDAP groups for Analytics, but hue.wikimedia.org does authenticate logins via LDAP (shell username and... 
[13:26:43] (03CR) 10Ottomata: [C: 031] limit find to type file for stats log cleanup [puppet] - 10https://gerrit.wikimedia.org/r/277214 (owner: 10ArielGlenn) [13:29:51] (03PS2) 10ArielGlenn: limit find to type file for stats log cleanup [puppet] - 10https://gerrit.wikimedia.org/r/277214 [13:31:47] (03CR) 10ArielGlenn: [C: 032] limit find to type file for stats log cleanup [puppet] - 10https://gerrit.wikimedia.org/r/277214 (owner: 10ArielGlenn) [13:37:46] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [13:50:45] PROBLEM - DPKG on mx1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:51:14] PROBLEM - Disk space on mx2001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/scan is not accessible: Permission denied [13:54:45] RECOVERY - Disk space on mx2001 is OK: DISK OK [13:55:47] 6Operations, 10Salt: Many minions fail to connect to salt master since 10:39 - https://phabricator.wikimedia.org/T129841#2117872 (10ArielGlenn) [13:56:40] 6Operations, 10Salt: Many minions fail to connect to salt master since 10:39 - https://phabricator.wikimedia.org/T129841#2117668 (10ArielGlenn) The restart job is done, leaving only these as non-responsive (snapshot hosts left out deliberately for investigation): ms-fe1004.eqiad.wmnet: bast2001.wikimedia.org:... 
[13:58:24] PROBLEM - DPKG on mendelevium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:58:25] PROBLEM - puppet last run on mx1001 is CRITICAL: CRITICAL: Puppet has 2 failures [13:59:44] RECOVERY - DPKG on mx1001 is OK: All packages OK [14:00:25] RECOVERY - puppet last run on mx1001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [14:01:33] (03PS4) 10Filippo Giunchedi: conftool: swap restbase100[12] with restbase101[01] [puppet] - 10https://gerrit.wikimedia.org/r/277210 (https://phabricator.wikimedia.org/T128107) [14:01:39] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] conftool: swap restbase100[12] with restbase101[01] [puppet] - 10https://gerrit.wikimedia.org/r/277210 (https://phabricator.wikimedia.org/T128107) (owner: 10Filippo Giunchedi) [14:02:05] RECOVERY - DPKG on mendelevium is OK: All packages OK [14:05:24] PROBLEM - Host restbase1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:24] PROBLEM - Host restbase1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:25] PROBLEM - Disk space on mendelevium is CRITICAL: DISK CRITICAL - /var/spool/exim4/scan is not accessible: Permission denied [14:06:39] 6Operations: conftool-merge should report which node is setting attributes for - https://phabricator.wikimedia.org/T129847#2117886 (10fgiunchedi) [14:16:29] !log pool restbase101[01] [14:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:17:24] (03CR) 10Elukey: "Side note: I found references of mw2090 also in:" [puppet] - 10https://gerrit.wikimedia.org/r/275837 (https://phabricator.wikimedia.org/T126987) (owner: 10Giuseppe Lavagetto) [14:17:26] mobrovac: ^ [14:17:54] RECOVERY - Disk space on mendelevium is OK: DISK OK [14:17:57] godog: \o/ [14:27:50] (03CR) 10Giuseppe Lavagetto: "right, we should remove it from conftool as well, while mediawiki needs to be installed there indeed" [puppet] - 10https://gerrit.wikimedia.org/r/275837 (https://phabricator.wikimedia.org/T126987) 
(owner: 10Giuseppe Lavagetto) [14:27:53] !log pool restbase1009 [14:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:37:19] !log re-imaging mw2090.codfw for T126987 [14:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:49:55] 6Operations, 7Documentation, 7LDAP: Review list of LDAP groups and document exactly what kind of access they can be allowed to provide - https://phabricator.wikimedia.org/T129788#2118051 (10Krenair) [14:50:27] <_joe_> elukey: don't run puppet on the machine though [14:50:51] <_joe_> elukey: too late? [14:51:04] _joe_ yep yep I know, I'll just accept keys :) [14:51:15] and make sure that puppet is disabled [14:51:27] <_joe_> tthanks :) [14:52:44] (03PS7) 10Giuseppe Lavagetto: mediawiki::maintenance: add codfw host, multidc support [puppet] - 10https://gerrit.wikimedia.org/r/275837 (https://phabricator.wikimedia.org/T126987) [14:55:32] (03CR) 10Giuseppe Lavagetto: "Noop on terbium" [puppet] - 10https://gerrit.wikimedia.org/r/275837 (https://phabricator.wikimedia.org/T126987) (owner: 10Giuseppe Lavagetto) [14:55:42] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::maintenance: add codfw host, multidc support [puppet] - 10https://gerrit.wikimedia.org/r/275837 (https://phabricator.wikimedia.org/T126987) (owner: 10Giuseppe Lavagetto) [14:56:40] 6Operations, 6Editing-Department, 6Parsing-Team, 6Services: Services team goals April - June 2016 (Q4 2015/16) - https://phabricator.wikimedia.org/T118871#2118126 (10mobrovac) [15:00:04] anomie ostriches thcipriani marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160314T1500). [15:00:04] matt_flaschen kart_: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [15:00:13] Present [15:00:45] I can SWAT. 
[15:00:54] * kart_ too present [15:01:12] hey [15:01:18] we might have one more patch [15:01:31] okie doke. [15:02:52] jouncebot: why the weird spacing around parentheses [15:03:48] Nemo_bis, I think it just pulls that from the page [15:03:53] Nemo_bis: because on the wiki page it's a line break not a space between them [15:03:54] might not recognise the newline [15:04:15] (03PS10) 10Gehel: Expose elasticsearch through HTTP [puppet] - 10https://gerrit.wikimedia.org/r/274711 (https://phabricator.wikimedia.org/T124444) [15:05:14] Hi. [15:05:54] Dereckson: hello [15:06:19] kart_: l10n updates? :( [15:10:36] thcipriani: sadly. [15:10:47] full scap needed. [15:10:56] kart_: ack. yeah. [15:11:04] Go ahead with other patches first please. [15:12:12] thcipriani: You'll need a core pull-through of the VE extension change, BTW. :-( [15:13:16] James_F: what does that mean? (it's early :)) [15:13:48] thcipriani: Normally when you +2 a patch in wmf/… for extension/Foo, there's an automatic commit made in mediawiki/core's wmf/… for it. [15:14:05] thcipriani: But that magic doesn't work for extension/VisualEditor. [15:14:18] So it needs to be made manually. [15:14:22] ah, gotcha. [15:14:36] Everyone always forgets because it's just VE that's affected. [15:16:06] 6Operations, 10MediaWiki-JobQueue, 13Patch-For-Review: The refreshLinks jobs enqueue rate is 10 times the normal rate - https://phabricator.wikimedia.org/T129517#2118411 (10hashar) I have been looking at refreshLink job on `frwiki`. It has been triggered by an edit on the page `Module:Suivi_des_biographies/d... [15:16:41] !log thcipriani@tin Synchronized php-1.27.0-wmf.16/extensions/Echo: SWAT: thank-you-edit: canRender for deleted page and extra fix [[gerrit:276916]] (duration: 00m 39s) [15:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:16:47] How do we monitor expiration of SSL certificates? 
As I'm adding SSL to elasticsearch, I'd like to make sure we won't have issues in the long term... [15:16:48] ^ matt_flaschen check please [15:17:07] just got a mw2090.codfw.wmnet returned [255]: Permission denied [15:17:21] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2118412 (10Gehel) 'm probably missing something obvious, but there is something I do not understand about our hiera hierarchy as part of ch... [15:17:21] is that a known? [15:17:59] thcipriani: elukey was reimaging it, so yeah [15:18:02] gehel: icinga polls for cert expiration, for externally-visible https anyway [15:18:31] godog: in the case of elasticsearch, it is internal only. Should I do anything about it? [15:19:37] thcipriani: yes I am reimaging it to be the codfw terbium (T126987) [15:19:48] elukey: ack, thanks. [15:21:13] gehel: not sure, what's the expiration? [15:22:23] godog: I'd need to check (I still need to regenerate the certs). I'm reusing the puppet certs, so whatever expiration we put on puppet certs. And we'll have puppet broken if those certs expire... [15:22:45] thcipriani, in progress. I have to wait to receive the notification then if that works (it should) delete the page to try to break it. [15:23:44] Also, I'm having issues understanding our hiera hierarchy. If anyone has time to have a look at https://phabricator.wikimedia.org/T124444#2118412 [15:28:06] !log rebooting nobelium for kernel update [15:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:28:47] Actually I can't fully test it yet, but it should still fix it for anyone that already has the notification. [15:29:57] James_F: blerg. trying to figure out how to push this to gerrit without sending up security patches :( [15:30:13] thcipriani: Want me to make one on my machine? [15:30:36] James_F: yeah, if you could. 
I'm currently doing a bunch of local checkouts—that'd take some time. [15:30:38] Kk. [15:33:19] (03PS1) 10Andrew Bogott: Add designateenv.sh script and refactor novaenv.sh classes [puppet] - 10https://gerrit.wikimedia.org/r/277255 [15:33:33] thcipriani: https://gerrit.wikimedia.org/r/277256 [15:34:23] James_F: thanks! [15:34:39] (I have a text file with the pre-built commands and commit summaries because it happens fairly often. :-)) [15:35:01] fancy :) [15:37:05] 7Puppet, 6Revision-Scoring-As-A-Service, 10ores, 13Patch-For-Review: Fix puppet webservice name to uwsgi-ores-web - https://phabricator.wikimedia.org/T124621#2118508 (10Halfak) @Ladsgroup, it doesn't look like this is done. What's the status? [15:37:13] (03CR) 10Ottomata: "Hm, this is different than Faidon's suggestion here:" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/277220 (owner: 10ArielGlenn) [15:37:48] 6Operations, 10Traffic: varnishkafka logrotate cronspam - https://phabricator.wikimedia.org/T129344#2102820 (10Ottomata) Ariel submitted this https://gerrit.wikimedia.org/r/#/c/277220/ , but perhaps did not know about this ticket [15:38:06] 6Operations, 10MediaWiki-JobQueue, 13Patch-For-Review: The refreshLinks jobs enqueue rate is 10 times the normal rate - https://phabricator.wikimedia.org/T129517#2118521 (10Gehel) [15:41:30] is there any way to check the CPU usage of bohrium over time? It looks like it's usually under heavy use, but I don't have historical context. [15:41:50] (03CR) 10ArielGlenn: "his looks like the right permanent fix. this just addresses the usage message, so short-term fix." [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/277220 (owner: 10ArielGlenn) [15:41:55] milimetric: ganglia? 
[15:42:23] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&c=Miscellaneous+eqiad&h=bohrium.eqiad.wmnet&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [15:42:32] https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&c=Miscellaneous+eqiad&h=bohrium.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=small&metric_group=NOGROUPS [15:42:42] Jumped up on the 10th it looks like [15:42:46] oh cool, thanks! [15:42:57] yeah, that makes sense, new site deployed [15:42:58] (03PS10) 10Ema: Port varnishlog to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/274946 (https://phabricator.wikimedia.org/T128788) [15:43:42] thcipriani: scap'ng? [15:43:53] kart_: not yet :( [15:44:19] (03CR) 10jenkins-bot: [V: 04-1] Port varnishlog to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/274946 (https://phabricator.wikimedia.org/T128788) (owner: 10Ema) [15:45:25] 6Operations, 7Puppet, 10Beta-Cluster-Infrastructure, 10Traffic, 13Patch-For-Review: Fix puppet on deployment-cache* hosts in beta cluster - https://phabricator.wikimedia.org/T129270#2118591 (10greg) [15:46:40] (03PS1) 10Muehlenhoff: Add ferm rules for DNS auth servers [puppet] - 10https://gerrit.wikimedia.org/r/277258 [15:46:46] kart_: waiting on a core change, then sync file, then start scap. deployment calendar is fairly empty after morning SWAT, this will run over time. [15:51:18] thcipriani: thanks! [15:52:14] (03PS11) 10Ema: Port varnishlog to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/274946 (https://phabricator.wikimedia.org/T128788) [15:53:05] James_F: the only update in the submodule bump is to extensions/VisualEditor/lib/ve/src/ce/nodes/ve.ce.TableNode.js correct? [15:53:13] thcipriani: Yes. 
[15:53:57] (03PS1) 10Cmjohnson: Adding dns entries for new snapshot hosts (1005-1007) [dns] - 10https://gerrit.wikimedia.org/r/277259 [15:54:22] <_joe_> !log repooling mw1128 [15:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:00:10] !log thcipriani@tin Synchronized php-1.27.0-wmf.16/extensions/VisualEditor/lib/ve/src/ce/nodes/ve.ce.TableNode.js: SWAT: Update VE core submodule to wmf/1.27.0-wmf.16 HEAD [[gerrit:277183]] (duration: 00m 30s) [16:00:12] ^ James_F check please! [16:00:13] (03CR) 10Elukey: "FYI: https://phabricator.wikimedia.org/T129344" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/277220 (owner: 10ArielGlenn) [16:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:01:02] (03Abandoned) 10Alex Monk: horizon: Add dynamicproxy IPs to config [puppet] - 10https://gerrit.wikimedia.org/r/276893 (https://phabricator.wikimedia.org/T129245) (owner: 10Alex Monk) [16:01:49] thcipriani: Hmm. [16:01:53] <_joe_> !log repooling mw1141 [16:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:02:34] thcipriani: Doesn't quite seem fixed. Flushing cache. [16:02:40] (03CR) 10Alex Monk: Add designateenv.sh script and refactor novaenv.sh classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/277255 (owner: 10Andrew Bogott) [16:03:55] thcipriani: Yup, all good. [16:04:01] James_F: cool, thanks! [16:04:51] !log thcipriani@tin Started scap: Better announce new optional MT services available [[gerrit:277195]] [16:04:54] ^ kart_ [16:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:05:08] cool. 
[16:05:17] 6Operations, 10MobileFrontend, 10Traffic, 3Reading-Web-Sprint-69-k, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2118655 (10dr0ptp4kt) [16:06:52] 6Operations, 10MobileFrontend, 10Traffic, 3Reading-Web-Sprint-68-j, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2069794 (10dr0ptp4kt) [16:07:54] (03CR) 10Andrew Bogott: Add designateenv.sh script and refactor novaenv.sh classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/277255 (owner: 10Andrew Bogott) [16:13:11] (03CR) 10Alex Monk: "I think we should rename the script to something else, since this won't be a designate-wide admin (like novaadmin is of nova), just own th" [puppet] - 10https://gerrit.wikimedia.org/r/277255 (owner: 10Andrew Bogott) [16:13:36] 6Operations, 10ops-codfw: labstore2003-labstore2004 onsite setup taks - https://phabricator.wikimedia.org/T128764#2118716 (10Papaul) [16:14:26] (03CR) 10Andrew Bogott: "Yes, that's true... what name would you like? wmflabsorgenv.sh?" 
[puppet] - 10https://gerrit.wikimedia.org/r/277255 (owner: 10Andrew Bogott) [16:15:06] (03CR) 10Alex Monk: "wmflabsorg-domainadminenv.sh" [puppet] - 10https://gerrit.wikimedia.org/r/277255 (owner: 10Andrew Bogott) [16:16:56] (03PS2) 10Andrew Bogott: Add designateenv.sh script and refactor novaenv.sh classes [puppet] - 10https://gerrit.wikimedia.org/r/277255 [16:20:24] (03PS3) 10Andrew Bogott: Add wmflabsorg-domainadminenv.sh script and refactor novaenv.sh classes [puppet] - 10https://gerrit.wikimedia.org/r/277255 [16:22:25] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2118764 (10EBernhardson) It looks like this was figured out? I'm seeing http://puppet-compiler.wmflabs.org/2048/ which looks to have dns_al... [16:24:04] (03CR) 10EBernhardson: Expose elasticsearch through HTTP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/274711 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [16:24:28] (03PS1) 10Eevans: dependencies needed for logstash filtering [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/277264 (https://phabricator.wikimedia.org/T128787) [16:25:43] (03CR) 10Gehel: Expose elasticsearch through HTTP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/274711 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [16:26:35] (03PS11) 10Gehel: Expose elasticsearch through HTTP [puppet] - 10https://gerrit.wikimedia.org/r/274711 (https://phabricator.wikimedia.org/T124444) [16:28:24] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:30:05] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [16:31:08] !log thcipriani@tin Finished scap: Better announce new optional MT services available [[gerrit:277195]] (duration: 26m 17s) [16:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:31:14] ^ kart_ done! [16:32:17] finally. [16:33:28] Working, l10n msg blinks. [16:33:57] kart_: cool, thanks [16:34:50] (CR) EBernhardson: [C: 1] "puppet compiler output looks good to me, basic review of the code also looks right" [puppet] - https://gerrit.wikimedia.org/r/274711 (https://phabricator.wikimedia.org/T124444) (owner: Gehel) [16:35:22] Operations, CirrusSearch, Discovery, Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2118858 (Gehel) Strange ... I can't find what I would have changed to fix this. Maybe I just had the changes locally and forgot to push t... [16:36:28] (PS1) Eevans: Filter StatusLogger messages from UDP appender [puppet] - https://gerrit.wikimedia.org/r/277265 (https://phabricator.wikimedia.org/T128787) [16:37:30] (CR) Eevans: [C: -1] "I am -1'ing this until https://gerrit.wikimedia.org/r/#/c/277264 is applied, and the jars are fully deployed." 
[puppet] - 10https://gerrit.wikimedia.org/r/277265 (https://phabricator.wikimedia.org/T128787) (owner: 10Eevans) [16:39:14] !log deployed fix for scribunto issue related to T110143 [16:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:40:15] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: puppet fail [16:48:29] (03CR) 10Mobrovac: [C: 031] dependencies needed for logstash filtering [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/277264 (https://phabricator.wikimedia.org/T128787) (owner: 10Eevans) [16:49:30] 6Operations, 10Wikimedia-General-or-Unknown, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Switchover of the application servers to codfw - https://phabricator.wikimedia.org/T124671#2118921 (10Whatamidoing-WMF) [16:57:14] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to to analytics-search-user for Mikhail Popov and Oliver Keyes - https://phabricator.wikimedia.org/T129260#2118958 (10RobH) a:5RobH>3elukey Ops meeting update: Approved, as long as Luca/Andrew (analytics) have no objections. (One... [17:07:06] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:10:25] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to to analytics-search-user for Mikhail Popov and Oliver Keyes - https://phabricator.wikimedia.org/T129260#2119025 (10elukey) Approved! [17:10:34] _joe_: Could you help me with a few minor changes? 
[17:10:43] Actually, just https://gerrit.wikimedia.org/r/#/c/276754/ [17:10:58] (03PS2) 10Krinkle: Update outdated monitoring url for http_bits [puppet] - 10https://gerrit.wikimedia.org/r/276754 [17:11:49] <_joe_> Krinkle: in a meeting [17:12:34] Krinkle: i'll do the monitoring change [17:12:40] Thx [17:13:16] (03CR) 10Dzahn: [C: 032] "yes - https://bits.wikimedia.org/en.wikipedia.org/load.php" [puppet] - 10https://gerrit.wikimedia.org/r/276754 (owner: 10Krinkle) [17:14:30] (03PS3) 10Krinkle: Remove unused static symlinks for beta php-master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276748 (https://phabricator.wikimedia.org/T99096) [17:15:13] (03CR) 10Krinkle: [C: 032] Remove unused static symlinks for beta php-master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276748 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [17:15:48] (03Merged) 10jenkins-bot: Remove unused static symlinks for beta php-master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276748 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [17:19:58] Krinkle: while the check command is fine, it looks like no service currently uses that command [17:20:40] on prod icinga that is [17:21:41] but since it's in nagios_common it might be used by shinken.. yes [17:27:29] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to to analytics-search-user for Mikhail Popov and Oliver Keyes - https://phabricator.wikimedia.org/T129260#2119140 (10RobH) IRC Note: > elukey: Andrew approved, already talked with him! [17:27:39] (03PS2) 10RobH: add bearloga & ironholds to analytics-search-user [puppet] - 10https://gerrit.wikimedia.org/r/276190 (https://phabricator.wikimedia.org/T129260) [17:30:31] (03CR) 10RobH: [C: 032] "approval was granted in ops meeting and with followup on task." 
[puppet] - 10https://gerrit.wikimedia.org/r/276190 (https://phabricator.wikimedia.org/T129260) (owner: 10RobH) [17:32:24] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to to analytics-search-user for Mikhail Popov and Oliver Keyes - https://phabricator.wikimedia.org/T129260#2119175 (10RobH) 5Open>3Resolved a:5elukey>3None This was approved in today's operations meeting, and then followup appr... [17:32:34] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [17:33:17] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [17:33:26] o.O tin and mira? [17:40:04] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/3: down - TiNet {#1065}BR [17:41:51] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [17:42:46] andrewbogott: dns?^ is that you [17:43:33] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 802763 bytes in 8.863 second response time [17:43:56] well, I restarted pdns again with pretty immediate effect, I'm guessing this is the ongoing DNS issue [17:52:13] ! log restart pdns on labservices1001 [17:54:15] chasemp: extra space [17:54:36] 6Operations, 10Traffic: varnishkafka logrotate cronspam - https://phabricator.wikimedia.org/T129344#2119283 (10elukey) A bit of detail from cp1052: ``` elukey@cp1052:~$ ls -l /etc/logrotate.d/varnishkafka* -r--r--r-- 1 root root 174 Mar 2 12:02 /etc/logrotate.d/varnishkafka -r--r--r-- 1 root root 222 Aug 31... [17:56:05] 17:32 < icinga-wm> PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). ? 
[17:56:59] (03PS3) 10Krinkle: Move /w/static to /static (keeping symlink) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276377 [17:57:42] Krinkles change for beta wasn't pulled onto tin [17:57:55] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [17:58:21] 6Operations, 10ops-eqiad: ms-fe1004 off the network - https://phabricator.wikimedia.org/T129896#2119306 (10fgiunchedi) [18:00:36] !log restart pdns on labservices1001 :) [18:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:01:25] jynus: yt? [18:02:23] yes [18:03:04] 6Operations, 10ops-eqiad: ms-fe1004 off the network - https://phabricator.wikimedia.org/T129896#2119338 (10fgiunchedi) also I don't see the port listed in librenms here https://librenms.wikimedia.org/device/device=67/tab=ports/ and the respective port down alert in "alerts" tab [18:04:44] (03PS2) 10Krinkle: Update file-system references from /w/static to /static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276392 [18:07:14] (03CR) 10Krinkle: [C: 032] Move /w/static to /static (keeping symlink) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276377 (owner: 10Krinkle) [18:07:42] (03CR) 10Krinkle: [C: 032] Update file-system references from /w/static to /static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276392 (owner: 10Krinkle) [18:07:48] (03Merged) 10jenkins-bot: Move /w/static to /static (keeping symlink) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276377 (owner: 10Krinkle) [18:08:00] Reedy: thx [18:08:20] (03Merged) 10jenkins-bot: Update file-system references from /w/static to /static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276392 (owner: 10Krinkle) [18:11:45] <_joe_> AaronSchulz: are you around? 
[18:12:15] * AaronSchulz lurks [18:12:26] <_joe_> In case I would need you to take a look at https://phabricator.wikimedia.org/T129517 and specifically an opinion on my proposal to trash the backlog of refreshlinks :) [18:12:47] nuria ? [18:12:55] jynus: hola [18:13:00] hi there [18:13:04] <_joe_> AaronSchulz: so the mystery is that we're enqueueing refreshLinks jobs at a rate that is 5x the normal rate, but only on specific wikis it seems [18:13:27] jynus: i was wondering if you have an idea when you could get to add the autoincrement field to eventlogging db [18:13:33] <_joe_> or maybe it's just specific wikis that built a backlog, anyways databases feel the load and the jobqueue is still enormous [18:13:43] jynus: that way after we can fix teh replication script [18:14:03] <_joe_> AaronSchulz: and it seems you're the only one able to understand what's really going on :) [18:14:17] yes, in theory, with all tables on tokudb, that should be "hot" (I would have to check it) [18:14:23] jynus: https://phabricator.wikimedia.org/T125135 [18:14:38] so we just need to generate a list of tables missing, and apply the schema change [18:14:46] I don't know what's adding them either...I was thinking about what kind of logging would be useful [18:14:53] <_joe_> that too [18:15:12] <_joe_> trimming down the queue could help maybe in understanding what's going on? 
[18:15:30] jynus: right, and change replication script and monitor it all goes well [18:15:49] <_joe_> I've prepared a bash script to purge the redises in case from all the refreshlinks keys in specific wikis [18:15:52] (03Abandoned) 10Dzahn: salt-misc: set $SSH to $(which ssh) vs manual setup [software] - 10https://gerrit.wikimedia.org/r/276847 (owner: 10Dzahn) [18:15:52] jynus: on our end that allows us to process events in parallel (now we have to do it sequentially) [18:16:06] (03Abandoned) 10Dzahn: salt-misc: make bastion host configurable [software] - 10https://gerrit.wikimedia.org/r/276882 (owner: 10Dzahn) [18:17:14] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 58.62% of data above the critical threshold [5000000.0] [18:19:15] huh, plwiktionary queue is cleared out already [18:19:38] * AaronSchulz looks at frwiktionary instead now [18:20:11] yes, I told ottomata that, even if he was the one creating the issue [18:20:24] I am backfilling the events because of that [18:21:00] 6Operations, 10Ops-Access-Requests, 15User-greg: Requesting access to production for SWAT deploy for dereckson - https://phabricator.wikimedia.org/T129365#2103523 (10Dzahn) @greg i'm the on-duty ops guy this week, feel free to give it to me if it's ready to move forward [18:21:02] ehhh? [18:21:03] <_joe_> AaronSchulz: ping me if you have any actionables later, I'm going off for now (it's almost 8 pm) [18:21:47] (oh carry on) [18:22:57] sorry, I pinged you by accident [18:23:56] 6Operations, 10Traffic: varnishkafka logrotate cronspam - https://phabricator.wikimedia.org/T129344#2119487 (10Ottomata) I am remembering things...... When @akosiaris and I packaged this thing, we had a lot of trouble getting logging with packaging to work properly. Perhaps you are right, that rsyslog stuff... [18:24:55] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). 
[18:27:05] 6Operations, 10Ops-Access-Requests, 15User-greg: Requesting access to production for SWAT deploy for dereckson - https://phabricator.wikimedia.org/T129365#2119494 (10greg) I'm working through some ideas of on "what should new SWAT members know/be familiar with and what can they not know and still be OK?". B... [18:28:19] 6Operations, 10Ops-Access-Requests, 15User-greg: Requesting access to production for SWAT deploy for dereckson - https://phabricator.wikimedia.org/T129365#2119515 (10greg) [18:31:46] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [18:33:45] (03PS4) 10Andrew Bogott: Add wmflabsorg-domainadminenv.sh script and refactor novaenv.sh classes [puppet] - 10https://gerrit.wikimedia.org/r/277255 [18:34:22] _joe_: https://phabricator.wikimedia.org/T129317 1 or 2, or half? [18:34:28] (half of all?) [18:34:43] oh, there are four [18:34:45] right [18:38:54] (03CR) 10Cmjohnson: [C: 032] Adding dns entries for new snapshot hosts (1005-1007) [dns] - 10https://gerrit.wikimedia.org/r/277259 (owner: 10Cmjohnson) [18:39:54] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 [18:40:48] 6Operations, 6Performance-Team, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Figure out how to migrate the jobqueues - https://phabricator.wikimedia.org/T124673#2119592 (10Joe) [18:41:38] 6Operations, 6Performance-Team, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Dedicate 1/2 codfw jobrunners to gwtoolset jobs - https://phabricator.wikimedia.org/T129317#2102170 (10Joe) We should un-dedicate specific jobrunners to gwtoolset in eqiad instead, as the memleak doesn't show up at the moment [18:41:42] <_joe_> Krinkle: ^^ [18:41:56] <_joe_> thanks to hhvm 3.12, supposedly [18:42:46] (03CR) 10Andrew Bogott: [C: 032] Add wmflabsorg-domainadminenv.sh script and refactor novaenv.sh classes [puppet] - 10https://gerrit.wikimedia.org/r/277255 
(owner: 10Andrew Bogott) [18:43:14] !log krinkle@tin Synchronized static/: Moved /srv/mediawiki/w/static to /srv/mediawiki/static (duration: 00m 29s) [18:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:44:24] !log krinkle@tin Synchronized w: Replace /w/static with symlink (duration: 00m 30s) [18:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:47:40] 6Operations: Some labvirt systems use qemu from "cloud archive" - https://phabricator.wikimedia.org/T127113#2119605 (10MoritzMuehlenhoff) Updated packages for 2.2 are now available in the cloud archive. @Andrew , what your preference; downgrading 1005/1010 to 2.0 or upgrading the rest to 2.2? IMO it makes sense... [18:51:49] (03PS1) 10Andrew Bogott: Use proper novaconfig[] settings in environment scripts. [puppet] - 10https://gerrit.wikimedia.org/r/277311 [18:56:06] !log Starting Cassandra repairs on restbase1007-a.eqiad.wmnet : T108611 [18:56:07] T108611: perform initial (manual) repair of Cassandra cluster - https://phabricator.wikimedia.org/T108611 [18:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:56:40] (03PS2) 10Krinkle: Remove unused wmf-deployment symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276221 (owner: 10Ori.livneh) [18:57:09] (03CR) 10Andrew Bogott: [C: 032] Use proper novaconfig[] settings in environment scripts. [puppet] - 10https://gerrit.wikimedia.org/r/277311 (owner: 10Andrew Bogott) [18:59:39] (03CR) 10Krinkle: [C: 032] "Staging on mw1017 to be sure." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276221 (owner: 10Ori.livneh) [19:00:02] (03Merged) 10jenkins-bot: Remove unused wmf-deployment symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276221 (owner: 10Ori.livneh) [19:01:15] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. 
[19:06:14] 6Operations, 10MobileFrontend, 10Traffic, 3Reading-Web-Sprint-68-"Java and JavaScript are basically the same", and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getT... - https://phabricator.wikimedia.org/T124356#2119664 [19:09:23] 6Operations, 10hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2119669 (10Ottomata) a:5Ottomata>3RobH > If we can fit more cores per system, is there any benefit to lowering this cluster size from 3 to 2? Let's stick with 3. > Do you want identical du... [19:09:32] !log krinkle@tin Synchronized docroot/mediawiki/: Update static symlink (duration: 00m 29s) [19:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:10:28] 6Operations, 10hardware-requests: +1 'stat' type box for hadoop client usage - https://phabricator.wikimedia.org/T128808#2119671 (10Ottomata) a:5RobH>3None Bump, we should use one of the spare 2623 V3s for this. [19:10:44] 6Operations, 10hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2086645 (10RobH) Ok, IRC update after chatting with @Ottomata Since this is a new service, the ideal core to memory ratio is largely unknown. At this point, its estimated that a dual cpu syst... [19:10:46] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [19:11:12] 6Operations, 10Analytics, 10hardware-requests, 13Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#2119680 (10Ottomata) Bump! :) [19:12:55] 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#2119707 (10Ottomata) Also bump! 
[19:16:47] 6Operations, 10hardware-requests: +1 'stat' type box for hadoop client usage - https://phabricator.wikimedia.org/T128808#2119727 (10RobH) a:3mark Those are the only in warranty spare in eqiad with 32GB or greater memory. I'm escalating this request to @mark for his approval. This is a request for a single... [19:16:50] (03PS1) 10Andrew Bogott: Define wmflabsdotorg_project for labtest [puppet] - 10https://gerrit.wikimedia.org/r/277322 [19:17:27] !log krinkle@tin Synchronized docroot/: Update static symlinks (duration: 00m 28s) [19:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:18:04] 6Operations, 10MobileFrontend, 10Traffic, 3Reading-Web-Sprint-68-"Java and JavaScript are basically the same", and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getT... - https://phabricator.wikimedia.org/T124356#2119741 [19:18:39] 6Operations, 10MediaWiki-Interface, 10Traffic: Purge pages cached with mobile editlinks - https://phabricator.wikimedia.org/T125841#2119744 (10matmarex) The gook was live for a while, but then was removed: * {530722ea26a24360561271b547d3086a03134a5e} * {c44425db355c64a55775053726d31b246bfb9549} * {07f0e35050... [19:18:51] (03CR) 10Andrew Bogott: [C: 032] Define wmflabsdotorg_project for labtest [puppet] - 10https://gerrit.wikimedia.org/r/277322 (owner: 10Andrew Bogott) [19:20:37] ? [19:21:13] (03PS12) 10Gehel: Expose elasticsearch through HTTP [puppet] - 10https://gerrit.wikimedia.org/r/274711 (https://phabricator.wikimedia.org/T124444) [19:23:22] (03CR) 10Gehel: [C: 032] Expose elasticsearch through HTTP [puppet] - 10https://gerrit.wikimedia.org/r/274711 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [19:25:59] aude: where is the log in kibana for scheduleRefreshLinks()? The class name doesn't seem to work as the channel, but I know there must be events. 
[19:30:14] PROBLEM - puppet last run on elastic1020 is CRITICAL: CRITICAL: Puppet has 1 failures [19:30:25] PROBLEM - puppet last run on elastic1021 is CRITICAL: CRITICAL: Puppet has 1 failures [19:30:45] PROBLEM - puppet last run on elastic1001 is CRITICAL: CRITICAL: Puppet has 3 failures [19:31:25] PROBLEM - puppet last run on nobelium is CRITICAL: CRITICAL: Puppet has 2 failures [19:31:35] PROBLEM - puppet last run on elastic2024 is CRITICAL: CRITICAL: Puppet has 1 failures [19:31:50] AaronSchulz: wikibase.client.pageupdates.RefreshLinksJob [19:31:54] PROBLEM - puppet last run on elastic2007 is CRITICAL: CRITICAL: Puppet has 1 failures [19:31:55] PROBLEM - puppet last run on elastic1009 is CRITICAL: CRITICAL: Puppet has 1 failures [19:32:16] PROBLEM - puppet last run on elastic1010 is CRITICAL: CRITICAL: Puppet has 1 failures [19:32:45] PROBLEM - puppet last run on elastic2010 is CRITICAL: CRITICAL: Puppet has 1 failures [19:32:49] Puppet is going to be in error on all elasticsearch servers. I'm going to try to roll it forward instead of rolling back (I missed that the paths to Puppet SSL certificates are not the same in production) [19:33:00] hoo: the wfDebugLog() call? [19:33:19] I'm the one responsible for the icinga alerts above... 
[19:33:20] oh, that as well [19:33:32] Yeah, we have that as well [19:33:41] but it's also in graphite [19:34:45] PROBLEM - puppet last run on elastic1028 is CRITICAL: CRITICAL: Puppet has 1 failures [19:35:14] PROBLEM - puppet last run on elastic2001 is CRITICAL: CRITICAL: Puppet has 1 failures [19:35:15] PROBLEM - puppet last run on elastic1011 is CRITICAL: CRITICAL: Puppet has 1 failures [19:35:15] PROBLEM - puppet last run on logstash1003 is CRITICAL: CRITICAL: Puppet has 3 failures [19:35:15] PROBLEM - puppet last run on elastic1014 is CRITICAL: CRITICAL: Puppet has 1 failures [19:35:25] PROBLEM - DPKG on logstash1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:35:35] PROBLEM - puppet last run on elastic1005 is CRITICAL: CRITICAL: Puppet has 1 failures [19:36:11] hoo: how do I see the titles in kibana? what query? [19:38:50] AaronSchulz: Oh, Kibana… don't think we push that [19:38:56] but would be trivial to turn on [19:49:59] AaronSchulz: Yeah, logging to logstash is based on a whitelist of channels [19:50:11] the default catch-all channel is only enabled for testwiki [19:50:41] hoo: would be nice if the channel didn't use __CLASS__ and was enabled. Though the later is more important ;) [19:50:53] It should probably use a more generic channel name and pass the class as part of the message or context [19:50:54] * AaronSchulz has to go for a bit [19:51:09] AaronSchulz: Do you only care for refreshlinks? [19:51:15] or the other update types as well? [19:51:25] just refreshlinks [19:51:28] ok [19:51:46] I'll see what I can do… cu [19:55:16] <_joe_> AaronSchulz: how are your links then? Fresh enough? :P [19:56:35] (03PS1) 10Gehel: Puppet SSL dir is not the same on Production or Labs [puppet] - 10https://gerrit.wikimedia.org/r/277329 (https://phabricator.wikimedia.org/T124444) [19:58:35] Is there anyone who understand something about the difference between Labs and Production regarding where we deploy Puppet certificates? 
I need a second pair of eyes on https://gerrit.wikimedia.org/r/#/c/277329/ [19:59:56] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [5000000.0] [20:00:04] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160314T2000). Please do the needful. [20:00:45] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000000.0] [20:01:59] no mobileapps deploy today [20:02:55] PROBLEM - puppet last run on alsafi is CRITICAL: CRITICAL: Puppet has 1 failures [20:03:04] no parsoid deploy either [20:04:11] ACKNOWLEDGEMENT - puppet last run on alsafi is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn testing ganglia-aggregator service on jessie [20:14:25] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [20:15:06] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [20:19:12] o/ akosiaris [20:19:21] Got a few minutes to give me an update on ORES server work? [20:20:07] Looking at https://phabricator.wikimedia.org/T106867 [20:20:44] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: puppet fail [20:22:14] Relatedly, I've been thinking about using git LFS to manage our larger binaries. [20:22:30] Usually I'd bounce these ideas off yuvipanda, but he's on vacation :) [20:23:06] today is a bank holiday here, you might try tomorrow [20:23:20] unless he's suddenly shown up active I mean [20:23:33] apergos, talking about akosiaris? [20:23:37] yep [20:23:48] Gotcha. Thanks for the info. [20:23:52] sure thing [20:24:16] apergos, any thoughts on git Large File Storage? 
[20:24:39] In revscoring, we have big model file blobs and some python dependency blobs that I'd like to version along with code. [20:24:44] no, and I probably should have [20:24:57] I have no experience with it however [20:24:58] Gotcha. no worries :) [20:25:20] * halfak would totally a use (abuse?) an ops Q&A service. [20:25:30] wouldn't we all [20:25:31] Should I or is that a bad idea. [20:25:47] "stack overflow" [20:25:49] *cough* [20:26:17] maybe someone else on the team has played with it or used it at $job-1 [20:26:34] Yeah, but I want to know what ya'll think. It's nice to have someone who knows our systems say, "Yeah, that'll work nicely." or "That's a good idea, but it'll be hard for us." [20:26:43] halfak: GitHub uses it, if you want to play with it. [20:26:58] Matthew_, yeah. Been playing with it. Seems to work OK. [20:26:58] yeah I noticed they'd rolled that out [20:27:04] but I never got around to messing with it [20:27:25] That's good. I don't have much experience with it myself but... [20:27:34] (03CR) 10RobH: [C: 031] "looks like it is small in scope (and already not functioning on live systems.) puppet_ssldir call seem right." [puppet] - 10https://gerrit.wikimedia.org/r/277329 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [20:27:43] The good news is that it is trivial to use with new files that you'd like to track with LFS. [20:27:56] The bad news is that if you want to migrate a big file over to LFS, the docs say nothing. [20:28:01] Might as well be laughing at you. [20:28:11] So I have a hacky filter-tree script. [20:28:12] (03CR) 10Gehel: [C: 032] Puppet SSL dir is not the same on Production or Labs [puppet] - 10https://gerrit.wikimedia.org/r/277329 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [20:28:17] To preserve history. [20:28:23] * halfak punches self in face. 
[20:28:51] (03PS1) 10Andrew Bogott: enable_host_header = True in designate.conf [puppet] - 10https://gerrit.wikimedia.org/r/277336 [20:29:20] Maybe _joe_ has opinions? :) [20:29:50] TL;DR: I'm trying to work out whether moving some blobs in revscoring over to git LFS is a good idea for our eventual move to prod or not. [20:30:01] Looking for someone in ops to tell me to run away or invest. [20:30:14] RECOVERY - puppet last run on elastic1001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [20:30:31] eq gen maint at eqiad just completed no issues [20:30:48] (03CR) 10Andrew Bogott: [C: 032] enable_host_header = True in designate.conf [puppet] - 10https://gerrit.wikimedia.org/r/277336 (owner: 10Andrew Bogott) [20:31:05] RECOVERY - puppet last run on elastic1011 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [20:31:05] RECOVERY - puppet last run on elastic2001 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [20:31:05] RECOVERY - puppet last run on elastic1014 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [20:31:12] which is pretty par for the course, eq does stuff they let us know is happening, we see no impact, yay stability =] [20:32:25] RECOVERY - puppet last run on elastic1028 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:33:10] (03PS1) 10Dzahn: ganglia: on jessie, spawn aggregators with systemd [puppet] - 10https://gerrit.wikimedia.org/r/277340 (https://phabricator.wikimedia.org/T124197) [20:33:15] RECOVERY - puppet last run on elastic1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:35:22] halfak: fwiw, when git + really large files, i believe https://en.wikipedia.org/wiki/Git-annex so it indexes the files but does not store the content in the git history [20:36:14] (03PS1) 10Andrew Bogott: Sync designate.conf between kilo and liberty [puppet] - 
https://gerrit.wikimedia.org/r/277342 [20:36:40] log! HTTPS activated on elasticsearch (no client using it yet) [20:36:48] mutante, do we (WMF) use git-annex? [20:36:54] !log HTTPS activated on elasticsearch (no client using it yet) [20:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:37:25] it seems like a more cumbersome version of git LFS [20:37:26] halfak: i don't think so, i do know though that jzerebecki of WMDE has used it [20:37:32] Gotcha. [20:38:00] Just found this: http://workingconcept.com/blog/git-annex-vs-git-lfs [20:38:06] (CR) Andrew Bogott: [C: 2] Sync designate.conf between kilo and liberty [puppet] - https://gerrit.wikimedia.org/r/277342 (owner: Andrew Bogott) [20:38:10] ah [20:39:06] Think of Git Annex as an experienced librarian waiting at the information desk. ..hehe ok [20:39:31] i didn't know Git LFS is what github uses [20:41:11] mutante, it would be nice to be able to use git LFS on our git servers so that I can clone my github repos to there. Does that sound crazy? [20:41:18] *mirror (not clone) [20:42:20] hmm, not very crazy [20:42:50] but.. 
our git servers are becoming Phabricator [20:43:06] and things that the releng team could answer better [20:43:27] maybe you could bring up the question on #wikimedia-devtools [20:45:13] (PS2) Dzahn: ganglia: on jessie, spawn aggregators with systemd [puppet] - https://gerrit.wikimedia.org/r/277340 (https://phabricator.wikimedia.org/T124197) [20:45:31] grrrr, grrrit-wm [20:46:22] (CR) Dzahn: [C: 2] ganglia: on jessie, spawn aggregators with systemd [puppet] - https://gerrit.wikimedia.org/r/277340 (https://phabricator.wikimedia.org/T124197) (owner: Dzahn) [20:47:46] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [20:50:15] Operations, Ops-Access-Requests, User-greg: Requesting access to production for SWAT deploy for dereckson - https://phabricator.wikimedia.org/T129365#2120066 (greg) To clarify after reading this again and a question in a PM: 1) Dereckson himself is not the issue 2) by "cast the net too wide" I mean t... [20:50:52] mutante, will do. Thanks for the thoughts :) [20:51:10] halfak: you're welcome [20:51:31] So we are actually switching to Diffusion for code review? Or did I read that wrong? [20:53:50] Matthew_: https://phabricator.wikimedia.org/project/profile/9/ [20:53:53] Matthew_: Differential, and yes, that is the plan [20:54:15] That explains why I couldn't find it, I was looking at tasks. And excellent! [20:54:28] Matthew_: separate is https://phabricator.wikimedia.org/T752 [20:54:44] RECOVERY - puppet last run on elastic1020 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [20:54:46] RECOVERY - puppet last run on elastic1021 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [20:55:14] Very cool! [20:55:16] Thank you. 
[20:55:57] (03PS1) 10Andrew Bogott: Set api_base_uri in designate.conf [puppet] - 10https://gerrit.wikimedia.org/r/277343 [20:57:23] (03PS3) 10Andrew Bogott: Set api_base_uri in designate.conf [puppet] - 10https://gerrit.wikimedia.org/r/277343 [20:57:35] RECOVERY - puppet last run on nobelium is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [20:57:55] RECOVERY - puppet last run on elastic2024 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [20:58:06] RECOVERY - puppet last run on elastic1009 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [20:58:15] RECOVERY - puppet last run on elastic2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:58:35] RECOVERY - puppet last run on elastic1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:59:08] RECOVERY - puppet last run on elastic2010 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [20:59:18] 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#2120089 (10RobH) a:5mark>3RobH I have to steal this back for update, as other allocations (sca and scb clusters) used up all the codfw spares. This request is n... [21:00:04] yurik maxsem: Respected human, time to deploy Kartographer extension (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160314T2100). Please do the needful. [21:00:16] huh? [21:00:45] nope [21:00:50] yurik, didja request another window or it's a glitch? 
[21:00:57] 6Operations, 10hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2120107 (10RobH) [21:01:06] (03CR) 10Andrew Bogott: [C: 032] Set api_base_uri in designate.conf [puppet] - 10https://gerrit.wikimedia.org/r/277343 (owner: 10Andrew Bogott) [21:01:19] MaxSem, i think it got autocopied [21:02:04] MaxSem, want to remove it or should i? [21:02:18] MaxSem, ok,i ll do it [21:02:32] done [21:02:35] yurik, just did it [21:02:42] I won, heheh https://wikitech.wikimedia.org/w/index.php?title=Deployments&action=history [21:02:43] i think we both just did [21:02:48] fine fine :-P [21:03:00] yours is fasta [21:04:26] jouncebot, next [21:04:26] In 1 hour(s) and 55 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160314T2300) [21:05:20] ah, such a smart botty [21:06:21] (03PS1) 10Dzahn: ganglia: fix systemd instance service name [puppet] - 10https://gerrit.wikimedia.org/r/277347 (https://phabricator.wikimedia.org/T124197) [21:07:27] (03PS2) 10Dzahn: ganglia: fix systemd instance service name [puppet] - 10https://gerrit.wikimedia.org/r/277347 (https://phabricator.wikimedia.org/T124197) [21:07:49] (03CR) 10Dzahn: [C: 032] ganglia: fix systemd instance service name [puppet] - 10https://gerrit.wikimedia.org/r/277347 (https://phabricator.wikimedia.org/T124197) (owner: 10Dzahn) [21:10:58] (03Abandoned) 10Dzahn: install: use raid1-lvm-ext4-srv.cfg on rdb1001 [puppet] - 10https://gerrit.wikimedia.org/r/276785 (https://phabricator.wikimedia.org/T129178) (owner: 10Dzahn) [21:14:29] 6Operations, 13Patch-For-Review: Port Ganglia aggregator setup to systemd - https://phabricator.wikimedia.org/T124197#2120176 (10Dzahn) instances now get spawned by puppet on alsafi: ``` Notice: /Stage[main]/Ganglia::Monitor::Aggregator/Ganglia::Monitor::Aggregator::Site_instances[codfw]/Ganglia::Monitor::Agg... 
[21:27:20] hey, I'm getting intermittent "DB locked due to lag" errors on mw.org - is something broken/being maintained? [21:28:46] jynus, ^^ [21:30:45] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 51.72% of data above the critical threshold [5000000.0] [21:31:10] (PS1) Dzahn: ganglia: do not start meta-service on jessie/systemd [puppet] - https://gerrit.wikimedia.org/r/277354 (https://phabricator.wikimedia.org/T124197) [21:32:05] (PS2) Dzahn: ganglia: do not start meta-service on jessie/systemd [puppet] - https://gerrit.wikimedia.org/r/277354 (https://phabricator.wikimedia.org/T124197) [21:33:41] (CR) jenkins-bot: [V: -1] ganglia: do not start meta-service on jessie/systemd [puppet] - https://gerrit.wikimedia.org/r/277354 (https://phabricator.wikimedia.org/T124197) (owner: Dzahn) [21:36:55] Operations, Wikimedia-General-or-Unknown, User-bd808: Update Wikimedia Debug extensions for Chrome and Firefox for configurable backend selection - https://phabricator.wikimedia.org/T129283#2120269 (ori) >>! In T129283#2107443, @bd808 wrote: > Wikimedia Debug Header 0.5.0 for Firefox is now available... [21:37:24] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Puppet has 1 failures [21:37:44] PROBLEM - puppet last run on logstash1001 is CRITICAL: CRITICAL: Puppet has 1 failures [21:37:56] PROBLEM - DPKG on logstash1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:38:16] PROBLEM - DPKG on logstash1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:38:52] that looks like search is being upgraded? 
[21:39:19] eh, not search, logstash of course [21:39:26] gehel, ^ [21:40:20] gehel: it doesnt like this part: [21:40:23] iF nginx-full [21:40:45] halF-conf [21:41:25] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [21:41:35] (03PS1) 10Aaron Schulz: Remove obsolete comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277359 [21:43:42] 6Operations, 10Wikimedia-General-or-Unknown, 15User-bd808: Update Wikimedia Debug extensions for Chrome and Firefox for configurable backend selection - https://phabricator.wikimedia.org/T129283#2120283 (10bd808) >>! In T129283#2120269, @ori wrote: >>>! In T129283#2107443, @bd808 wrote: >> Wikimedia Debug He... [21:47:56] (03CR) 10Chad: "Anything else we need to do here?" [puppet] - 10https://gerrit.wikimedia.org/r/262670 (owner: 10Chad) [21:49:24] (03PS3) 10Dzahn: ganglia: do not start meta-service on jessie/systemd [puppet] - 10https://gerrit.wikimedia.org/r/277354 (https://phabricator.wikimedia.org/T124197) [21:50:13] 6Operations, 10ops-codfw: rack new mw maint host - wasat - https://phabricator.wikimedia.org/T129930#2120304 (10RobH) [21:50:29] 6Operations, 10hardware-requests, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: MediaWiki maintenance host for codfw (terbium's equivalent) - https://phabricator.wikimedia.org/T126987#2029006 (10RobH) [21:50:31] 6Operations, 10ops-codfw: rack new mw maint host - wasat - https://phabricator.wikimedia.org/T129930#2120322 (10RobH) [21:51:55] 6Operations, 10hardware-requests, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: MediaWiki maintenance host for codfw (terbium's equivalent) - https://phabricator.wikimedia.org/T126987#2029006 (10RobH) >>! In T126987#2098962, @Joe wrote: > since it seems clear to me that this system wil... 
[21:56:34] 6Operations, 6Discovery, 10Wikimedia-Logstash, 7Elasticsearch: logstash - nginx failed service start - https://phabricator.wikimedia.org/T129934#2120380 (10Dzahn) [21:56:48] !log logstash - nginx -> T129934 [21:56:49] T129934: logstash - nginx failed service start - https://phabricator.wikimedia.org/T129934 [21:58:26] ACKNOWLEDGEMENT - DPKG on logstash1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn https://phabricator.wikimedia.org/T129934 [21:58:27] ACKNOWLEDGEMENT - DPKG on logstash1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn https://phabricator.wikimedia.org/T129934 [21:58:27] ACKNOWLEDGEMENT - DPKG on logstash1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn https://phabricator.wikimedia.org/T129934 [21:58:52] (03PS1) 10Hoo man: Log "Wikibase\Client\Changes\WikiPageUpdater" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277420 [22:00:27] (03CR) 10Hoo man: [C: 032] Log "Wikibase\Client\Changes\WikiPageUpdater" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277420 (owner: 10Hoo man) [22:02:10] (03PS4) 10Dzahn: ganglia: do not start meta-service on jessie/systemd [puppet] - 10https://gerrit.wikimedia.org/r/277354 (https://phabricator.wikimedia.org/T124197) [22:03:44] * hoo eyes jenkins [22:04:45] "The system administrator who locked it offered this explanation: The database has been automatically locked while the slave database servers catch up to the master." :( [22:04:52] while saving on mediawikiwiki [22:07:02] Krenair, I've seen this too\ [22:08:25] eh? 
[22:08:32] (CR) Hoo man: [V: 2] "Trivial" [mediawiki-config] - https://gerrit.wikimedia.org/r/277420 (owner: Hoo man) [22:08:33] I just got read-only for mw.org as well [22:08:48] opsennnnns ^^^^ [22:09:34] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Log 'Wikibase\Client\Changes\WikiPageUpdater' (duration: 00m 27s) [22:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:10:03] it's s3 [22:10:27] mutante: s3/mw.org is read-only, help [22:10:42] intermittently read-only, I must add [22:10:48] huh, [22:11:11] !log Restarted hhvm on mw1166 (was flooding with undefined variable notices) [22:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:14:04] I see a lot of CategoryMembershipChangeJob::run locks piling up [22:14:55] !log read-only mode, intermittently, on mw.org and other s3 wikis(?) [22:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:15:06] (PS1) Jdlrobson: Avoid legacy overhead in mobile web experience [mediawiki-config] - https://gerrit.wikimedia.org/r/277422 [22:15:09] greg-g: i texted jaime [22:15:12] mutante: thanks [22:15:27] boy do we need more SF based opsen [22:15:36] s/SF based/west coast based/ [22:15:51] (and japan/australia) [22:16:27] yea, you mean DBA [22:16:28] s/japan/asia/ # I just jump to japan because I had to deal with Joi Ito on CC staff calls from Japan and that timezone) [22:16:36] mutante: all of the above :) [22:16:45] (CR) Krinkle: [C: -1] "This runs globally and unconditionally." [mediawiki-config] - https://gerrit.wikimedia.org/r/277422 (owner: Jdlrobson) [22:16:57] jdlrobson: I think you mean a different hook that runs only on mobile? 
[22:17:21] (PS1) Thcipriani: Pass deploy user from service::node [puppet] - https://gerrit.wikimedia.org/r/277423
[22:17:22] right now it's all wikis that have MobileFrontend
[22:17:38] (not when in MF, but when it is installed)
[22:18:07] (PS2) Jdlrobson: Avoid legacy overhead in mobile web experience [mediawiki-config] - https://gerrit.wikimedia.org/r/277422
[22:18:11] Krinkle: fixed up ^
[22:18:47] AaronSchulz, you're next person after DBA who might know about DB :)
[22:19:20] I got a save (thanks VE for not throwing away my edits)
[22:19:57] AaronSchulz: intermittent read-only on s3 wikis, nothing obvious in dbtree (says Krenair)
[22:20:06] I said that?
[22:20:18] I said ;)
[22:20:26] says MaxSem, my bad :)
[22:23:05] greg-g: You're welcome. :-)
[22:25:48] At first I assumed Kr inkle said that
[22:26:42] Krenair: it was bad switching back and forth between channels on my part
[22:26:46] greg-g, where do you see read only mode?
[22:27:32] jynus, when trying to edit/perform actions on mediawiki.org
[22:27:46] point me to the logs? how many reports?
[22:29:04] Operations, Discovery, Wikidata, Wikidata-Query-Service: implement wdqs1001/1002 disk upgrades (extend lvm) - https://phabricator.wikimedia.org/T120714#2120513 (Smalyshev) @RobH this has been somewhat neglected, can we resurrect it and do the space extension? Or do we have to do full reimage?
[22:30:41] AaronSchulz: $ tail -f Wikibase\\Client\\Changes\\WikiPageUpdater.log | awk '/INFO: scheduleRefreshLinks: scheduling refresh links for/ { gsub(/.*scheduling refresh links for /, ""); print }'
[22:30:49] On fluorine should give you the titles
[22:31:05] I switched the logging on a couple of minutes ago
[22:33:16] jynus, I see some in exception.log on fluorine
[22:33:59] ok
[22:37:00] Database is read-only: The database has been automatically locked while the slave database servers catch up to the master.
[22:37:00] I saw this from a cron job (23 minutes ago if it's relevant)
[22:37:17] just saw it now in the emails
[22:37:28] *32 minutes
[22:38:04] "Database is read-only" is a bad error message
[22:38:18] it means "Mediawiki has disabled writes"
[22:38:58] uh-oh 7 Could not update user with ID '0'; DB is read-only.
[22:39:15] double error: w/o and id=0 :P
[22:39:53] the particular error is Database is read-only: The database has been automatically locked while the slave database servers catch up to the master.
[22:40:23] which comes from LoadBalancer
[22:43:24] hmm, "To avoid creating high replication lag, this transaction was aborted because the write duration (8.7302453517914) exceeded the 6 seconds limit."
[22:44:45] you are copying errors here with no control, most of those are either known issues (like centralnotice banners)
[22:44:51] that's been seen before by the CentralNotice admins IIRC
[22:45:04] or things that could be explained by the extra load due to queue processing
[22:45:36] I am still waiting for the "wikis are down" messages that our monitoring hasn't caught
[22:47:43] Operations, Discovery, Wikidata, Wikidata-Query-Service: implement wdqs1001/1002 disk upgrades (extend lvm) - https://phabricator.wikimedia.org/T120714#2120593 (RobH) At first glance, neither of these is showing the extra disks. It turns out these have the H310 controller, which means we need to e...
[22:47:49] hmm, read-only errors appear on tail -n 800 /a/mw-log/exception.log | grep -P '^\d{4}-\d{2}-\d{2}' | grep -v '\bUsageException\b' | sed -r 's/^.*?\[\w+\]\s+\S+\s+//g' | cut -d ' ' -f 7- | sed -r 's/\{[^\{]+\}\s*$//' | sort | uniq -c | sort -rn
[22:48:31] if -n is decreased to 700 there are no r/o errors which suggests we're not experiencing this problem right now
[22:48:46] but that is a mediawiki log
[23:00:04] RoanKattouw ostriches Krenair MaxSem: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160314T2300). Please do the needful.
[23:00:04] Dereckson MatmaRex ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:24] greg-g, are we OK to deploy?
[23:01:19] greg-g, looks like we have no more errors
[23:04:11] There are 9 patches + 7 scripts to run. I could postpone 276993 (noc) and 276919 (ko.wikt ns to remove, -1 script to run) if that would allow to decrease this SWAT workload.
[23:04:49] Operations, Discovery, hardware-requests: Refresh elastic10{01..16}.eqiad.wmnet servers - https://phabricator.wikimedia.org/T128000#2120654 (EBernhardson)
[23:06:09] Dereckson, because we recently had problems and possibly still have them, I'm not deploying without an OK from greg-g
[23:06:43] (my scripts are low-priority, by the way. config changes are probably more important)
[23:07:33] MaxSem: I think we're ok
[23:07:36] MaxSem: \o :)
[23:07:54] greg-g, so we can let jynus go?
[23:08:27] do not worry, I only have to get up in 6 hours for the read only tests
[23:08:30] (please ping me when all of everyone else's stuff is done, and if we still have time afterwards :D )
[23:08:42] I am considering staying awake
[23:09:10] jynus: :(
[23:11:45] jynus: if you can't find anything, we'll deal
[23:12:06] (CR) MaxSem: [C: 2] Add files to noc.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/276993 (https://phabricator.wikimedia.org/T116163) (owner: Dereckson)
[23:12:41] (Merged) jenkins-bot: Add files to noc.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/276993 (https://phabricator.wikimedia.org/T116163) (owner: Dereckson)
[23:13:59] !log maxsem@tin Synchronized docroot/noc: https://gerrit.wikimedia.org/r/276993 (duration: 00m 35s)
[23:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:14:11] * MaxSem screams bloody murder
[23:14:12] 183 [10000ms] at runtime/ext_mysql: slow query: SELECT /* CategoryMembershipChangeJob::run */ GET_LOCK('CategoryMembershipUpdates:868538', 10) AS lockstatus
[23:14:39] ori, AaronSchulz this sucks, can we disable it ^^^ ?
[23:15:16] MaxSem: what about it?
[23:15:23] MaxSem: 276993 tested
[23:15:38] Operations, MediaWiki-General-or-Unknown, Release-Engineering-Team: Intermittent read-only errors on s3 wikis on March 14th - https://phabricator.wikimedia.org/T129947#2120701 (greg)
[23:15:52] AaronSchulz, 18.3% of our problems right now are queries taking more than 10 seconds :D
[23:16:37] it's a job query waiting on a lock, the actual updates are batched and do waits on the off-chance they are large sometimes
[23:17:19] (CR) MaxSem: [C: 2] Set logo and site name on gu.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/276547 (https://phabricator.wikimedia.org/T122407) (owner: Dereckson)
[23:17:21] I don't see a connection to replag or anything
[23:17:37] (CR) jenkins-bot: [V: -1] Set logo and site name on gu.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/276547 (https://phabricator.wikimedia.org/T122407) (owner: Dereckson)
[23:17:54] Dereckson, merge conflict ^^^
[23:17:58] Rebasing.
[23:18:20] (CR) MaxSem: [C: 2] Enable SandboxLink on sr.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/276485 (https://phabricator.wikimedia.org/T129485) (owner: Dereckson)
[23:18:57] (Merged) jenkins-bot: Enable SandboxLink on sr.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/276485 (https://phabricator.wikimedia.org/T129485) (owner: Dereckson)
[23:20:00] (PS2) Dereckson: Set logo and site name on gu.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/276547 (https://phabricator.wikimedia.org/T122407)
[23:20:00] Rebased.
^
[23:21:20] (CR) MaxSem: [C: 2] Set logo and site name on gu.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/276547 (https://phabricator.wikimedia.org/T122407) (owner: Dereckson)
[23:21:22] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/276485 (duration: 00m 26s)
[23:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:21:47] MaxSem: 276485 tested
[23:21:52] (Merged) jenkins-bot: Set logo and site name on gu.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/276547 (https://phabricator.wikimedia.org/T122407) (owner: Dereckson)
[23:22:13] Krinkle: https://gerrit.wikimedia.org/r/#/c/277353/
[23:22:47] !log maxsem@tin Synchronized static/images/project-logos/guwiktionary.png: https://gerrit.wikimedia.org/r/#/c/276547/ (duration: 00m 30s)
[23:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:23:41] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/276547/ (duration: 00m 26s)
[23:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:24:36] (CR) MaxSem: [C: 2] Women's writes WikiWarriors edit-a-thon throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/276895 (https://phabricator.wikimedia.org/T129697) (owner: Dereckson)
[23:25:30] MaxSem: 276547 tested
[23:25:38] (Merged) jenkins-bot: Women's writes WikiWarriors edit-a-thon throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/276895 (https://phabricator.wikimedia.org/T129697) (owner: Dereckson)
[23:26:10] !log maxsem@tin Synchronized wmf-config/throttle.php: https://gerrit.wikimedia.org/r/#/c/276895/ (duration: 00m 26s)
[23:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:26:56] (CR) MaxSem: [C: 2] Configure upload rights on ce.wikipedia [mediawiki-config] -
https://gerrit.wikimedia.org/r/276378 (https://phabricator.wikimedia.org/T129005) (owner: Dereckson)
[23:27:31] (Merged) jenkins-bot: Configure upload rights on ce.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/276378 (https://phabricator.wikimedia.org/T129005) (owner: Dereckson)
[23:27:53] (PS1) Jdlrobson: WikidataPageBanner config changes [mediawiki-config] - https://gerrit.wikimedia.org/r/277431 (https://phabricator.wikimedia.org/T129099)
[23:28:30] (CR) jenkins-bot: [V: -1] WikidataPageBanner config changes [mediawiki-config] - https://gerrit.wikimedia.org/r/277431 (https://phabricator.wikimedia.org/T129099) (owner: Jdlrobson)
[23:28:54] !log maxsem@tin Synchronized dblists/commonsuploads.dblist: SWAT (duration: 00m 26s)
[23:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:29:41] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: SWAT (duration: 00m 27s)
[23:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:30:56] note: I'm deploying out of the order to do ns changes last as they require maint scripts run
[23:31:21] MaxSem: nothing testable for 276895, 276378 tested to the extent the group is well created in the Special page for user rights groups
[23:31:38] Dereckson, thanks, you're awesome!
:D
[23:32:37] (CR) Krinkle: Avoid legacy overhead in mobile web experience (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/277422 (owner: Jdlrobson)
[23:33:27] !log maxsem@tin Synchronized php-1.27.0-wmf.16/extensions/CirrusSearch/: (no message) (duration: 00m 32s)
[23:33:28] ebernhardson, ^^^
[23:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:34:10] MaxSem: checking
[23:34:11] (PS3) Jdlrobson: Avoid legacy overhead in mobile web experience [mediawiki-config] - https://gerrit.wikimedia.org/r/277422
[23:34:14] ebernhardson, ^^^
[23:34:17] !log maxsem@tin Synchronized php-1.27.0-wmf.16/extensions/WikimediaEvents/: https://gerrit.wikimedia.org/r/#/c/277337/ (duration: 00m 26s)
[23:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:35:52] (CR) MaxSem: [C: 2] Remove Wikisaurus namespace from ko.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/276919 (https://phabricator.wikimedia.org/T129631) (owner: Dereckson)
[23:36:26] (Merged) jenkins-bot: Remove Wikisaurus namespace from ko.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/276919 (https://phabricator.wikimedia.org/T129631) (owner: Dereckson)
[23:36:46] MaxSem: so far so good, i'll double check the logging coming in in a few once it has time to hit a few users
[23:37:18] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/276919/ (duration: 00m 27s)
[23:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:39:25] !log mwscript namespaceDupes.php --wiki=kowiktionary gives no pages to fix
[23:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:39:49] (CR) MaxSem: [C: 2] Create Draft namespace on kn.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/276743 (https://phabricator.wikimedia.org/T129052) (owner:
Dereckson)
[23:40:24] (Merged) jenkins-bot: Create Draft namespace on kn.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/276743 (https://phabricator.wikimedia.org/T129052) (owner: Dereckson)
[23:41:38] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/276743/ (duration: 00m 26s)
[23:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:42:39] (CR) Krinkle: [C: 1] Avoid legacy overhead in mobile web experience [mediawiki-config] - https://gerrit.wikimedia.org/r/277422 (owner: Jdlrobson)
[23:43:22] Operations, Discovery, Wikidata, Wikidata-Query-Service: implement wdqs1001/1002 disk upgrades (extend lvm) - https://phabricator.wikimedia.org/T120714#2120869 (Smalyshev) How long would reimaging take? In principle, I'm OK with it but would like to know when it can happen and for how long it would...
[23:43:24] Dereckson, https://gerrit.wikimedia.org/r/#/c/276743/1/wmf-config/InitialiseSettings.php has space instead of underscore - please fix or this needs to be reverted
[23:43:48] I don't see the namespace @ https://kn.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=general%7Cnamespaces%7Cnamespacealiases%7Cstatistics
[23:43:53] oh okay
[23:43:56] (PS4) Krinkle: Avoid legacy overhead in mobile web experience [mediawiki-config] - https://gerrit.wikimedia.org/r/277422 (owner: Jdlrobson)
[23:43:59] Fixing that.
[23:44:05] (CR) Krinkle: "Removed unused global and double line break." [mediawiki-config] - https://gerrit.wikimedia.org/r/277422 (owner: Jdlrobson)
[23:44:16] jdlrobson: I'll unswat it later tonight if that's okay
[23:44:18] nice 5kb saver
[23:44:21] MatmaRex, you can run on small wikis meanwhile
[23:44:21] (CR) Jdlrobson: [C: 1] "thanks Timo!"
[mediawiki-config] - https://gerrit.wikimedia.org/r/277422 (owner: Jdlrobson)
[23:44:26] Krinkle: that would be great
[23:44:38] i like saving 5kb ;-)
[23:44:48] MaxSem: i can't. you can. wanna? :D
[23:45:01] heh
[23:46:19] (PS1) Dereckson: Fix namespace configuration for kn.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/277432 (https://phabricator.wikimedia.org/T129052)
[23:46:27] {{I thought you were already an admin}}
[23:46:28] MaxSem: here the fix ^
[23:46:39] !log ran mwscript maintenance/updateCollation.php --wiki=ruwikibooks --force
[23:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:47:08] (CR) MaxSem: [C: 2] Fix namespace configuration for kn.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/277432 (https://phabricator.wikimedia.org/T129052) (owner: Dereckson)
[23:47:22] MaxSem: don't want to. i'm wearing enough hats already. :)
[23:47:33] (Merged) jenkins-bot: Fix namespace configuration for kn.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/277432 (https://phabricator.wikimedia.org/T129052) (owner: Dereckson)
[23:47:55] LET'S TALK WHO'S COMMITTED TO THE CAUSE
[23:48:52] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/277432 (duration: 00m 26s)
[23:48:54] (CR) Dereckson: "Follow-up: Ie186a54bbdedb9a3ec19abea6e3ff9ce43c00202" [mediawiki-config] - https://gerrit.wikimedia.org/r/276743 (https://phabricator.wikimedia.org/T129052) (owner: Dereckson)
[23:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:49:45] PROBLEM - puppet last run on mw2070 is CRITICAL: CRITICAL: puppet fail
[23:50:05] MaxSem: 276743+276743 tested
[23:50:46] !log mwscript namespaceDupes.php --wiki=knwiki --fix produced 3 changes unrelated to new namespaces, --source-pseudo-namespace gave no results
[23:50:49] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:51:46] !log ran mwscript maintenance/updateCollation.php --wiki=ruwikivoyage --force
[23:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:52:51] !log ran mwscript maintenance/updateCollation.php --wiki=ruwikiversity --force
[23:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:53:25] PROBLEM - HHVM rendering on mw1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:54:50] MaxSem: i wonder how long are they actually taking?
[23:55:02] MaxSem: can you run the bigger ones with `time`? i'm mostly just curious.
[23:55:06] roughly as much as you estimated
[23:55:23] !log ran mwscript maintenance/updateCollation.php --wiki=ruwikiquote --force
[23:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:58:44] RECOVERY - HHVM rendering on mw1025 is OK: HTTP OK: HTTP/1.1 200 OK - 70039 bytes in 0.184 second response time