[00:08:24] (CR) Dereckson: [C: 1] Enable signature button at NS:102 for frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/272479 (https://phabricator.wikimedia.org/T127688) (owner: MarcoAurelio)
[00:08:58] (CR) Dereckson: "As indicated on the ticket, the namespace is to organize content, not content per se, so no, you don't need to add it there." [mediawiki-config] - https://gerrit.wikimedia.org/r/272479 (https://phabricator.wikimedia.org/T127688) (owner: MarcoAurelio)
[00:14:08] (PS2) Dereckson: Enabled ogg opus support for TimedMediaHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[00:23:30] (PS3) Dereckson: Enabled ogg opus support for TimedMediaHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[00:23:44] PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: puppet fail
[00:24:06] ori: you around?
[00:24:44] RECOVERY - cassandra-a CQL 10.64.32.205:9042 on restbase1013 is OK: TCP OK - 0.006 second response time on port 9042
[00:25:31] (CR) Dereckson: [C: 1] "PS3: rebased" [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[00:26:21] (CR) Paladox: "@Dereckson how do I do that please." [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[00:27:13] (CR) Dereckson: [C: 1] "@Glaisher Could you schedule it for SWAT?"
[mediawiki-config] - https://gerrit.wikimedia.org/r/252627 (owner: Glaisher)
[00:28:16] paladox: you have the details on https://wikitech.wikimedia.org/wiki/SWAT_deploys and the checklist of points to verify
[00:28:25] Ok thanks
[00:29:07] paladox: if you think your patch matches these criteria (it looks so for me), you can add it to the table at https://wikitech.wikimedia.org/wiki/Deployments#Week_of_March_28th
[00:29:32] Dereckson: thanks do i do it for morning or afternoon
[00:29:35] you need to be on this channel the deployment hour, and be ready to test if the change works like expected
[00:29:58] there are two windows to allow for several timezone and working time, so it's up to you
[00:30:33] (CR) Jforrester: [C: -2] "We do not test things in production before we put them in master except in very rare circumstances. Unless Brion OKs this, please do not d" [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[00:30:38] Dereckson: Oh, i wont be able to test since i only proposed the patch. thedj would you be able to test it please
[00:30:50] thedj https://gerrit.wikimedia.org/r/#/c/256967/
[00:31:26] James_F: what do you suggest, to test it on beta cluster?
[00:31:58] Dereckson: If the config setting isn't good enough to enable in the extension by default, local testing would be a start. :-)
[00:34:21] I was under the assumption it had been tested, as it's a part of the extension, just not enabled by default.
[00:35:51] I'm not sure it's been tested on a multi-wiki site, for instance. Then into Beta Cluster.
[00:35:54] yeah no that patch looks super wrong
[00:36:05] it would disable ogg vorbis output
[00:36:07] We don't throw things into live production and then ask Commons community members to tell us if it works for them.
[00:36:10] That's not cool.
[00:36:21] bd808: are you around?
[00:36:35] Indeed, that's not.
[00:36:48] Dereckson: That's the justification on that patch, though.
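Editorial sketch of the concern raised just above ("it would disable ogg vorbis output"): in PHP, assigning a new array to a setting replaces whatever list it held before, while appending with `[]` keeps the existing entries. The stand-in class and its constant values are placeholders so the snippet runs outside MediaWiki, and the starting value of the list (Vorbis only) is an assumption for illustration, not something the log states.

```php
<?php
// Stand-in for the extension class so this sketch is self-contained.
// The constant values are placeholders, not necessarily the real ones.
final class WebVideoTranscode {
    const ENC_OGG_VORBIS = 'vorbis-placeholder';
    const ENC_OGG_OPUS = 'opus-placeholder';
}

// Assumed starting point: the list holds Vorbis only.
$wgEnabledAudioTranscodeSet = [ WebVideoTranscode::ENC_OGG_VORBIS ];

// Plain assignment builds a brand-new list, so Vorbis is dropped.
$replaced = [ WebVideoTranscode::ENC_OGG_OPUS ];

// Appending keeps the existing entries and adds Opus on top.
$appended = $wgEnabledAudioTranscodeSet;
$appended[] = WebVideoTranscode::ENC_OGG_OPUS;

var_dump( in_array( WebVideoTranscode::ENC_OGG_VORBIS, $replaced, true ) ); // bool(false)
var_dump( in_array( WebVideoTranscode::ENC_OGG_VORBIS, $appended, true ) ); // bool(true)
```

Listing both constants explicitly in one assignment would give the same result as the append, at the cost of repeating the existing entry in the config.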
[00:38:14] (CR) Brion VIBBER: [C: -1] "Looking quickly at this config patch, appears that it would enable Opus but disable Vorbis output for non-Vorbis sources, which could be b" [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[00:38:40] As a side question, it could be interesting to revisit the extension code. To let toxic code - even commented - in master branch isn't helpful.
[00:40:46] Dereckson: Oh, indeed. TMH is a complete nightmare. thedj and brion are awesome for trying to tackle it.
[00:40:58] it'll be a while yet before it doesn't suck ;)
[00:41:24] i would actually like to enable opus for audio at some point
[00:41:37] ms edge should be getting native opus decoding in future
[00:42:08] brion: As a transcoding target or just as an alternative format?
[00:42:28] James_F: as transcode target; we already allow uploads i think if you format em right
[00:42:39] So, three of them? ;-)
[00:42:58] hmm actually that reminds me
[00:43:02] Do we have to give Ops the heads-up if we add new transcode targets? Can't they use quite a lot of space?
[00:43:06] for edge i don't know if they'll support ogg container :D
[00:43:18] so might need an opus-in-webm audio output ;)
[00:43:23] We should ship inside mkv. ;-)
[00:43:30] James_F: yeah i would recommend that. for audio only it shouldn't be huge amount of space though
[00:43:35] * James_F mumbles about container fanboyism.
[00:43:38] webm == mkv, almost ;)
[00:43:42] well subset
[00:43:48] * James_F nods.
[00:44:18] edge is testing vp9 again in latest preview build \o/
[00:44:21] still experimental
[00:44:30] and doesn't yet work with our stuff
[00:44:30] I've prepared https://gerrit.wikimedia.org/r/278427 to add a warning about this setting, would that be valuable to merge?
[00:44:56] brion: Details.
:-)
[00:44:59] Dereckson: that looks wrong
[00:45:04] Dereckson: should be fine to enable them both at once
[00:45:16] the config patch was only enabling opus and disabling vorbis
[00:45:28] if it were correct it'd work fine afaik
[00:49:32] Equivalent to the difference between 'enwiki' => … and '+enwiki' => … :-)
[00:50:16] yeah i think it might work with += instead of =
[00:50:30] or... /me is always suspicious of php arrays
[00:51:09] And config could be easier to read with the two explicitly noted.
[00:52:07] So would it be acceptable to (1) define wgEnabledAudioTranscodeSet to both WebVideoTranscode::ENC_OGG_OPUS and WebVideoTranscode::ENC_OGG_VORBIS (2) enable this on beta.wmflabs.org, or is further testing needed beforehand?
[00:52:08] ok $foo[] = blah would be better here
[00:52:38] Dereckson: i think that should be ok yeah, either set them both explicitly or add the ENC_OGG_OPUS on top without replacing the whole array
[00:52:58] note it's a straight vector of string constant keys, not an associative array
[00:53:05] so it's a little funky compared to a lot of our settings
[00:54:41] (CR) Brion VIBBER: "Best to test on beta cluster first.
:) Looks like we should either include the vorbis setting in this array explicitly as well, or only ap" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[00:54:55] Dereckson: my sample line there should i think work
[00:56:26] (PS4) Dereckson: Enabled Ogg Opus support for TimedMediaHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[00:56:35] might also consider holding off on opus audio until can do audio-only transcode output in webm container as that's more likely to be supported by future devices than opus in ogg, but i'm not against it :)
[00:57:44] (CR) Dereckson: "PS4: setting moved to beta, default codec not overwritten" [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[00:58:07] (CR) Brion VIBBER: [C: 1] "Looks like it should work correctly, and now is moved over to beta. :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[01:00:10] (CR) Brion VIBBER: "Note it may be more worthwhile to actually deploy opus audio in .webm container instead of Ogg/.opus, as Matroska/WebM container is being " [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[01:17:06] (PS3) Dereckson: Add initial rescore profiles for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/257607 (https://phabricator.wikimedia.org/T110648) (owner: DCausse)
[01:17:49] (CR) Dereckson: "PS3: rebased, and moved config to *-labs per EBernhardson comment."
[mediawiki-config] - https://gerrit.wikimedia.org/r/257607 (https://phabricator.wikimedia.org/T110648) (owner: DCausse)
[01:22:55] PROBLEM - puppet last run on mw2110 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:23:12] (PS2) Dereckson: Reduce sampling rate for language switcher [mediawiki-config] - https://gerrit.wikimedia.org/r/272724 (https://phabricator.wikimedia.org/T127212) (owner: Bmansurov)
[01:24:30] (CR) Dereckson: "PS2: rebased, added reference to the task ID" [mediawiki-config] - https://gerrit.wikimedia.org/r/272724 (https://phabricator.wikimedia.org/T127212) (owner: Bmansurov)
[01:29:00] urandom: I wasn't then, but I am now
[01:29:05] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[01:29:13] ori: hi! :)
[01:29:16] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[01:29:34] ori: i was going to see if i could convince you to +2 https://gerrit.wikimedia.org/r/#/c/278402/1
[01:30:20] it will add the config or a second cassandra instance on restbase1013, part of a long-running and totally boring (read: safe) process
[01:30:28] s/or a second/for a second/
[01:30:35] Heh, I was just about to say -- I can't meaningfully review it, but if you tell me it's safe, that's fine by me.
[01:30:53] What's the worst that could happen?
[01:30:55] it's so routine at this point it is bordering on tedious
[01:31:10] the new instance could fail to bootstrap leaving exactly where we are now
[01:31:19] leaving us, that is
[01:31:27] (CR) Ori.livneh: [C: 2] enable instance 'b'; restbase1013-b [puppet] - https://gerrit.wikimedia.org/r/278402 (https://phabricator.wikimedia.org/T125842) (owner: Eevans)
[01:31:34] ori: awesome, thanks
[01:31:48] (PS5) Dereckson: Remove $wgCopyrightIcon [mediawiki-config] - https://gerrit.wikimedia.org/r/261999 (https://phabricator.wikimedia.org/T122754) (owner: Florianschmidtwelzow)
[01:32:14] (CR) jenkins-bot: [V: -1] Remove $wgCopyrightIcon [mediawiki-config] - https://gerrit.wikimedia.org/r/261999 (https://phabricator.wikimedia.org/T122754) (owner: Florianschmidtwelzow)
[01:32:50] puppet-merge is being s l o w
[01:32:54] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:32:54] so it is not merged yet
[01:33:01] and look, we fixed the site! ^
[01:33:05] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:33:21] it's magic!
[01:34:07] finally merged
[01:34:19] do you need me to run puppet somewhere? you have sudo on the relevant nodes, right?
[01:34:25] i do yes
[01:34:34] just fired it off
[01:37:35] ori: there will be a cql service failure for the new instance shortly, totally expected, i'll ack it when the time comes
[01:38:29] it'll clear when the node finishes bootstrapping and goes online
[01:38:47] (PS6) Dereckson: Remove $wgCopyrightIcon [mediawiki-config] - https://gerrit.wikimedia.org/r/261999 (https://phabricator.wikimedia.org/T122754) (owner: Florianschmidtwelzow)
[01:43:32] (CR) Dereckson: [C: 1] "PS6: rebased, .gitignore doesn't contain any reference to images anymore by the way" [mediawiki-config] - https://gerrit.wikimedia.org/r/261999 (https://phabricator.wikimedia.org/T122754) (owner: Florianschmidtwelzow)
[01:47:56] RECOVERY - puppet last run on mw2110 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[01:49:24] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[01:54:05] !log bootstrapping restbase1013-b.eqiad.wmnet : T125842
[01:54:06] T125842: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842
[01:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:54:16] better late than never...
[01:54:30] (CR) Paladox: "Thanks." [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[02:05:33] Operations, Commons, MediaWiki-Uploading, Multimedia, and 3 others: Special:UploadStash thumbnails failing to generate with 500 & 503 - https://phabricator.wikimedia.org/T130204#2135995 (matmarex) There's actually an UploadWizard-specific bit here, see T130437. That only affects Internet Explorer, o...
[02:06:27] PROBLEM - cassandra-b CQL 10.64.32.206:9042 on restbase1013 is CRITICAL: Connection refused
[02:06:48] ^^^ there it is; got this
[02:07:18] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.32.206:9042 on restbase1013 is CRITICAL: Connection refused eevans Bootstrapping - The acknowledgement expires at: 2016-03-20 02:07:03.
[02:13:06] (PS2) Dereckson: Flake8 for labstore and wdqs [puppet] - https://gerrit.wikimedia.org/r/278270 (owner: Ladsgroup)
[02:13:17] (CR) Dereckson: [C: 1] Flake8 for labstore and wdqs [puppet] - https://gerrit.wikimedia.org/r/278270 (owner: Ladsgroup)
[02:23:15] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.17) (duration: 10m 07s)
[02:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:23:38] PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: puppet fail
[02:24:17] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:27:38] (CR) Smalyshev: [C: 1] Flake8 for labstore and wdqs [puppet] - https://gerrit.wikimedia.org/r/278270 (owner: Ladsgroup)
[02:27:56] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[02:28:07] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:29:48] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[02:31:46] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Mar 19 02:31:46 UTC 2016 (duration 8m 31s)
[02:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:36:52] (CR) Tim Landscheidt: [C: 1] Flake8 for labstore and wdqs [puppet] - https://gerrit.wikimedia.org/r/278270 (owner: Ladsgroup)
[03:33:36] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:35:17] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[03:35:24] (PS2) Tim Landscheidt: diamond: Remove unnecessary/incorrect include of stdlib [puppet] - https://gerrit.wikimedia.org/r/273483
[03:35:26] (PS3) Tim Landscheidt: diamond: Fix comments [puppet] - https://gerrit.wikimedia.org/r/273451
[03:40:36] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:42:16] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[04:11:57] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:15:18] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[04:32:56] mutante:o/
[04:34:36] getting this error when I run puppet agent -tv after applying a puppet::self role. It used to work few days back. Anything changed? https://gist.github.com/sabyasachi/90439a41fb564a605b6c
[04:35:50] in this instance: https://wikitech.wikimedia.org/wiki/Nova_Resource:Sabya4.ores-staging.eqiad.wmflabs
[04:37:25] sabya: the Puppet run is failing because Puppet is configured to ensure the Puppetmaster service is running, and the service failed to start. Did you follow the advice in the output?
[04:37:31] > Job for puppetmaster.service failed. See 'systemctl status puppetmaster.service' and 'journalctl -xn' for details.
[04:39:15] ok.
[04:42:03] puppetmaster failed to start because of cert errors
[04:44:43] ori: Could not request certificate: Connection refused - connect(2) for "localhost" port 8140
[04:44:51] could this be the reason?
[04:48:50] it probably is, yeah
[04:52:31] (CR) EBernhardson: "actually my labs comment was about something slightly different, the labs here is the beta cluster, which has a tiny index."
[mediawiki-config] - https://gerrit.wikimedia.org/r/257607 (https://phabricator.wikimedia.org/T110648) (owner: DCausse)
[04:54:47] (CR) EBernhardson: "another thing that might be useful, we are working up a relevance forge project which is about being able to run sets of queries and judge" [mediawiki-config] - https://gerrit.wikimedia.org/r/257607 (https://phabricator.wikimedia.org/T110648) (owner: DCausse)
[04:59:56] ori: got it working.
[05:00:02] \o/
[05:00:43] * sabya is noob in puppet
[05:05:27] PROBLEM - puppet last run on mw2038 is CRITICAL: CRITICAL: puppet fail
[05:16:58] ori: i'm wondering if the mobileapps flapping was due to that restbase change a little bit beforehand. reading the backscroll, urandom seemed confident it wouldn't cause trouble, though, so who knows. we haven't changed anything lately.
[05:17:04] ori: i'll keep an eye on it.
[05:18:48] ori: thanks again for the heads-up
[05:19:01] mdholloway: np -- I still think something is amiss, tho
[05:19:12] you can see in ganglia that CPU usage is lower than it has been in recent days:
[05:19:26] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Service+Cluster+B+eqiad&h=scb1001.eqiad.wmnet&jr=&js=&v=4.9&m=cpu_user&vl=%25&ti=CPU+User
[05:19:44] i see service workers dying left and right
[05:19:48] https://www.irccloud.com/pastebin/nlhv5oQm/
[05:19:51] ori: you're right, that doesn't look good.
[05:19:55] more worryingly, bytes in/out are flat
[05:19:58] memory issues
[05:20:06] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Service+Cluster+B+eqiad&h=scb1001.eqiad.wmnet&jr=&js=&v=82660.59&m=bytes_out&vl=bytes%2Fsec&ti=Bytes+Sent
[05:22:18] ori: interesting spike before it got so low
[05:24:06] bearND: yeah, i saw that too. memory use went crazy for some reason on both scb1001 & scb1002 ca.
02:17 UTC
[05:25:02] the spike is neatly symmetrical with a spike on logstash -- http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Logstash+cluster+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report
[05:25:15] so probably there was a bust of log messages around that time
[05:25:22] *burst
[05:25:52] the numbers match, ~3M/s at peak
[05:26:03] ori: yep.
[05:27:12] I have to run, sorry :-/ don't hesitate to page ops if there is substantial user impact and you are blocked
[05:28:43] ori thanks for notifying us
[05:29:29] ori: no worries, thanks again!
[05:30:30] bearND: the app seems to still be working fine, at least
[05:31:28] RECOVERY - puppet last run on mw2038 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[05:31:49] mdholloway: i think that's due to pre-generation. Most pages should be stored in RB Cassandra already. The impact most likely would be that latest revisions don't get updated as quickly as they used to
[05:43:03] hmm, if I read https://logstash.wikimedia.org/#/dashboard/elasticsearch/restbase correctly it looks like other RB services are having heap issues, too
[05:44:06] restbase1003 through 1011
[05:46:19] bearND: hmm, yeah.
[05:46:56] bearND: if you change the timeline to 6 hours ago through a few seconds ago, looks like there was a huge spike around 00:20
[05:47:07] i don't know much about cassandra, but i do wonder a bit what limits those are, 460M heap total seems tiny for a java process
[05:48:39] could be. I'm no cassandra expert.
[05:50:35] the heap usage graph in grafana looks fairly typical for a java process, it's not really running out of memory because gc looks to be collecting it back down to 20-40% (although i'm mostly just guessing here too): https://grafana.wikimedia.org/dashboard/db/restbase-cassandra-gc?panelId=34&fullscreen
[05:51:16] I was going to see if VE would be impacted but strangely I only see wikitext editing on enwiki or eswiki. Did I miss anything re: VE lately?
[05:53:34] mdholloway: does VE work for you?
[05:54:41] bearND: i'm not seeing it on enwiki either
[05:54:45] (CR) OliverKeyes: "As mentioned on Phabricator I'm not quite whether this change now adds anything: the underlying pageview definition has been patched for q" [mediawiki-config] - https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T116609) (owner: Nemo bis)
[05:57:04] still working on mediawiki.org
[06:02:25] ok
[06:11:01] mdholloway: not sure what we can do here. I'm going to send email to gwicke, mobrovac and the rest of the services team
[06:12:01] bearND: sounds good to me. especially from the more recent errors in https://logstash.wikimedia.org/#/dashboard/elasticsearch/restbase this seems likely a restbase/cassandra issue
[06:13:48] mdholloway: i tend to agree
[06:16:07] bearND: all right, i was just about to go to bed before ori's email showed up, so i think i'll head off.
[06:16:13] bearND: good night!
[06:16:30] mdholloway: good night. Me too
[06:29:57] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:16] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:47] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: puppet fail
[06:30:48] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:07] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:26] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:37] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:38] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:44:17] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:55:28] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:56:06] RECOVERY - puppet last run
on wtp2017 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:56:17] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:56:38] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:56:47] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:56:48] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:56:57] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:57:57] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:57] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: puppet fail
[07:12:16] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:19:46] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]
[07:19:46] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]
[07:26:46] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[07:52:46] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[07:52:47] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant:
0, excluded: 0, unused: 0
[07:58:21] (PS4) Nemo bis: Use full URL in $wgNoticeHideUrls [mediawiki-config] - https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442)
[08:03:31] (Abandoned) Kelson: Fix regex to enable upload from ETHZ Library with the GWT [mediawiki-config] - https://gerrit.wikimedia.org/r/273774 (owner: Kelson)
[08:03:51] (PS7) Nemo bis: Enable Translate extension on AffCom (chapcomwiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/275289 (https://phabricator.wikimedia.org/T66122) (owner: MarcoAurelio)
[08:15:27] (PS1) Yuvipanda: tools: Do not have static class inherit from toollabs [puppet] - https://gerrit.wikimedia.org/r/278431 (https://phabricator.wikimedia.org/T128411)
[08:15:54] (CR) Yuvipanda: "I think I27a39b3352abb93babc7ed19b642f76524470c2d is the right way to do this." [puppet] - https://gerrit.wikimedia.org/r/277862 (https://phabricator.wikimedia.org/T128411) (owner: Tim Landscheidt)
[08:48:36] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]
[08:49:57] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:50:07] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]
[08:51:56] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[08:52:06] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0
[08:52:28] PROBLEM - Labs LDAP on seaborgium is CRITICAL: Could
not bind to the LDAP server
[08:54:06] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[08:58:47] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:59:57] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures
[09:04:27] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]
[09:04:37] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]
[09:06:16] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[09:06:26] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0
[10:21:47] thumbnail generation problems at wikimania 2016 wiki
[10:21:51] worth opening a bug?
[10:24:07] PROBLEM - Disk space on labstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:25:46] RECOVERY - Disk space on labstore2001 is OK: DISK OK
[10:38:47] PROBLEM - puppet last run on mw2206 is CRITICAL: CRITICAL: puppet fail
[10:41:37] Operations, MediaWiki-Uploading, Multimedia: Create page for https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2136256 (Steinsplitter)
[10:44:08] RECOVERY - cassandra-b CQL 10.64.32.206:9042 on restbase1013 is OK: TCP OK - 0.000 second response time on port 9042
[10:51:50] !log Labs LDAP is probably down.
T130446 Can't log to tools-login.wmflabs.org / Jenkins interface and Nodepool yields error 500 communicating with OpenStack API
[10:51:51] T130446: Unable to SSH onto tools-login.wmflabs.org - https://phabricator.wikimedia.org/T130446
[10:54:57] Operations, Multimedia: Create page for https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2136300 (Peachey88)
[10:55:33] Operations, Multimedia: Create page for https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2136256 (Peachey88) -#mediawiki-uploading nothing to do with MediaWiki's internal uploading system.
[11:06:05] (PS2) Nemo bis: Disable upload on ia.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/278411 (https://phabricator.wikimedia.org/T130425) (owner: Dereckson)
[11:07:07] RECOVERY - puppet last run on mw2206 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[11:38:06] RECOVERY - Labs LDAP on seaborgium is OK: LDAP OK - 0.020 seconds response time
[11:38:12] !log restart slapd on seaborgium, oom-killed
[11:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:46:58] Vito: yes please, a bug would be appreciated
[11:47:07] godog: already done!
[11:47:14] https://phabricator.wikimedia.org/T130448
[11:48:37] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active
[11:48:53] Vito: sweet, thanks!
[11:54:57] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:56:17] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[12:32:34] (CR) Mobrovac: [C: 1] make logstash messages separable by cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/278330 (https://phabricator.wikimedia.org/T130393) (owner: Eevans)
[12:51:17] Blocked-on-Operations, Operations, Continuous-Integration-Infrastructure, Patch-For-Review: Build Debian package ruby-jsduck for Jessie - https://phabricator.wikimedia.org/T95008#1177814 (Paladox) Hi how will we still use jsduck when migrating to Jessie npm 4.3.
[13:41:10] Operations, Multimedia, Design: Create page for https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2136461 (Danny_B)
[13:58:37] RECOVERY - MariaDB Slave Lag: s6 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 89399.00 seconds
[14:48:07] (PS1) Aklapper: Telugu Wikipedia outreach activity [mediawiki-config] - https://gerrit.wikimedia.org/r/278446 (https://phabricator.wikimedia.org/T130447)
[14:50:14] (CR) Dereckson: [C: 1] Telugu Wikipedia outreach activity [mediawiki-config] - https://gerrit.wikimedia.org/r/278446 (https://phabricator.wikimedia.org/T130447) (owner: Aklapper)
[14:51:04] (CR) Gergő Tisza: [C: 1] Logging: add ApiAction kafka logging [mediawiki-config] - https://gerrit.wikimedia.org/r/278347 (https://phabricator.wikimedia.org/T108618) (owner: BryanDavis)
[14:53:36] Hi. Could someone deploy https://gerrit.wikimedia.org/r/278446 — this is a throttle rule for an event for this Sunday filled in last minute.
[15:00:57] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[15:01:28] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[15:02:46] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[15:03:16] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.039 second response time on port 9042
[15:40:14] (CR) Reedy: [C: 2] Telugu Wikipedia outreach activity [mediawiki-config] - https://gerrit.wikimedia.org/r/278446 (https://phabricator.wikimedia.org/T130447) (owner: Aklapper)
[15:40:39] (Merged) jenkins-bot: Telugu Wikipedia outreach activity [mediawiki-config] - https://gerrit.wikimedia.org/r/278446 (https://phabricator.wikimedia.org/T130447) (owner: Aklapper)
[15:41:10] Thanks Reedy.
[15:43:07] (PS1) Reedy: Remove old throttle rules. Swap array() -> [] [mediawiki-config] - https://gerrit.wikimedia.org/r/278454
[15:43:51] !log reedy@tin Synchronized wmf-config/throttle.php: Throttle rules for event T130447 (duration: 00m 26s)
[15:43:52] T130447: Raise throttling cap on user registration, image upload on commons.wikimedia.org and te.wikipedia.org on 2016-03-20 - https://phabricator.wikimedia.org/T130447
[15:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:44:11] (CR) Dereckson: [C: 1] Remove old throttle rules.
Swap array() -> [] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278454 (owner: 10Reedy) [16:06:23] (03PS1) 10Halfak: Adds compute node to ores [puppet] - 10https://gerrit.wikimedia.org/r/278455 [16:13:46] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [16:14:07] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused [16:14:24] (03PS2) 10Halfak: Adds compute node to ores [puppet] - 10https://gerrit.wikimedia.org/r/278455 (https://phabricator.wikimedia.org/T12345) [16:16:38] (03PS3) 10Halfak: Adds compute node to ores [puppet] - 10https://gerrit.wikimedia.org/r/278455 (https://phabricator.wikimedia.org/T130461) [16:31:38] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active [16:33:25] Hey folks. I'm blocked on getting a puppet change merged. [16:33:27] https://gerrit.wikimedia.org/r/#/c/278413/ [16:33:31] Can someone have a look? [16:33:46] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042 [17:30:14] (03CR) 10Gehel: [C: 032] Adding a `$ensure` parameter to nginx::status_site [puppet/nginx] - 10https://gerrit.wikimedia.org/r/278276 (https://phabricator.wikimedia.org/T130365) (owner: 10Gehel) [17:50:51] (03PS3) 10Gehel: Adding metric collection to nginx in the context of elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/278278 (https://phabricator.wikimedia.org/T130365) [17:51:46] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [17:52:40] halfak: are you still there? [17:52:47] Yeah! [17:53:25] How urgent is that merge? Can it wait next puppet SWAT? [17:53:46] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0] [17:54:00] gehel, hmm... Shouldn't affect anyone by me. 
[17:54:15] And my instances in labs
[17:54:57] halfak: lemme dig a bit into it...
[17:55:01] kk
[17:55:16] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [5000000.0]
[17:55:18] FWIW, this is for the `ores` project. I'm the maintainer.
[17:56:24] I guess it isn't that time critical now. I'm in the middle of a downtime event, so I just manually worked with apt-get to install my dependencies. :/
[17:57:43] I'm fairly new here, I need some reflection time before playing cowboy during the weekend.
[17:57:52] but the patch seems trivial enough...
[17:58:27] which instances is it on labs? Can I have a look in them?
[17:59:41] gehel, heh. All the ones in the ores project on labs
[17:59:54] I wouldn't sweat it if you feel uncomfortable.
[18:00:19] do you have the name of one of them?
[18:00:40] ores-web-05.eqiad.wmflabs
[18:00:49] thanks!
[18:02:26] generally, looks for ores-(web|worker|staging)-[0-9]{2}.eqiad.wmflabs
[18:03:19] strange, I don't have access to those.
[18:03:31] * gehel does not understand fully how labs access work
[18:04:19] halfak: I was wondering which puppetmaster those machines use... and see if you could not cherry-pick your change while waiting for an actual merge.
[18:04:41] halfak: ideally, you should not be blocked waiting for Ops on labs...
[18:05:10] gehel, honestly, don't worry about it. I have downtime and an incident report to worry about right now anyway
[18:05:41] I'll merge it right now... The change is trivial enough that I can feel good about it...
[18:06:04] :) thanks
[18:06:20] give me just 5 minutes to rebase it and merge...
[18:06:55] (PS2) Gehel: Adds arabic and polish languages files to ores role. [puppet] - https://gerrit.wikimedia.org/r/278413 (owner: Halfak)
[18:08:06] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[18:09:10] (CR) Gehel: [C: 2] "Change seems trivial enough, merging as discussed with @halfak" [puppet] - https://gerrit.wikimedia.org/r/278413 (owner: Halfak)
[18:09:28] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[18:10:04] halfak: Merged. I'll be around, ping me if you see anything suspicious...
[18:10:14] Thanks gehel
[18:10:21] Just running puppet on my last replacement worker
[18:10:25] Glad to help
[18:10:31] So I'll know right away
[18:11:52] Looks good
[18:21:16] Operations, Wikimedia-log-errors: "internal_api_error_MWException: [dbf916b7] Exception Caught: Could not acquire lock for" for some uploads (during upload with Pywikibot OAuth) - https://phabricator.wikimedia.org/T129621#2136832 (Aklapper)
[19:23:16] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[19:23:27] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[19:32:26] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[19:33:57] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.037 second response time on port 9042
[19:55:11] Operations, Traffic, Design: Create page for https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2136927 (Krenair) I don't think upload.wikimedia.org has anything to do with apache. I tried `curl -H "Host: upload.wikimedia.org" http://ms-fe.svc.eqiad.wmnet/` on bast1001 and got a...
[19:55:37] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[19:57:37] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:06:43] Operations, Traffic, Design: Create page for https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2136932 (Krenair) No idea where that response file actually comes from either :/
[20:19:06] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:27:16] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 51.72% of data above the critical threshold [5000000.0]
[20:37:57] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[20:48:06] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[20:48:27] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[20:52:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:52:27] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[21:00:54] Operations, Traffic, Design: Create page for https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2136256 (MZMcBride) Redirecting "/" on upload.wikimedia.org on both HTTP and HTTPS to seems reasonable and clean to me.
[21:02:37] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[21:03:58] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042
[21:25:25] Operations, Traffic, Design: Create page for https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2136256 (Southparkfan) @Krenair where does that come from (run curl with -v, and then Server: header)?
[21:28:27] Operations, Traffic, Design: Create page for https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2136996 (Krenair) There is no Server header in that response @Southparkfan
[21:40:30] Operations, Labs, Labs-Infrastructure: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2137010 (hashar)
[21:48:32] Operations, Traffic, Design: Do something better than an "Unauthorized" error page at https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2137019 (MZMcBride)
[21:52:24] Blocked-on-Operations, Operations, Continuous-Integration-Infrastructure, Patch-For-Review: Build Debian package ruby-jsduck for Jessie - https://phabricator.wikimedia.org/T95008#2137023 (hashar) @Paladox by using the `rake-jessie` job which rely on bundler to download dependencies from ruby gems....
[22:07:13] !log clearing snapshots on restbase2004.codfw.wmnet
[22:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:16:46] !log removing 22G of heap dumps
[22:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:16:56] !log removing 22G of heap dumps from restbase2004.codfw.wmnet
[22:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:28:58] !log powercycling oxygen, looks kernel-dead
[22:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:30:07] RECOVERY - Host oxygen is UP: PING OK - Packet loss = 0%, RTA = 4.72 ms
[22:46:58] PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: puppet fail
[23:00:57] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[23:13:56] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[23:20:48] PROBLEM - puppet last run on ms-be3003 is CRITICAL: CRITICAL: puppet fail
[23:43:47] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:47:38] RECOVERY - puppet last run on ms-be3003 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures