[00:02:38] (03PS1) 10Hoo man: Remove unused $wmgMediaViewerBeta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164884 [00:16:59] (03PS1) 10Ori.livneh: apache_status: get stats from 127.0.0.1, not localhost [puppet] - 10https://gerrit.wikimedia.org/r/164885 [01:34:19] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: /mnt/tmpfs 1017 MB (3% inode=99%): [02:17:40] !log LocalisationUpdate completed (1.25wmf1) at 2014-10-06 02:17:40+00:00 [02:17:53] Logged the message, Master [02:28:42] !log LocalisationUpdate completed (1.25wmf2) at 2014-10-06 02:28:42+00:00 [02:28:50] Logged the message, Master [02:58:33] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: puppet fail [02:59:07] !log springle Synchronized wmf-config/db-eqiad.php: depool db1060 (duration: 00m 06s) [02:59:17] Logged the message, Master [03:08:53] springle: hey, how are the _content_model schema changes going? :) [03:18:05] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [03:18:09] legoktm: stalled for about the last fortnight, due to unrelated stuff. continuing now, which is what db1060 is doing ^ [03:20:01] springle: ah ok. yay though, thanks! :D [03:26:31] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Oct 6 03:26:31 UTC 2014 (duration 26m 30s) [03:26:38] Logged the message, Master [03:46:20] RECOVERY - Disk space on ocg1001 is OK: DISK OK [04:06:11] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 222, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/3/1: down - Transit: ! XO (WA/OGXX/563343) {#2009} [10Gbps DF]BR [04:08:45] did someone clean out ocg1001 again? (recovery above) [04:10:02] Coren: does that cr1-eqiad alert look scary? [04:10:57] bblack: look at the graph on upper left of http://status.wikimedia.org/8777/155942/DNS [04:17:21] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 224, down: 0, dormant: 0, excluded: 0, unused: 0 [04:20:44] PROBLEM - puppet last run on ms-be2010 is CRITICAL: CRITICAL: puppet fail [04:33:58] ok, cleaned up the telia mess in RT [04:34:15] Coren: btw, RT 8554 in case you missed it [04:38:07] i don't see any hits for "host:stat1001". 
maybe it's not hooked up to logstash [04:39:06] RECOVERY - puppet last run on ms-be2010 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [04:43:31] (03PS1) 10Chmarkine: phabricator - raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/164897 (https://bugzilla.wikimedia.org/38516) [04:54:13] (03PS1) 10Yurik: Added 4 ppl to notifyOnAllChanges for zero portal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164902 [06:28:17] PROBLEM - puppet last run on lvs2001 is CRITICAL: CRITICAL: puppet fail [06:28:17] PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: puppet fail [06:28:17] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: puppet fail [06:28:37] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: puppet fail [06:29:17] PROBLEM - puppet last run on amssq47 is CRITICAL: CRITICAL: puppet fail [06:29:57] PROBLEM - puppet last run on dbproxy1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:57] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:57] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:06] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:27] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 5 failures [06:30:27] PROBLEM - puppet last run on db1034 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:28] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:37] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:47] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:47] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:56] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:59] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:06] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:07] PROBLEM - puppet last run on search1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:07] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:07] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:17] PROBLEM - puppet last run on mw1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:36] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:37] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:47] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 2 failures [06:39:56] PROBLEM - puppet last run on db1027 is CRITICAL: CRITICAL: Puppet has 1 failures [06:45:25] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:45:34] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:45:34] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:45:46] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:45:46] RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:45:56] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently 
enabled, last run 25 seconds ago with 0 failures [06:45:56] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:46:05] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:46:05] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:46:05] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [06:46:05] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:46:05] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:46:14] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:46:24] RECOVERY - puppet last run on mw1002 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:46:24] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:46:34] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [06:46:36] RECOVERY - puppet last run on db1034 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:46:36] RECOVERY - puppet last run on lvs2001 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [06:46:36] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:46:44] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:46:54] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:47:14] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:47:27] RECOVERY - puppet last run on dbproxy1001 is OK: OK: Puppet is currently enabled, last run 63 seconds ago with 0 failures [06:47:36] RECOVERY - puppet last run on amssq47 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:48:45] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:55:45] RECOVERY - puppet last run on db1027 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [07:16:20] good morning [07:32:07] (03CR) 10Hashar: [C: 031] "I haven't noticed the Xvfb is being passed a display port of 94 to override the default display port 99. Feel free to cherry pick and te" [puppet] - 10https://gerrit.wikimedia.org/r/163791 (owner: 10Krinkle) [07:37:17] (03PS3) 10KartikMistry: Add initial Debian packaging [debs/contenttranslation/apertium-es-ca] - 10https://gerrit.wikimedia.org/r/163578 [07:43:36] (03Abandoned) 10Hashar: add www.soumaya.com.mx to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161192 (https://bugzilla.wikimedia.org/70986) (owner: 10Jeremyb) [07:45:01] PROBLEM - puppet last run on amssq61 is CRITICAL: CRITICAL: puppet fail [07:48:42] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: /mnt/tmpfs 1039 MB (3% inode=99%): [07:49:39] <_joe_> again? 
[07:50:00] <_joe_> OOOOH I LOVE o-c-gee [07:56:35] _joe_: will hopefully get better once https://bugzilla.wikimedia.org/show_bug.cgi?id=68576 is fixed [07:57:02] (of course not a solution to breathe, but some additional apnea time) [08:01:18] <_joe_> Nemo_bis: I hope so :) [08:01:37] <_joe_> Nemo_bis: not sure it's just that anyway [08:02:46] when the PDF for an article is in the hundreds MB, it's easy to consume 1.5 GB [08:03:20] <_joe_> Nemo_bis: https://bugzilla.wikimedia.org/show_bug.cgi?id=71647 [08:03:28] <_joe_> 32 GB is a wee bit more [08:03:45] lol [08:04:42] RECOVERY - puppet last run on amssq61 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [08:05:07] <_joe_> Nemo_bis: always the same article btw [08:05:27] <_joe_> is there a way to disable pdf printing for that article specifically? [08:05:35] <_joe_> as a stopgap solution at least [08:05:57] Generation of the document file has failed. Status: ENOSPC, write [08:06:02] RECOVERY - Disk space on ocg1001 is OK: DISK OK [08:06:08] <_joe_> !log cleaned ocg1001, again [08:06:14] Logged the message, Master [08:06:29] _joe_: I'm not sure, but we can try putting the whole article in class noprint [08:06:58] <_joe_> ok, I don't think my account has any privileges assigned [08:07:49] No privilege would help :) [08:10:14] <_joe_> Nemo_bis: oh ok "class noprint" as css class :) [08:10:33] <_joe_> sorry [08:11:49] <_joe_> I was also looking at {{hide in print}} [08:11:57] <_joe_> but it's a bit invasive [08:14:13] That template has stopped working years ago [08:14:35] noprint "works" https://en.wikipedia.org/w/index.php?title=Information_and_communication_technologies_for_development&printable=yes dunno if OCG respects it [08:23:03] Apache2 on stat1001 is not running (hence stats.wikimedia.org and datasets.wikimedia.org don't work. RT 8554). [08:23:06] Could someone with Ops powers please restart it? [08:35:06] qchris: looking [08:35:12] godog: Thanks! [08:41:31] \o/ [08:41:41] Thanks godog! [08:42:10] qchris: done, the ssl chained certificate was concatenated incorrectly, it might be put back by puppet though, looking for what it caused [08:42:18] why rather than what [08:42:24] qchris: no problem :) [08:42:59] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: /mnt/tmpfs 0 MB (0% inode=99%): [08:44:11] <_joe_> godog: it will [08:44:28] <_joe_> ^^ GRRRRRR [08:45:09] ah yeah the stats.wikimedia.crt is missing the trailing newline, oops! [08:47:04] <_joe_> my rage is for ocg [08:47:18] <_joe_> I'll put on a script that cleans it [08:47:33] Ok. Gonna prepare a patch for the crt. [08:47:36] <_joe_> automatically while we wait for the bug to be fixed [08:48:39] qchris: heh it should be in the part that generates chains IMO [08:49:03] _joe_: yeah we'll need to fix in puppet [08:49:11] Ok. [08:49:19] <_joe_> godog: again, I was speaking of ocg [08:51:17] (03PS1) 10QChris: End stats.wikimedia.org certificate in newline [puppet] - 10https://gerrit.wikimedia.org/r/164914 (https://bugzilla.wikimedia.org/71686) [08:51:47] godog: Even if might need a proper fix in the part that generates the chains, would not [08:51:48] https://gerrit.wikimedia.org/r/164914 [08:52:01] fix the issue until that fix has been written? [08:55:32] qchris: yep, going to merge it [08:55:48] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] End stats.wikimedia.org certificate in newline [puppet] - 10https://gerrit.wikimedia.org/r/164914 (https://bugzilla.wikimedia.org/71686) (owner: 10QChris) [08:56:20] Thanks! 
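
The stopgap _joe_ mentions above (08:47, "a script that cleans it automatically") would look roughly like the sketch below: walk the OCG scratch tmpfs and delete stale temp files once free space drops below a threshold. The /mnt/tmpfs path comes from the Icinga alerts in this log; the 10% threshold and the one-hour age cutoff are illustrative assumptions, not production values.

    #!/usr/bin/env python
    # Stopgap cleanup sketch for the OCG tmpfs filling up (see the DISK
    # CRITICAL alerts above). Thresholds are assumptions, not real config.
    import os
    import time

    MOUNT = '/mnt/tmpfs'
    MIN_FREE_FRACTION = 0.10   # start cleaning below 10% free
    MAX_AGE_SECONDS = 3600     # only remove files untouched for an hour

    def free_fraction(path):
        st = os.statvfs(path)
        return float(st.f_bavail) / st.f_blocks

    def clean(path):
        now = time.time()
        for root, dirs, files in os.walk(path, topdown=False):
            for name in files:
                full = os.path.join(root, name)
                try:
                    if now - os.path.getmtime(full) > MAX_AGE_SECONDS:
                        os.remove(full)
                except OSError:
                    pass  # file vanished or is busy; skip it

    if __name__ == '__main__':
        if free_fraction(MOUNT) < MIN_FREE_FRACTION:
            clean(MOUNT)
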
[08:58:00] Should I file a separate RT ticket about the "certificate chaining making sure that the certificates end in newline", or [08:58:07] will that get scheduled in a different way? [09:00:48] qchris: we can reuse the existing RT no worries [09:00:54] Ok. [09:01:15] Thanks godog and _joe_! \o/ [09:01:27] nice, does that mean I'll stop seeing certificat warnings for stats.wm.o? :) [09:02:28] (03PS1) 10Mark Bergsma: Stop sending mail to the IMAP server (sanger) [puppet] - 10https://gerrit.wikimedia.org/r/164916 [09:03:50] qchris: np, just updated the RT [09:04:44] Nemo_bis: yep I'm not seeing warnings ATM [09:04:54] (03CR) 10Mark Bergsma: [C: 032] Stop sending mail to the IMAP server (sanger) [puppet] - 10https://gerrit.wikimedia.org/r/164916 (owner: 10Mark Bergsma) [09:07:45] !log Stopped dovecot on sanger [09:07:51] Logged the message, Master [09:08:10] wow really [09:08:13] RIP [09:08:24] (03PS1) 10Gergő Tisza: Add tracking categories for files with attribution problems [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164918 [09:10:02] * mark does one last rsync [09:10:11] because imap to gmail still works shit [09:11:26] hrm my transfer speed from sanger is also shit right now [09:11:42] 260 kb/s, wtf is going on [09:12:03] cogent :P [09:22:28] RECOVERY - Disk space on ocg1001 is OK: DISK OK [09:27:12] <_joe_> !log cleaned ocg another time [09:27:16] Logged the message, Master [09:38:34] Did noc.wikimedia.org SSL cert get fixed in some way? https://bugzilla.wikimedia.org/show_bug.cgi?id=64483 [09:38:39] Would be nice for closure to know what fixed it [09:45:21] <_joe_> Krinkle: I guess being moved off of fenari? [10:26:58] (03PS16) 10Krinkle: contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium [puppet] - 10https://gerrit.wikimedia.org/r/163791 [10:27:57] (03CR) 10Krinkle: [C: 031] "Applied locally on integration-puppetmaster.eqiad.wmflabs" [puppet] - 10https://gerrit.wikimedia.org/r/163791 (owner: 10Krinkle) [10:29:51] (03PS1) 10Reza: Enable "import" on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164933 (https://bugzilla.wikimedia.org/71681) [10:33:59] (03PS17) 10Krinkle: contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium [puppet] - 10https://gerrit.wikimedia.org/r/163791 [10:36:49] (03PS18) 10Krinkle: contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium [puppet] - 10https://gerrit.wikimedia.org/r/163791 [10:36:53] (03CR) 10Krinkle: "Fixed "Duplicate declaration: Package[xvfb] is already declared in file /etc/puppet/modules/contint/manifests/browsers.pp:12; cannot redec" [puppet] - 10https://gerrit.wikimedia.org/r/163791 (owner: 10Krinkle) [10:43:13] (03PS2) 10Reza: Administrators can import by upload; [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164933 (https://bugzilla.wikimedia.org/71681) [11:07:08] (03PS2) 10Giuseppe Lavagetto: swift: clean up [puppet] - 10https://gerrit.wikimedia.org/r/164566 [11:07:20] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] swift: clean up [puppet] - 10https://gerrit.wikimedia.org/r/164566 (owner: 10Giuseppe Lavagetto) [11:11:11] !log Shutdown sanger [11:11:17] Logged the message, Master [11:11:53] !log Shutdown tarin [11:11:58] Logged the message, Master [11:12:51] PROBLEM - Host sanger is DOWN: PING CRITICAL - Packet loss = 100% [11:13:42] PROBLEM - Host tarin is DOWN: PING CRITICAL - Packet loss = 100% [11:14:43] <_joe_> kill another one!' [11:17:24] there's nothing left [11:18:36] akosiaris: what's the status of the netapp? 
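
For reference, the failure mode behind the stats.wikimedia.org outage above: a leaf certificate saved without a trailing newline gets glued to the next PEM block when the chained file is concatenated, so -----END CERTIFICATE----- and -----BEGIN CERTIFICATE----- land on the same line and the chain no longer parses. A minimal sketch of the kind of chain-building fix godog is pointing at (filenames are illustrative, not the actual puppet-managed paths):

    #!/usr/bin/env python
    # Build a chained certificate while tolerating missing trailing newlines,
    # the bug that bit stats.wikimedia.crt above.
    def build_chain(output_path, *cert_paths):
        parts = []
        for path in cert_paths:
            with open(path) as f:
                pem = f.read()
            if not pem.endswith('\n'):   # guard against the missing newline
                pem += '\n'
            parts.append(pem)
        with open(output_path, 'w') as out:
            out.write(''.join(parts))

    # Example, leaf first then intermediate(s), matching the usual chain order:
    # build_chain('stats.wikimedia.org.chained.crt',
    #             'stats.wikimedia.org.crt', 'intermediate-ca.crt')
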
[11:19:02] that and the toolserver disk array are the only non-network infrastructure I still see up [11:21:32] (03PS1) 10Giuseppe Lavagetto: swift_new: order output of template [puppet] - 10https://gerrit.wikimedia.org/r/164939 [11:23:47] (03CR) 10Giuseppe Lavagetto: [C: 032] swift_new: order output of template [puppet] - 10https://gerrit.wikimedia.org/r/164939 (owner: 10Giuseppe Lavagetto) [11:27:46] (03PS1) 10Mark Bergsma: Remove all IMAP configuration and Puppet manifests [puppet] - 10https://gerrit.wikimedia.org/r/164940 [11:28:14] _joe_, fyi, ocg is still borked [11:28:22] pdf generation not kicking off .. [11:28:37] <_joe_> Eloquence: I was merely cleaning the tmpfs [11:28:50] <_joe_> I'll try to get something from the logs, thanks [11:28:58] <_joe_> I'm definitely not an ocg expert :) [11:29:10] new PDFs stall at "Waiting for job runner" [11:29:23] might need cscott_away to dig into it [11:29:40] <_joe_> I assumed that since it was keeping filling up its tmpfs [11:29:44] <_joe_> it was indeed working [11:30:13] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: /mnt/tmpfs 683 MB (2% inode=99%): [11:30:51] <_joe_> you called it :P [11:32:02] <_joe_> Eloquence: I'm not sure how the jobs are controlled for ocg [11:32:14] <_joe_> it may even have to do with jobrunners as in php onese [11:32:18] <_joe_> *ones [11:32:48] <_joe_> I'll dig in a little, for what it's worth ocg1001 was still logging as if it was working perfectly [11:33:51] *nod* Scott should be up in about 5 hours or so [11:34:18] Eloquence: And you are awake because... :P [11:34:19] <_joe_> !log rolling restart and cleaning of ocg nodes, trying to unlock pdf generation [11:34:25] RoanKattouw, in berlin :) [11:34:26] Logged the message, Master [11:34:35] RECOVERY - Disk space on ocg1001 is OK: DISK OK [11:35:00] Oooh [11:35:14] (03PS2) 10Mark Bergsma: Remove all IMAP configuration and Puppet manifests [puppet] - 10https://gerrit.wikimedia.org/r/164940 [11:35:14] * RoanKattouw is also in Europe this week [11:35:23] * GroggyPanda slowly wakes up [11:35:35] <_joe_> hi GroggyPanda [11:35:39] eloquence has started his vacation ;-) [11:35:43] <_joe_> where are you in the globe this week? [11:35:49] _joe_: Chennai, India! [11:36:07] I was about to ask, did I miss this in This Week In Engineering, but it's Monday morning so it hasn't been sent out yet [11:36:10] next week - Pondicherry, India. Then the foot of the himalayas! (apparently there is decent 3G there) [11:36:49] _joe_: btw, I've finished the series of patches that will kill misc/icinga.pp :) just need merges now, and then I can refactor some more. 
[11:36:51] <_joe_> Eloquence: I guess ocg1001 gets most of the jobs (I am checking why) and it got completely stuck whenever the disk fills up [11:37:14] <_joe_> meaning that, probably, it was appearing as working, but not really doing anything [11:38:07] <_joe_> I started a job on the incriminated page, and it's working [11:38:50] (03CR) 10Mark Bergsma: [C: 031] Remove all IMAP configuration and Puppet manifests [puppet] - 10https://gerrit.wikimedia.org/r/164940 (owner: 10Mark Bergsma) [11:39:38] * GroggyPanda goes to hunt for food [11:42:41] (03CR) 10Mark Bergsma: [C: 04-1] "This change seems fine by itself, but would break, due to the fact that exim4-daemon-heavy is not (reliably) staying installed, it's repla" [puppet] - 10https://gerrit.wikimedia.org/r/164386 (owner: 10Mark Bergsma) [11:49:52] <_joe_> !log done restarting ocg servers [11:49:58] Logged the message, Master [11:51:42] <_joe_> Eloquence: still getting reports of pdf generation not working? [11:53:43] <_joe_> I did print a couple of pages and it was almost-instantaneous [12:09:09] _joe_, works again for now, thanks [12:13:13] <_joe_> we should keep an eye on cpu/load on those servers as an indicator, while problems get sorted out. [12:14:24] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:17:34] <_joe_> bbl [12:18:14] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.007 second response time [12:33:22] (03PS1) 10Filippo Giunchedi: codfw: add missing machines [software/swift-ring] - 10https://gerrit.wikimedia.org/r/164951 [12:34:32] (03PS2) 10Filippo Giunchedi: codfw: add missing machines to the ring [software/swift-ring] - 10https://gerrit.wikimedia.org/r/164951 [12:34:41] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] codfw: add missing machines to the ring [software/swift-ring] - 10https://gerrit.wikimedia.org/r/164951 (owner: 10Filippo Giunchedi) [12:46:17] "This Week In Engineering", who knew [13:14:53] (03PS1) 10Filippo Giunchedi: swift: add codfw monitoring and dashboards [puppet] - 10https://gerrit.wikimedia.org/r/164957 [13:18:18] <_joe_> back [13:39:41] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [13:44:38] (03CR) 10Mark Bergsma: [C: 031] "Yes." [puppet] - 10https://gerrit.wikimedia.org/r/159441 (owner: 10Dzahn) [13:45:49] mark: I'm not actually travelling - just recovering from jetlag :) Travelling next week, but will still be working during the time [13:46:31] !log starting test swiftrepl run on wikibooks eqiad -> codfw [13:46:37] Logged the message, Master [13:50:47] (03PS2) 10Giuseppe Lavagetto: hiera: use hiera to lookup the cluster [puppet] - 10https://gerrit.wikimedia.org/r/164567 [13:52:27] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [13:52:44] <_joe_> godog: \o/ [13:53:43] \o/ fingers crossed [14:10:22] (03PS1) 10Filippo Giunchedi: codfw: fix container.builder port [software/swift-ring] - 10https://gerrit.wikimedia.org/r/164975 [14:10:29] (03CR) 10Giuseppe Lavagetto: [C: 032] hiera: use hiera to lookup the cluster [puppet] - 10https://gerrit.wikimedia.org/r/164567 (owner: 10Giuseppe Lavagetto) [14:10:35] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] codfw: fix container.builder port [software/swift-ring] - 10https://gerrit.wikimedia.org/r/164975 (owner: 10Filippo Giunchedi) [14:11:08] <_joe_> godog: can I merg your change as well? [14:11:44] _joe_: sure, which change btw? 
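
_joe_'s observation above, that ocg1001 "was appearing as working, but not really doing anything" once its tmpfs filled, is the classic liveness-versus-progress gap: the process was up and still logging, so nothing alerted. A check along the following lines alerts on exhausted scratch space or on the absence of recent output rather than on the process merely running. The paths and the 15-minute window are assumptions for illustration, not OCG's actual health interface.

    #!/usr/bin/env python
    # Progress-based check sketch: Nagios-style exit codes (0 OK, 1 WARN, 2 CRIT).
    import os
    import sys
    import time

    SCRATCH = '/mnt/tmpfs'
    OUTPUT_DIR = '/mnt/tmpfs/ocg'      # wherever finished jobs land (assumed)
    MAX_IDLE_SECONDS = 15 * 60

    def newest_mtime(path):
        newest = 0
        for root, dirs, files in os.walk(path):
            for name in files:
                try:
                    newest = max(newest, os.path.getmtime(os.path.join(root, name)))
                except OSError:
                    pass
        return newest

    def main():
        st = os.statvfs(SCRATCH)
        if st.f_bavail == 0:
            print('CRITICAL: no free space on %s' % SCRATCH)
            return 2
        if time.time() - newest_mtime(OUTPUT_DIR) > MAX_IDLE_SECONDS:
            print('WARNING: no new output in %d minutes' % (MAX_IDLE_SECONDS // 60))
            return 1
        print('OK')
        return 0

    if __name__ == '__main__':
        sys.exit(main())
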
[14:11:56] <_joe_> software/swift-ring sorry [14:11:58] <_joe_> :) [14:12:25] hah last one, yeah not in puppet [14:12:26] andrewbogott: hey! still battling ldap? [14:12:39] GroggyPanda: no, all done I think [14:12:52] andrewbogott: w00t. wanna merge some more icinga changes? :) [14:13:25] current set of 'em just moves everything out of misc/icinga.pp and then kills the file, while also removing some useless code. will need to do more refactoring over time. [14:13:32] only 5 patches! [14:15:02] (03PS2) 10Giuseppe Lavagetto: puppet: drop global variable $puppet_version [puppet] - 10https://gerrit.wikimedia.org/r/164568 [14:15:10] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppet: drop global variable $puppet_version [puppet] - 10https://gerrit.wikimedia.org/r/164568 (owner: 10Giuseppe Lavagetto) [14:15:18] GroggyPanda: I still need my breakfast but will try to catch up in a bit [14:15:31] andrewbogott: ah ok :) [14:15:54] godog: no support for Full-hd on wikimedia ? [14:16:15] the highest i get is 720p [14:16:34] <_joe_> we are for low-emission videos [14:17:00] matanya: no idea, sorry :) [14:17:03] <_joe_> good monday :) [14:17:36] hello both :) sorry for low rate of commits, and high rate of questions [14:26:44] * GroggyPanda waves at matanya [14:27:08] * matanya waves back GroggyPanda [14:27:20] * GroggyPanda hopes to have degrogged in a few days [14:27:37] matanya: misc/icinga.pp is almost dead :) 5 patches left, 4th one rms it :) [14:27:42] <_joe_> GroggyPanda: I am thinking that pandas look groggy in general [14:27:52] _joe_: heh :D true! [14:27:59] although I haven't seen a 'real' one in person [14:28:04] GroggyPanda: i'm closely following :) [14:28:11] I've also looked/felt groggy for the last week anyway [14:30:55] <_joe_> GroggyPanda: next step in refactoring... https://wikitech.wikimedia.org/wiki/Puppet_Hiera [14:31:06] (03PS2) 10Filippo Giunchedi: swift: add codfw monitoring and dashboards [puppet] - 10https://gerrit.wikimedia.org/r/164957 [14:31:17] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: add codfw monitoring and dashboards [puppet] - 10https://gerrit.wikimedia.org/r/164957 (owner: 10Filippo Giunchedi) [14:31:21] _joe_: hmm, I think next step in refactoring is killing a lot of 'dead' code [14:31:28] or verifying that it isn't dead code [14:31:35] <_joe_> yeah, that [14:31:36] _joe_: fun fact - there's still an exec that recursively changes permissions [14:31:43] <_joe_> I thought you were already doing that [14:31:47] because everywhere everything is set to root:root [14:31:52] <_joe_> oh yes that looks funny [14:32:08] _joe_: yeah, that's what I'm doing now. I'm wary because I don't have access to the machine and always need to bug someone about it... [14:32:20] but hopefully I'll snag someone today :) [14:32:51] <_joe_> when are you going to be invested with root access? [14:32:54] _joe_: yeah, also the individual .cfg files for check plugins, lot of them are unused. they just came with the package, and somehow ended up in puppet [14:32:55] <_joe_> do we have a date? [14:33:08] _joe_: next month, no date yet. I guess first week. [14:33:13] let me email mar.k and tomasz [14:33:14] <_joe_> you know there is that test before you can have root [14:33:37] by test you mean 'hazing' right? :) [14:33:48] <_joe_> where we verify if you have forgot everything about javascript, or we won't let you near a console [14:33:49] verify you can issue reboot on the right machine ? [14:34:02] _joe_: hahaha :D Sadly I'll fail that one... 
[14:34:09] * GroggyPanda wrote JS as recently as two days ago [14:34:09] common ops failure [14:34:14] <_joe_> GroggyPanda: that was tailored to you [14:34:20] hehe [14:34:28] <_joe_> GroggyPanda: I have written a couple npm modules as well [14:34:36] <_joe_> which are horrible, btw [14:34:40] ah, but I do worse. I write *client side* JS [14:34:53] and CSS as well [14:34:59] <_joe_> you got that wrong [14:35:08] <_joe_> ops don't care about client-side JS [14:35:12] haha :) [14:35:20] <_joe_> as long as that thing stays away from our servers [14:35:23] _joe_: a fun must read: http://www.cyberciti.biz/tips/my-10-unix-command-line-mistakes.html [14:35:46] the comments section, mostly [14:36:08] _joe_: :) apparently we've standardized on nodejs for all our 'services', so I don't think it is going away anytime soon [14:36:36] _joe_: I hope to move by Nov 3. Emailed the powers that be. [14:36:51] hi papaul [14:36:55] hi [14:38:49] _joe_: btw, your hiera change for contacts.cfg requires that the current contents of contacts.cfg be ported to hiera for it to work... [14:39:07] and I don't have access to contacts.cfg in prod :) :( [14:40:58] (03CR) 10Manybubbles: "Any worry about getting enough translations before turning this on?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164918 (owner: 10Gergő Tisza) [14:44:14] <_joe_> GroggyPanda: I know, on it in ~ 10 mins [14:44:21] w00t, thanks [14:46:08] manybubbles, marktraceur, ^demon|away: So which of us wants to SWAT today? [14:46:23] anomie:, ^demon|away, marktraceur: I can do it! [14:46:27] ok [14:46:32] <^demon|away> okie dokie [14:47:35] * aude here :) [14:50:10] aude: thanks! [14:51:03] !log disconnecting Tampa servers [14:51:09] Logged the message, Master [14:53:51] (03PS2) 10PleaseStand: Remove obsolete flags (all of them) from $wgAntiLockFlags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164012 [14:57:15] (03CR) 10PleaseStand: [C: 04-1] "Do not deploy until 1.25wmf3 is deployed to ALL wikis (Thursday, 16 October 2014)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164012 (owner: 10PleaseStand) [14:58:19] (03CR) 10Gergő Tisza: "No, that should be OK. It's only deployed to non-English projects on Thursday, and often tracking category names are only translated when " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164918 (owner: 10Gergő Tisza) [15:00:04] godog: Can you clarify https://rt.wikimedia.org/Ticket/Display.html?id=8554 ? I can't tell if you think it's resolved or that there's still an issue there. [15:00:05] manybubbles, anomie, ^d, marktraceur, aude: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141006T1500). [15:00:25] aude: you are going first [15:00:43] it announced me ? [15:00:49] ready [15:01:20] aude: I didn't realize you were on the list of SWATers too [15:01:22] cool [15:02:17] tgr: around to verify you SWAT deploy? [15:02:19] i am ? (wouldn't mind) [15:03:09] manybubbles: yes, but there isn't much to verify, it's an 1.25wmf2 patch [15:03:13] aude: the bot mentioned you. not sure why. maybe its because you are in the line in the table [15:03:16] aude: You're not on the list; talk to greg-g if you want to be. I have no idea why jouncebot included you this morning. [15:03:22] (03CR) 10Manybubbles: [C: 032] Add tracking categories for files with attribution problems [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164918 (owner: 10Gergő Tisza) [15:03:26] ok :) [15:04:32] zuul is stuck! 
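
The contacts.cfg port that GroggyPanda and _joe_ discuss above (around 14:38) amounts to turning the file into data plus a template: once the contacts live in hiera, rendering the standard Icinga "define contact" blocks is mechanical. A rough illustration of that shape; the contact entries are invented, only the directive names are the stock Icinga/Nagios ones:

    #!/usr/bin/env python
    # Render Icinga contact definitions from a data structure (hiera-style).
    CONTACTS = [
        {
            'contact_name': 'example_oncall',
            'email': 'oncall@example.org',
            'service_notification_commands': 'notify-service-by-email',
            'host_notification_commands': 'notify-host-by-email',
        },
    ]

    def render(contacts):
        blocks = []
        for contact in contacts:
            lines = ['define contact{']
            for key, value in sorted(contact.items()):
                lines.append('    %-40s%s' % (key, value))
            lines.append('}')
            blocks.append('\n'.join(lines))
        return '\n\n'.join(blocks) + '\n'

    if __name__ == '__main__':
        print(render(CONTACTS))
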
[15:04:44] hashar: can kick zuul? [15:04:44] aude: I think because next week we'll need your help covering the three deploys [15:05:08] ah, yes more swat times? [15:05:14] manybubbles: I'll do it, one moment [15:05:31] for the amount of stuff we put in swat, think we should help out more [15:05:31] andrewbogott: thanks! [15:06:00] aude: you are the easiest ones to deploy because you always make your submodule update and I trust it so I don't go reviewing what is inside it [15:06:46] manybubbles: yes [15:06:47] ok [15:06:55] manybubbles: will fix it up [15:06:58] andrewbogott: will fix it up [15:07:13] hashar: I'm doing a restart and it's hanging :( [15:07:19] So… over to you! [15:07:24] andrewbogott: yeah that is normal it wait for jobs to complete [15:07:27] will try something :] [15:07:31] if that works will update whatever doc [15:08:11] bd808|BUFFER, GroggyPanda, greg-g, Reedy: Any idea why jouncebot decided to ping aude for this morning's SWAT? She's not (yet) on the "who" list. [15:08:26] stupid Jenkins [15:08:33] hmm, no idea where it picks its list from... [15:08:50] I could possibly take a look at the code... [15:09:38] not a big deal [15:14:13] andrewbogott: yup, done [15:14:18] (03PS1) 10Hashar: zuul: make init restart command friendlier [puppet] - 10https://gerrit.wikimedia.org/r/164988 [15:15:19] so zuul jobs are completing [15:15:21] then it should reload [15:15:42] (03PS3) 10Giuseppe Lavagetto: nagios_common: use a template for contacts. [puppet] - 10https://gerrit.wikimedia.org/r/164301 [15:17:11] * hashar waits for the super long jobs [15:19:21] !log ran 'sudo -u ocg -g ocg nodejs-ocg scripts/run-garbage-collect.js -c /home/cscott/config.js' from /home/cscott/ocg/mw-ocg-service in order to clear caches (working around https://gerrit.wikimedia.org/r/164644 ) on ocg100x.eqiad.wmnet [15:19:30] Logged the message, Master [15:19:59] greg-g: i'll be looking for a deploy window asap to deploy the actual fix in https://gerrit.wikimedia.org/r/164644 for ocg [15:20:29] it looks like swat is going on right now? [15:20:29] (03Merged) 10jenkins-bot: Add tracking categories for files with attribution problems [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164918 (owner: 10Gergő Tisza) [15:20:35] the inability of zuul to exploit parallelism or eliminate redundancy is ridiculous [15:20:45] one job left in Zuul [15:20:58] aude, tgr, manybubbles, anomie, marktraceur: what's the status of swat? [15:21:04] or is that what you are working on, hashar? [15:21:06] waiting on jenkins [15:21:10] I am [15:21:10] waaaaaaaaaaaaaaaaiting [15:21:14] will document it [15:21:16] cscott: yeah. just finished [15:21:16] anomie: jouncebot pings the folks in the "deployer" column and the first user in the "changes" column as I recall. I see Katie's nick in the changes column for today's morning swat. [15:21:26] zuul is being restarted and waits for a job to complete [15:21:29] manybubbles: ok, thanks. [15:21:29] bd808: fine with me [15:21:36] i've still got to prepare my deploy commit anyway. [15:21:38] swatees should be pinged also, imho [15:22:05] bd808: That explains it. Seems odd to do one instead of all or none of the "changes" people though. 
[15:22:13] !log manybubbles Synchronized wmf-config/: SWAT Add tracking categories for files with attribution problems (duration: 00m 06s) [15:22:14] tgr: ^^^^ [15:22:18] Logged the message, Master [15:22:19] !log swiftrepl replicating non-sharded originals containers eqiad -> codfw [15:22:25] Logged the message, Master [15:22:26] andrewbogott: can you confirm on your terminal that zuul reloaded properly ? Apparently it did [15:22:31] I think it's a parser problem. Pinging all would be best imo [15:22:41] !log Zuul jobs proceeding again [15:22:46] Logged the message, Master [15:23:52] !log manybubbles Synchronized php-1.25wmf2/extensions/Wikidata/: SWAT update wikidata (duration: 00m 10s) [15:23:53] aude: ^^^^^ [15:23:57] yay [15:23:57] Logged the message, Master [15:24:11] * aude verifies [15:24:15] aude: let me know when you think wmf2 is good and I'll +2 wmf1 [15:24:16] yeah, that [15:24:17] ori: if I had the time to set up a few other jenkins boxes, we would have a bit more of redundancy :D [15:24:20] manybubbles: thanks, works on mw.org [15:24:33] hashar: yeah, but the queue is processed synchronously [15:24:34] tgr: perfect! [15:24:38] doesn't work on beta, but whatever, I'll figure that out some other time [15:24:42] seems good [15:24:52] tgr: yeah - thats usually its own thing [15:24:55] aude: cool [15:25:10] ori: possibly. On restart Zuul stop processing new events and wait for jobs to complete before restarting [15:25:14] hashar: and a lot of jobs are dupes; when there's no change from the last patchset to +2 it should just reuse the test results from the test for the gate-and-submit [15:25:58] ori: yeah possibly though when you combine multiple repos together it is hard to know whether it is going to test the same [15:26:01] that would slash load and latency in about half [15:26:17] ori: though when one +2 a change we can probably cancel jobs for that change in the other pipelines [15:26:28] don't cancel them, since they're in progress [15:26:36] just use the results for gate-and-submit [15:26:56] that is not really going to work though when you have multiple repositories in the gate [15:27:25] if you test a change to a mediawiki extension, it is done with the tip of mediawiki branch [15:27:38] anomie: It's a bug in the scraping code -- https://github.com/wikimedia/wikimedia-bots-jouncebot/blob/master/deploypage.py#L77 -- That xpath does not match Gergő today because there is a
around his section inside the [15:27:52] but on +2 , the mediawiki extension change enter the gate, and there might be a mediawiki/core change ahead of it that might change the extension patch test result [15:27:56] anomie: no idea why jouncebot pinged katie [15:28:04] (03PS3) 10Reza: Enable "import" on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164933 (https://bugzilla.wikimedia.org/71681) [15:28:09] * hashar nature's call [15:30:09] (03PS4) 10Giuseppe Lavagetto: nagios_common: allow use of a template for contacts. [puppet] - 10https://gerrit.wikimedia.org/r/164301 [15:30:12] <_joe_> GroggyPanda: ^^ [15:30:24] <_joe_> GroggyPanda: this is both safe and allows moving to hiera [15:30:32] <_joe_> which I will do in prod shortly :) [15:30:55] anomie: ah, I see you and bd808 already discussed it :) [15:31:56] looking [15:32:13] ok, starting a deploy of ocg. [15:32:38] <_joe_> cscott: \o/ [15:32:38] _joe_: makes sense [15:32:48] <_joe_> GroggyPanda: ok, merging [15:33:05] (03CR) 10Giuseppe Lavagetto: [C: 032] nagios_common: allow use of a template for contacts. [puppet] - 10https://gerrit.wikimedia.org/r/164301 (owner: 10Giuseppe Lavagetto) [15:33:48] (03PS1) 10BryanDavis: Make extraction more tolerant of HTML changes [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/164992 [15:37:18] !log manybubbles Synchronized php-1.25wmf1/extensions/Wikidata/: SWAT update wikidata (duration: 00m 10s) [15:37:19] aude: ^^^^ [15:37:22] yay [15:37:23] Logged the message, Master [15:38:09] ori: btw, is rcstream considered 'production' now? or is it on hold until HAT comes through? [15:38:18] it's production [15:38:25] https://www.wikidata.org/wiki/Q72 works again! [15:38:31] l'production? c'est rcstream. [15:38:33] and the other things look fixed [15:38:47] not the memcached traffic graph! :P [15:39:22] memcached what? [15:40:05] aude: http://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&c=Memcached+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [15:40:20] the dip in october is when you guys fixed the site group bug [15:40:24] woooah [15:40:24] but it climbed back up [15:40:40] and wikidata/wikibase predominates that traffic [15:40:46] hashar: it says … waiting …… for many lines and the finally returned. I take that to be good news. [15:41:00] it's hurting site perf, you can see it near the top of ?forceprofile=true for pages [15:41:11] yikes [15:41:17] <_joe_> ori: hi [15:41:28] hi _joe_ [15:41:43] <_joe_> you calendar was for friday btw [15:41:57] <_joe_> but, whenever you're ready, we can open the gates :) [15:41:58] i know, i just realized it [15:42:15] <_joe_> ori: all techies struggle with calendars [15:42:24] andrewbogott: yeah the jobs pending eventually completed. [15:42:55] it's as if something got accidentally reverted [15:43:14] * aude checks [15:43:21] <_joe_> ori: when you know how to use calendars, you've become a manager [15:44:29] isn't a manager supposed to hold changes on Fridays? [15:44:30] _joe_: mark is apparently still working on that, then [15:44:30] aude: it's the sitegroup stuff. i emailed wikidata@ about it [15:45:00] (love you, mark ;) ) [15:45:14] !log deleted graphite data for deployment-rsync02 by hand on labmon1001, since instance has been dead. 
Need to move to shinken + dynamic host.cfg [15:45:18] ori: ok [15:45:19] Logged the message, Master [15:46:33] (03PS2) 10Rush: phabricator - raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/164897 (https://bugzilla.wikimedia.org/38516) (owner: 10Chmarkine) [15:46:49] we were supposed to be putting badges in parser cache and using that [15:47:05] though when i look at a wikipedia page, they seem to have badges without purging [15:47:05] (03CR) 10Rush: [C: 032 V: 032] phabricator - raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/164897 (https://bugzilla.wikimedia.org/38516) (owner: 10Chmarkine) [15:47:20] seems odd to me, as thought we were going to need those pages purged [15:47:31] for the badges to show up now [15:47:53] anyway, looking [15:48:40] <_joe_> greg-g: he's resisting the metamorphosis. [15:49:14] (03PS1) 10BBlack: Add check_ifstatus_nomon and use it for routers [puppet] - 10https://gerrit.wikimedia.org/r/164999 [15:51:14] _joe_: resistance is... well, you know [15:52:15] <_joe_> greg-g: I managed to resist to managerialization pretty well for more than two years :) [15:52:24] I failed :) [15:52:57] <_joe_> greg-g: so you learned to use a calendar AND excel? [15:53:22] oh god no [15:54:00] !log updated OCG to version aee3712b352f51f96569de0bcccf3facf654e688 [15:54:07] Logged the message, Master [15:54:14] _joe_: pfft, it's greg-g. It's a calendar and OpenCalc :) [15:54:29] or libreoffice calc [15:55:29] gnumeric [15:56:06] hah [15:56:26] we should setup an ethercalc somewhere on labs [15:57:17] oh man, it's in another Javascript variant [15:57:46] wait, and it uses redis for *storage*?!!!?! [15:57:46] sigh [16:00:21] (03CR) 10Andrew Bogott: [C: 032] icinga: Move purge resource script into module [puppet] - 10https://gerrit.wikimedia.org/r/164676 (owner: 10Yuvipanda) [16:01:58] (03PS3) 10Manybubbles: Install new plugin and upgrade another [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/164633 [16:03:00] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [16:03:05] (03CR) 10Manybubbles: [C: 04-1] "Haven't uploaded these to Archiva yet but will later today." [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/164633 (owner: 10Manybubbles) [16:03:43] ori: trying with the deployed version (and master), dont see where we are accessing site store / memcached on page views [16:03:50] wikipedia or wikidata [16:03:57] but do see with forceprofile=true [16:04:09] <_joe_> puppet on palladium did not fail at all... [16:04:33] also notice http://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&c=Memcached+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report dropping again [16:04:50] (03PS3) 10KartikMistry: Add initial Debian packaging [debs/contenttranslation/cg3] - 10https://gerrit.wikimedia.org/r/163579 [16:04:52] (03CR) 10Andrew Bogott: [C: 032] icinga: Move misc files/dirs into module [puppet] - 10https://gerrit.wikimedia.org/r/164678 (owner: 10Yuvipanda) [16:05:08] need to investigate more [16:05:25] yeah [16:06:32] * aude off to eat, maybe back later although i am tired [16:07:08] * ori waves [16:08:15] GroggyPanda: time for https://gerrit.wikimedia.org/r/#/c/162873/? 
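
The jouncebot bug bd808 diagnoses above (15:27) is an XPath pinned to an exact element layout: one extra wrapper tag inside the Deployments table cell and the match silently fails, so only some of the people in the "changes" column get pinged. Being "more tolerant of HTML changes", as his patch is titled, generally means matching rows loosely and flattening whatever markup sits inside each cell. A sketch of that idea, not the actual patch; the deploycal-item class is assumed from the anchor names seen earlier in this log:

    #!/usr/bin/env python
    # Tolerant extraction sketch: match rows by a class substring and take the
    # text content of each cell regardless of how it is nested.
    from lxml import html

    def window_owners(page_source):
        tree = html.fromstring(page_source)
        owners = {}
        for row in tree.xpath('//tr[contains(@class, "deploycal-item")]'):
            cells = row.xpath('./td')
            if len(cells) < 2:
                continue
            window = cells[0].text_content().strip()
            people = cells[1].text_content().split()
            owners[window] = people
        return owners
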
[16:08:25] chasemp: ssure [16:08:42] (03PS9) 10Rush: Abstracts Sprint install with defined resource type phabricator::libext [puppet] - 10https://gerrit.wikimedia.org/r/162873 (owner: 10Christopher Johnson (WMDE)) [16:08:51] (03CR) 10Rush: [C: 032 V: 032] Abstracts Sprint install with defined resource type phabricator::libext [puppet] - 10https://gerrit.wikimedia.org/r/162873 (owner: 10Christopher Johnson (WMDE)) [16:09:01] chasemp: let me know when you've puppet merged it, [16:09:04] I'll force ar un [16:09:05] *run [16:09:12] ok merged, see if it breaks your stuffs [16:09:34] never used their extension but it looks ok so hopefully [16:10:32] (03PS1) 10Ori.livneh: Set $wgPercentHHVM to 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165004 [16:11:04] (03PS3) 10Andrew Bogott: icinga: Remove misc/icinga.pp [puppet] - 10https://gerrit.wikimedia.org/r/164679 (owner: 10Yuvipanda) [16:12:06] (03CR) 10Giuseppe Lavagetto: [C: 031] Set $wgPercentHHVM to 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165004 (owner: 10Ori.livneh) [16:12:27] chasemp: seems ok so far... [16:12:41] (03CR) 10Andrew Bogott: [C: 032] icinga: Remove misc/icinga.pp [puppet] - 10https://gerrit.wikimedia.org/r/164679 (owner: 10Yuvipanda) [16:12:51] (03PS4) 10Andrew Bogott: icinga: Move checkcommands.erb into module [puppet] - 10https://gerrit.wikimedia.org/r/164681 (owner: 10Yuvipanda) [16:13:18] GroggyPanda: is this phab-01? [16:13:21] you are testing on [16:13:21] yeah [16:13:23] k [16:13:45] chasemp: https://phab-01.wmflabs.org/config/issue/config.unknown.phabricator.show-prototypes/ [16:13:47] (03CR) 10Ori.livneh: [C: 032] Set $wgPercentHHVM to 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165004 (owner: 10Ori.livneh) [16:13:56] <_joe_> woo [16:13:59] (03Merged) 10jenkins-bot: Set $wgPercentHHVM to 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165004 (owner: 10Ori.livneh) [16:14:51] GroggyPanda: they changed beta to prototypes the phab version there must be old [16:15:02] old enough to not know about the change [16:15:05] probably best just to update? [16:15:13] does a puppet run warn about tag out of sync? [16:15:32] nope, it doesn't... [16:15:41] chasemp: sorry, got distracted by a phone call. brb? [16:15:51] sure [16:16:45] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [16:19:50] !log ori Synchronized wmf-config/CommonSettings.php: Set $wgPercentHHVM to 1 (duration: 00m 27s) [16:19:58] Logged the message, Master [16:20:05] (03CR) 10Andrew Bogott: [C: 032] icinga: Move checkcommands.erb into module [puppet] - 10https://gerrit.wikimedia.org/r/164681 (owner: 10Yuvipanda) [16:20:06] _joe_: ^ [16:20:21] <_joe_> ok [16:20:44] (03PS4) 10Andrew Bogott: icinga: Remove unused check plugins [puppet] - 10https://gerrit.wikimedia.org/r/164682 (owner: 10Yuvipanda) [16:20:52] _joe_: the value is echoed to javascript via the startup module, which is cached for five mins for anons [16:20:57] _joe_: so the impact may be delayed [16:22:13] RECOVERY - puppet last run on pc1001 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [16:22:21] (03CR) 10BryanDavis: [C: 031] Move fatalmonitor to fluorine [puppet] - 10https://gerrit.wikimedia.org/r/164314 (owner: 10Reedy) [16:22:41] Ops doens't run any blacklist of IPs to get 503s from all Wikipedias, do they? 
[16:24:35] _joe_: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=HHVM+Appservers+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [16:25:03] seems sane [16:25:07] <_joe_> yeah I'm watching that better that a CL match [16:25:24] CL? [16:25:29] (03CR) 10Andrew Bogott: [C: 032] icinga: Remove unused check plugins [puppet] - 10https://gerrit.wikimedia.org/r/164682 (owner: 10Yuvipanda) [16:25:32] oh, champions league [16:25:54] <_joe_> ori: yeah sorry [16:26:08] GroggyPanda: that everything? [16:26:09] <_joe_> traffic is arriving, no shit. [16:27:41] RECOVERY - puppet last run on pc1002 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:28:50] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures [16:30:06] PROBLEM - puppet last run on db2029 is CRITICAL: CRITICAL: Puppet has 1 failures [16:30:26] PROBLEM - puppet last run on db1043 is CRITICAL: CRITICAL: Puppet has 1 failures [16:30:57] PROBLEM - puppet last run on db1069 is CRITICAL: CRITICAL: Puppet has 1 failures [16:31:07] PROBLEM - puppet last run on db1048 is CRITICAL: CRITICAL: Puppet has 1 failures [16:31:16] PROBLEM - puppet last run on db1020 is CRITICAL: CRITICAL: Puppet has 1 failures [16:31:17] PROBLEM - puppet last run on db2023 is CRITICAL: CRITICAL: Puppet has 1 failures [16:31:36] PROBLEM - puppet last run on db1071 is CRITICAL: CRITICAL: Puppet has 1 failures [16:31:42] andrewbogott, GroggyPanda: are you sure about those icinga checks being unused? [16:31:46] PROBLEM - puppet last run on db2016 is CRITICAL: CRITICAL: Puppet has 1 failures [16:31:46] PROBLEM - puppet last run on db1062 is CRITICAL: CRITICAL: Puppet has 1 failures [16:32:03] ori: looking... [16:32:07] PROBLEM - puppet last run on db1036 is CRITICAL: CRITICAL: Puppet has 1 failures [16:33:17] RECOVERY - puppet last run on pc1003 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [16:33:17] (03CR) 10Andrew Bogott: "On db1069:" [puppet] - 10https://gerrit.wikimedia.org/r/164682 (owner: 10Yuvipanda) [16:33:23] (03PS1) 10Andrew Bogott: Revert "icinga: Remove unused check plugins" [puppet] - 10https://gerrit.wikimedia.org/r/165007 [16:34:21] PROBLEM - puppet last run on es1002 is CRITICAL: CRITICAL: Puppet has 1 failures [16:34:30] PROBLEM - puppet last run on db1070 is CRITICAL: CRITICAL: Puppet has 1 failures [16:34:40] PROBLEM - puppet last run on db1063 is CRITICAL: CRITICAL: Puppet has 1 failures [16:34:45] (03CR) 10Andrew Bogott: [C: 032] Revert "icinga: Remove unused check plugins" [puppet] - 10https://gerrit.wikimedia.org/r/165007 (owner: 10Andrew Bogott) [16:35:11] PROBLEM - puppet last run on db1044 is CRITICAL: CRITICAL: Puppet has 1 failures [16:35:21] PROBLEM - puppet last run on db2028 is CRITICAL: CRITICAL: Puppet has 1 failures [16:35:31] PROBLEM - puppet last run on es1010 is CRITICAL: CRITICAL: Puppet has 1 failures [16:35:39] PROBLEM - puppet last run on db1047 is CRITICAL: CRITICAL: Puppet has 1 failures [16:35:50] PROBLEM - puppet last run on dbstore1002 is CRITICAL: CRITICAL: Puppet has 1 failures [16:36:01] PROBLEM - puppet last run on db1011 is CRITICAL: CRITICAL: Puppet has 1 failures [16:36:06] <_joe_> who changed the dbs? [16:36:20] RECOVERY - puppet last run on db1069 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:36:22] <_joe_> andrewbogot ^^^ [16:36:24] They should recover in a minute. It was just an icinga check [16:36:26] Ah, there we go. 
[16:36:30] PROBLEM - puppet last run on db2012 is CRITICAL: CRITICAL: Puppet has 1 failures [16:36:31] <_joe_> ok [16:37:11] PROBLEM - puppet last run on db2011 is CRITICAL: CRITICAL: Puppet has 1 failures [16:37:20] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Puppet has 1 failures [16:37:24] RECOVERY - puppet last run on db1036 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [16:37:30] PROBLEM - puppet last run on db1035 is CRITICAL: CRITICAL: Puppet has 1 failures [16:37:47] (03CR) 10Andrew Bogott: [C: 032] zuul: make init restart command friendlier [puppet] - 10https://gerrit.wikimedia.org/r/164988 (owner: 10Hashar) [16:40:05] (03PS2) 10BBlack: Add check_ifstatus_nomon and use it for routers [puppet] - 10https://gerrit.wikimedia.org/r/164999 [16:43:37] (03CR) 10BBlack: [C: 032] Add check_ifstatus_nomon and use it for routers [puppet] - 10https://gerrit.wikimedia.org/r/164999 (owner: 10BBlack) [16:46:39] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:47:39] RECOVERY - puppet last run on db2029 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [16:48:00] RECOVERY - puppet last run on db1043 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [16:48:41] RECOVERY - puppet last run on db1048 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:49:19] RECOVERY - puppet last run on db1071 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [16:49:20] RECOVERY - puppet last run on db2016 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:49:40] RECOVERY - puppet last run on db1020 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [16:50:03] RECOVERY - puppet last run on db2023 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [16:50:20] RECOVERY - puppet last run on db1062 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [16:50:22] who knows about the production deployment of redis? [16:51:46] RECOVERY - puppet last run on es1002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [16:52:36] RECOVERY - puppet last run on db1044 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [16:53:06] RECOVERY - puppet last run on db1063 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [16:53:17] RECOVERY - puppet last run on db1011 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [16:53:17] RECOVERY - puppet last run on db1070 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [16:53:26] RECOVERY - puppet last run on es1010 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [16:53:36] RECOVERY - puppet last run on db2011 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [16:53:56] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [16:53:57] RECOVERY - puppet last run on db2012 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:53:57] RECOVERY - puppet last run on db1047 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:54:03] ori: where does $wgPercentHHVM come from? 
[16:54:22] (03PS2) 10Dzahn: remove ishmael SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/164697 [16:54:26] RECOVERY - puppet last run on dbstore1002 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [16:54:30] RECOVERY - puppet last run on db2028 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [16:54:35] (03PS2) 10Dzahn: delete contacts.wikimedia.org SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/164695 [16:54:45] (03PS2) 10Dzahn: remove virt-star.pmtpa SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/164696 [16:54:52] (03PS2) 10Dzahn: delete nfs[12].pmtpa SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/164694 [16:55:02] (03PS2) 10Dzahn: delete sanger SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/164692 [16:55:14] (03PS2) 10Dzahn: delete the noc.wikimedia.org SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/164001 [16:55:20] (03PS3) 10Dzahn: remove blog SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/164698 [16:55:30] (03PS2) 10Dzahn: delete snuggle.wikimedia.org SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/164690 [16:55:57] RECOVERY - puppet last run on db1035 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [16:57:37] !log git-deploy: Deploying integration/slave-scripts 0b85d48 [16:57:39] legoktm: https://github.com/search?q=wgPercentHHVM+%40wikimedia&type=Code&utf8=%E2%9C%93 [16:57:41] CommonSettings.php [16:57:43] under wmgUseEventLogging [16:57:45] Logged the message, Master [16:58:04] Krinkle: right, I mean what code is reading that variable and setting the cookie based on it [17:01:41] (03PS2) 10coren: give *Coren* Icinga permissions [puppet] - 10https://gerrit.wikimedia.org/r/163853 [17:01:45] legoktm: https://github.com/wikimedia/mediawiki-extensions-NavigationTiming/blob/wmf/1.25wmf2/modules/ext.navigationTiming.HHVM.js [17:01:46] (03CR) 10jenkins-bot: [V: 04-1] give *Coren* Icinga permissions [puppet] - 10https://gerrit.wikimedia.org/r/163853 (owner: 10coren) [17:03:23] <_joe_> !log depooling and repooling progressively hhvm appservers to do see performance under load [17:03:30] Logged the message, Master [17:03:31] (03PS6) 10coren: Autogenerate chained certificates [puppet] - 10https://gerrit.wikimedia.org/r/163798 [17:03:33] (03PS1) 10coren: Remove obsolete explicit ca certs references [puppet] - 10https://gerrit.wikimedia.org/r/165012 [17:05:05] (03PS3) 10Andrew Bogott: give *Coren* Icinga permissions [puppet] - 10https://gerrit.wikimedia.org/r/163853 (owner: 10coren) [17:12:23] (03PS1) 10Calak: Require 10 edits for autoconfirmed on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165014 (https://bugzilla.wikimedia.org/71709) [17:16:23] (03PS1) 10Jgreen: add otrs patches and add-ons [software] - 10https://gerrit.wikimedia.org/r/165016 [17:17:34] (03CR) 10Jgreen: [C: 032 V: 031] add otrs patches and add-ons [software] - 10https://gerrit.wikimedia.org/r/165016 (owner: 10Jgreen) [17:18:13] (03CR) 10coren: [C: 031] "RIP sanger" [puppet] - 10https://gerrit.wikimedia.org/r/164692 (owner: 10Dzahn) [17:19:22] (03CR) 10coren: [C: 031] "Good riddance?" 
[puppet] - 10https://gerrit.wikimedia.org/r/164698 (owner: 10Dzahn) [17:20:45] (03CR) 10coren: [C: 04-1] "I'm pretty sure that was requested for a Tool Labs project; let my first contact the apparent maintainers and figure out if they were just" [puppet] - 10https://gerrit.wikimedia.org/r/164690 (owner: 10Dzahn) [17:21:14] (03PS1) 10Jgreen: oops, realized operations/software/otrs.git is its own repo [software] - 10https://gerrit.wikimedia.org/r/165019 [17:21:21] (03CR) 10Dzahn: [C: 031] Require 10 edits for autoconfirmed on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165014 (https://bugzilla.wikimedia.org/71709) (owner: 10Calak) [17:21:27] (03CR) 10Jgreen: [C: 032 V: 031] oops, realized operations/software/otrs.git is its own repo [software] - 10https://gerrit.wikimedia.org/r/165019 (owner: 10Jgreen) [17:21:47] PROBLEM - HHVM rendering on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:00] (03CR) 10coren: [C: 031] delete the noc.wikimedia.org SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/164001 (owner: 10Dzahn) [17:22:35] (03CR) 10coren: [C: 031] delete nfs[12].pmtpa SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/164694 (owner: 10Dzahn) [17:22:38] <_joe_> !log hhvm load test finished [17:22:47] Logged the message, Master [17:22:50] RECOVERY - HHVM rendering on mw1018 is OK: HTTP OK: HTTP/1.1 200 OK - 74666 bytes in 9.541 second response time [17:23:12] (03CR) 10coren: [C: 031] remove virt-star.pmtpa SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/164696 (owner: 10Dzahn) [17:23:53] (03CR) 10Krinkle: [C: 031] "Cherry-picked to integration-puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/163791 (owner: 10Krinkle) [17:23:55] <_joe_> and I'm off until ops meeting, but ping me here in case of errors [17:24:26] (03PS3) 10Krinkle: use scap's embedded linking, remove lint script [puppet] - 10https://gerrit.wikimedia.org/r/160691 (https://bugzilla.wikimedia.org/68255) (owner: 10Filippo Giunchedi) [17:24:59] (03CR) 10Krinkle: "Fixed Bug reference to be in the footer instead of in the body (similar to HTTP and E-mail headers)." [puppet] - 10https://gerrit.wikimedia.org/r/160691 (https://bugzilla.wikimedia.org/68255) (owner: 10Filippo Giunchedi) [17:26:25] (03PS1) 10Jgreen: add git-setup script for otrs repo [software/otrs] - 10https://gerrit.wikimedia.org/r/165021 [17:30:19] (03PS1) 10Ottomata: Add Leila to analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/165022 [17:31:20] (03CR) 10Ottomata: [C: 032 V: 032] Add Leila to analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/165022 (owner: 10Ottomata) [17:37:15] (03CR) 10Jgreen: [C: 032 V: 032] add git-setup script for otrs repo [software/otrs] - 10https://gerrit.wikimedia.org/r/165021 (owner: 10Jgreen) [17:37:26] (03CR) 10Dzahn: [C: 031] "since the script should automatically find the CA now. nice simplification" [puppet] - 10https://gerrit.wikimedia.org/r/165012 (owner: 10coren) [17:38:17] (03PS1) 10Jgreen: move 2.4.x patches into a version-named directory [software/otrs] - 10https://gerrit.wikimedia.org/r/165023 [17:42:16] andrewbogott: thanks! :) [17:42:23] now for some more... [17:43:05] (03CR) 10Ebrahim: [C: 031] Require 10 edits for autoconfirmed on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165014 (https://bugzilla.wikimedia.org/71709) (owner: 10Calak) [17:44:05] bd808: ohhhh, it wasn't merged into master. 
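
To tie together the $wgPercentHHVM thread above (16:10 and 16:54 to 17:01): the config value is echoed to JavaScript via the startup module, and ext.navigationTiming.HHVM.js uses it to decide whether a given browser opts into the HHVM pool by setting a cookie, which is why the impact of the change lags behind the deploy. The sampling decision, restated in Python purely for illustration; the cookie name and lifetime here are assumptions, not necessarily what the extension uses:

    #!/usr/bin/env python
    # Percentage-based opt-in sketch mirroring the wgPercentHHVM logic.
    import random

    def hhvm_cookie(percent_hhvm, ttl_seconds=7 * 24 * 3600):
        """Return a Set-Cookie value for roughly percent_hhvm% of callers."""
        if random.uniform(0, 100) >= percent_hhvm:
            return None
        return 'HHVM=true; Max-Age=%d; Path=/' % ttl_seconds

    # With $wgPercentHHVM = 1, about 1 request in 100 opts in:
    # sum(1 for _ in range(10000) if hhvm_cookie(1)) is around 100.
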
[17:46:16] (03PS1) 10Jkrauska: Cosmetic cleanup to templates [puppet] - 10https://gerrit.wikimedia.org/r/165025 [17:51:15] (03CR) 10Dzahn: [C: 031] "i like this, especially because we often had issues in the past with chains being wrongly created because it needed a manual addition to c" [puppet] - 10https://gerrit.wikimedia.org/r/163798 (owner: 10coren) [17:52:32] (03CR) 10Alexandros Kosiaris: [C: 032] Cosmetic cleanup to templates [puppet] - 10https://gerrit.wikimedia.org/r/165025 (owner: 10Jkrauska) [17:53:02] (03CR) 10Aaron Schulz: [C: 032] Removed obsolete comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164693 (owner: 10Aaron Schulz) [17:53:16] (03Merged) 10jenkins-bot: Removed obsolete comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164693 (owner: 10Aaron Schulz) [17:56:09] (03CR) 10Andrew Bogott: "For reference: https://rt.wikimedia.org/Ticket/Display.html?id=8554" [puppet] - 10https://gerrit.wikimedia.org/r/163798 (owner: 10coren) [17:57:54] springle: have you ever considered tweaking innodb_old_blocks_time? Just curious. [17:58:54] we do have a fair amount of random API queries that hit stuff nobody cares about later I'd suspect...though I don't know if the pages are often hit twice in those cases (or just the initial time). [17:58:55] (03PS1) 10Giuseppe Lavagetto: HAT: mark failed requests with an additional header [puppet] - 10https://gerrit.wikimedia.org/r/165028 [18:02:39] bd808: I'm ready to do some beta testing of salt upgrade to 2014.1.10 this week; do you have some time available, and if so, when's good for you? [18:03:25] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [18:03:42] apergos: What help do you need from me? Just running trebuchet through it's paces after the update? [18:04:05] sure, and there's going to be a trick to it... [18:04:15] I can shovel the dependencies into the repo [18:04:39] but I can't really put salt-* into it because if I do so our old versions will suddenly disappear, which is not good [18:05:41] So we need to install via dpkg I guess? [18:06:35] (03PS1) 10Jgreen: add 3.2.x packages and add-ons, and 3.2.14 security patches [software/otrs] - 10https://gerrit.wikimedia.org/r/165030 [18:07:12] well that's what I've done in my local testing [18:07:21] it's been ok [18:07:30] I have the packages so that's no problem [18:07:34] (03CR) 10Krinkle: HAT: mark failed requests with an additional header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/165028 (owner: 10Giuseppe Lavagetto) [18:07:48] ah what ubuntu distros do we have in beta? precise and trusty I guess? any lucid? [18:07:57] no lucid [18:08:01] hrm [18:08:13] but both precise and trusty, yes [18:08:23] salt master is precise [18:08:28] any way to spin one up jut for kicks? or do we not even have the ability any more? [18:08:37] that's good since production master is precise also [18:09:14] I don't think there is a lucid labs image [18:10:13] ah bummer [18:10:26] * apergos looks around for Coren (who is in the same meeting I am) [18:10:43] ok, I'll pick this up again after our meeting... [18:17:03] apergos: I can temporarily enable a lucid image on labs -- it's easy to do but I've no idea if it will spin up or not. It definitely will not be able to access shared labs storage due to nfs versioning issues. 
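On the innodb_old_blocks_time question above: it is the standard InnoDB guard against one-off scans polluting the buffer pool, and it can be tried live because it is a dynamic global. A minimal sketch (the 1000 ms value is illustrative, not a recommendation for these hosts):

    # a page touched by a random one-shot API query lands in the "old" end of the
    # buffer pool and must survive this many milliseconds before a re-read can
    # promote it, so throwaway reads stop evicting the hot working set
    mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_old_blocks%'"
    mysql -e "SET GLOBAL innodb_old_blocks_time = 1000"   # dynamic; no restart needed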
[18:17:22] apergos: ping me after the meeting, I'll set it up [18:17:40] (The images are there but hidden, it's trivial to unhide and rehide) [18:18:01] hmm, I wonder if it'll even let puppet run... [18:18:09] because of things in standard labs roles now? [18:18:39] Yeah, hard to predict. [18:19:05] that will be amusing (or not), heh [18:19:36] I don't care about shared anything, it just has to run some crappy verion of salt, be upgraded, and behave well afterwards with trebuchet and whatever else [18:24:40] (03PS2) 10Ori.livneh: apache_status: get stats from 127.0.0.1, not localhost [puppet] - 10https://gerrit.wikimedia.org/r/164885 [18:29:36] (03PS3) 10Ori.livneh: apache_status: get stats from 127.0.0.1, not localhost [puppet] - 10https://gerrit.wikimedia.org/r/164885 [18:30:19] _joe_: speaking of +1s.... ;) ^^ [18:31:01] <_joe_> ori: https://gerrit.wikimedia.org/r/#/c/165028/ this in exchange :) [18:31:11] * ori reviews [18:35:59] (03CR) 10Ori.livneh: "What about requests for static assets that are handled entirely by Apache? Those won't have the X-Powered-By header. Perhaps we should set" [puppet] - 10https://gerrit.wikimedia.org/r/165028 (owner: 10Giuseppe Lavagetto) [18:36:37] anybody know how to unmute(?) a service check in icinga? [18:36:38] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=stat1001 [18:36:45] dunno why it is like that. [18:37:05] ah, wait, enable notifications.. [18:37:06] .maybe. [18:37:34] PROBLEM - CI tmpfs disk space on gallium is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 24 MB (4% inode=99%): [18:44:41] (03PS1) 10Ottomata: Add Eric Zachte to analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/165032 [18:45:54] (03CR) 10Ottomata: [C: 032] Add Eric Zachte to analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/165032 (owner: 10Ottomata) [18:46:31] RECOVERY - CI: Puppet failure events on labmon1001 is OK: OK: All targets OK [18:48:31] (03CR) 10Jgreen: [C: 032 V: 032] move 2.4.x patches into a version-named directory [software/otrs] - 10https://gerrit.wikimedia.org/r/165023 (owner: 10Jgreen) [18:48:42] (03CR) 10Filippo Giunchedi: [C: 031] "+1 since I requested it, and given that salt-minion is already everywhere it should be monitored. Note however that the related RT refers " [puppet] - 10https://gerrit.wikimedia.org/r/164651 (owner: 10Dzahn) [18:49:28] (03CR) 10Jgreen: [C: 032 V: 032] add 3.2.x packages and add-ons, and 3.2.14 security patches [software/otrs] - 10https://gerrit.wikimedia.org/r/165030 (owner: 10Jgreen) [18:49:40] mutante: There was just a huge spike in mobile traffic from Germany, any idea what's going on with that? [18:49:44] https://gdash.wikimedia.org/dashboards/reqmobile/ [18:51:21] PROBLEM - Disk space on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:51:33] PROBLEM - DPKG on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:51:43] PROBLEM - check configured eth on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:51:52] PROBLEM - puppet last run on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:51:52] PROBLEM - check if dhclient is running on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:52:00] (03PS2) 10Ori.livneh: mediawiki: install `perf` on Trusty app servers [puppet] - 10https://gerrit.wikimedia.org/r/164883 [18:52:12] PROBLEM - RAID on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
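The salt test plan sketched above, in command form: since pushing salt 2014.1.10 into the shared apt repo would also replace the version production still runs, the beta minions get the new debs installed by hand. A rough sketch with illustrative package filenames:

    sudo dpkg -i salt-common_2014.1.10+ds-1_all.deb salt-minion_2014.1.10+ds-1_all.deb
    # optionally mark it held so a later blanket apt run leaves the hand-installed version alone
    echo "salt-minion hold" | sudo dpkg --set-selections
    sudo service salt-minion restart
    salt-minion --version   # confirm 2014.1.10 before exercising trebuchet against it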
[18:52:21] PROBLEM - SSH on rhenium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:52:48] kaldari: Well last night mobile traffic from Germany was occasionally dropping so likely issues are the same/related? [18:53:28] kaldari: JohnFLewis: People shortly reported connectivity problems from Germany yesterday [18:53:43] but they said it was fine again after a few minutes [18:53:58] and I couldn't reproduce, although I have the same mobile ISP [18:54:21] PROBLEM - Disk space on gallium is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 16 MB (3% inode=99%): [18:56:47] (03CR) 10Domas: [V: 031] "I'd prefer this to be installed on *, not just HHVM, always annoying not to find Perf" [puppet] - 10https://gerrit.wikimedia.org/r/164883 (owner: 10Ori.livneh) [18:57:46] (03CR) 10Faidon Liambotis: [C: 04-1] "linux-tools-generic should be sufficient, lts-trusty is some kind of transitional package. Moreover, the generic name is generic enough :)" [puppet] - 10https://gerrit.wikimedia.org/r/164883 (owner: 10Ori.livneh) [19:00:52] (03CR) 10Giuseppe Lavagetto: [C: 031] "please yes!" [puppet] - 10https://gerrit.wikimedia.org/r/164885 (owner: 10Ori.livneh) [19:01:06] (03PS4) 10Ori.livneh: apache_status: get stats from 127.0.0.1, not localhost [puppet] - 10https://gerrit.wikimedia.org/r/164885 [19:01:15] (03CR) 10Ori.livneh: [C: 032 V: 032] apache_status: get stats from 127.0.0.1, not localhost [puppet] - 10https://gerrit.wikimedia.org/r/164885 (owner: 10Ori.livneh) [19:01:57] mutante: btw, fixed the 'UNKNOWN's in the icinga labs alerts. a bunch of instances deleted with stale metrics [19:02:21] (03PS7) 10coren: Autogenerate chained certificates [puppet] - 10https://gerrit.wikimedia.org/r/163798 [19:02:52] (03CR) 10coren: [C: 031] "Added the forced newline after every cert per Andrew's suggestion." [puppet] - 10https://gerrit.wikimedia.org/r/163798 (owner: 10coren) [19:02:54] (03PS1) 10BBlack: Remove debugging cruft from check_ifstatus_nomon [puppet] - 10https://gerrit.wikimedia.org/r/165036 [19:03:55] (03CR) 10Giuseppe Lavagetto: "That's correct; we should probably get a better way to check if we're really using hhvm." [puppet] - 10https://gerrit.wikimedia.org/r/165028 (owner: 10Giuseppe Lavagetto) [19:03:58] !log issued cf disable and halt on nas1-a.pmtpa.wmnet nas1-b.pmtpa.wmnet. They are officially down :) [19:04:00] PROBLEM - NTP on rhenium is CRITICAL: NTP CRITICAL: No response from NTP server [19:04:05] Logged the message, Master [19:04:23] (03PS3) 10Ori.livneh: mediawiki: install `perf` on Trusty app servers [puppet] - 10https://gerrit.wikimedia.org/r/164883 [19:05:14] YuviPanda: like! thx [19:05:23] :D [19:05:27] andrewbogott: meeting done but I won't be around for a lot longer, it's been a long day. however I'm willing to try to spin up a random lucid instance with a junk name just to see if it comes up [19:05:31] ori: I think you missed half of my comment? [19:05:37] apergos: ok, one moment... 
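Ori's point about static assets can be seen with two quick probes: a page rendered through HHVM carries an X-Powered-By header, while a file Apache serves on its own does not, so header-sniffing alone cannot classify every failed request. A sketch with an illustrative host and paths:

    curl -sI -H 'Host: en.wikipedia.org' http://mw1018.eqiad.wmnet/wiki/Main_Page \
        | grep -i '^x-powered-by'   # e.g. X-Powered-By: HHVM/3.x on converted app servers
    curl -sI -H 'Host: en.wikipedia.org' http://mw1018.eqiad.wmnet/static/favicon/wikipedia.ico \
        | grep -i '^x-powered-by'   # no output: Apache answers this without touching HHVM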
[19:05:48] paravoid: i was just about to comment -- linux-tools-generic isn't available on precise [19:05:55] paravoid: there's linux-tools-common, but that isn't enough [19:06:14] ah it was called linux-tools before [19:06:24] ah ok [19:06:39] grumbly [19:06:44] grumble [19:06:48] (03CR) 10BBlack: [C: 032] Remove debugging cruft from check_ifstatus_nomon [puppet] - 10https://gerrit.wikimedia.org/r/165036 (owner: 10BBlack) [19:06:50] apergos: try now [19:06:57] k sec [19:07:04] paravoid: https://dpaste.de/aLdN/raw [19:07:28] right [19:07:34] apergos: let me know when the image is starting up, I'll hide that option again [19:08:00] ah iin that case let me try to spin it up in the beta project, since that is where we'll be doing testing [19:08:10] PROBLEM - CI tmpfs disk space on gallium is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 26 MB (5% inode=99%): [19:09:33] and it's not even linux-tools-$::kernelrelease [19:09:39] PROBLEM - Disk space on gallium is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 14 MB (2% inode=99%): [19:09:44] lucid was hell setting up for mailman apergos let me just say that :p [19:10:03] I am sure it was, and I don't envy you one bit [19:10:40] It more of a hell for andrewbogott though as he had to put our keys onto the instance [19:11:17] apergos: yeah, you'll have to use your root key to access the box, your standard labs user key won't show up there. [19:11:41] …presuming you have root keys installed on labs. If not I'll have to mess with it :) [19:14:47] (03PS4) 10Ori.livneh: base::standard-packages: install `perf` [puppet] - 10https://gerrit.wikimedia.org/r/164883 [19:20:21] chasemp: hey! want me to poke around on phab-01 again to figure out what that warning was about? [19:22:00] RECOVERY - Disk space on gallium is OK: DISK OK [19:22:02] andrewbogott: since you're unlucky enough to be the ops guy this week - mind looking at https://bugzilla.wikimedia.org/show_bug.cgi?id=70943 when you have a spare moment? :) [19:40:47] (03CR) 10Ottomata: Ensure that the namenode directory exists before starting the namenode (031 comment) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/164761 (owner: 10QChris) [19:42:02] (03PS2) 10Ottomata: Declare namenode directory only once [puppet] - 10https://gerrit.wikimedia.org/r/164762 (owner: 10QChris) [19:49:32] (03CR) 10Dzahn: [C: 032] "thanks Filippo. and also Ariel for comments on https://rt.wikimedia.org/Ticket/Display.html?id=8518" [puppet] - 10https://gerrit.wikimedia.org/r/164651 (owner: 10Dzahn) [20:00:05] gwicke, cscott, subbu: Dear anthropoid, the time has come. Please deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141006T2000). [20:02:59] (03CR) 10Dzahn: "/File[/etc/nagios/nrpe.d/check_check_salt_minion.cfg]: Scheduling refresh of Service[nagios-nrpe-server]" [puppet] - 10https://gerrit.wikimedia.org/r/164651 (owner: 10Dzahn) [20:03:18] godog: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=salt [20:04:46] mutante: can i have your attention please? tampa rt tickets [20:05:01] matanya: whats up [20:05:08] hi: https://gerrit.wikimedia.org/r/#/c/163313/ [20:05:19] this has +2 but doesn't seem merged [20:05:47] matanya: yea, in meeting it was said we wait a few days with merging decom patches until chris actually shut them down [20:06:01] they are fine, but just wait a few days like that [20:06:21] (03PS1) 10Aklapper: Sync custom repository with what's on Bugzilla production server already. 
[wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/165085 [20:06:24] so https://rt.wikimedia.org/Ticket/Display.html?id=6145 is not closed for a few days just for that, ok [20:06:37] matanya: alex does that in other cases too. he sets +2 but does not submit [20:06:52] same for the rest childerns of 6099? [20:07:11] matanya: well,.. "shutdown" is technically resolved [20:07:14] i shut it down [20:07:27] https://rt.wikimedia.org/Ticket/Display.html?id=6163 can be resolved too [20:07:56] i took 6145 [20:07:56] also https://rt.wikimedia.org/Ticket/Display.html?id=6265 if you shut down tarin (afaik, you did) [20:08:42] 6163 let's give it to akosiaris [20:08:53] and nfs1 syslog was moved to eqiad, if i follow the patches , so https://rt.wikimedia.org/Ticket/Display.html?id=7295 can go as well [20:08:54] just to confirm, but it was also said on meeting it's done [20:09:44] and https://rt.wikimedia.org/Ticket/Display.html?id=8512 is the last real blocker, if i read it correctly [20:09:48] 6163 - stolen from jkrauska, given to akos [20:10:08] * matanya is in mop mode [20:12:55] 7295 - given to akos - pending patch is here, matanya: https://gerrit.wikimedia.org/r/#/c/159442/ [20:13:00] but had a blocker on it [20:13:25] matanya: but also quote from godog " I've shut nfs1 down and stopped notifications in icinga" [20:14:29] thanks mutante [20:14:48] matanya: anything you like in https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:decom-Tampa,n,z , heh [20:15:05] those first, DNS later [20:15:41] merge from like wednesday then [20:16:03] !log deployed parsoid sha 13a53ab3 (deploy repo sha 38d44ada7) [20:16:11] Logged the message, Master [20:16:53] matanya: i'm gonna resolve "shutdown fenari" and make "decom fenari" .. so it doesnt block the tracking ticket [20:17:03] good thinking [20:20:32] (03PS1) 10Matanya: snapshot: replace pmtpa with codfw [puppet] - 10https://gerrit.wikimedia.org/r/165088 [20:21:30] any chance springle is around ? [20:21:42] matanya: in a few hours usually [20:23:50] PROBLEM - check if salt-minion is running on searchidx1001 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [20:24:05] mutante: we don't have a host db1001.pmtpa.wmnet, right? it must be db1001.eqiad.wmnet [20:25:04] Non-authoritative answer: [20:25:04] Name: db1001.eqiad.wmnet [20:25:06] matanya: ^ [20:25:09] PROBLEM - check if salt-minion is running on osmium is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [20:25:15] should also be in site.pp [20:26:24] andrewbogott: i think you can shed light on this, as you are the commiter [20:27:09] matanya: which? [20:27:22] modules/mysql_wmf/files/master_id.py [20:27:50] bd808: so yeah, what day and times are good for you for beta + trebuchet play? [20:27:54] and modules/mysql_wmf/manifests/init.pp [20:28:00] RECOVERY - CI tmpfs disk space on gallium is OK: DISK OK [20:28:14] matanya: yes, 1001 is eqiad and combined with "pmtpa" would be wrong [20:28:27] i'll commit a fix for both [20:28:47] apergos: Thursday looks pretty open for me right now. And/or you could try to collaborate with hashar in EU day time. [20:28:58] thanks.. also . see the first salt checks coming in where salt is running twice above, heh [20:29:07] apergos: ^ [20:29:11] andrewbogott: have time to run a few commands for me on neon? 
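Going back to the perf packaging thread from a little earlier: the binary lives in differently named packages per Ubuntu release, which is what the linux-tools-generic vs linux-tools back-and-forth is about. A quick way to check on a given host, using the package names from that discussion (lts/backport kernels may still differ):

    lsb_release -sc                        # precise or trusty
    apt-cache policy linux-tools-generic   # the trusty-era name
    apt-cache policy linux-tools           # the older precise-era name
    sudo apt-get install -y linux-tools-generic   # on trusty
    perf top                               # sanity check once installed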
[20:29:54] I'll talk to hashar too [20:29:55] Coren: hoo: i think it is time to stop responding for a while [20:29:57] thanks! [20:30:04] matanya: I don't think I know anything, those files are just strict copies from their previous location. [20:30:06] only making it worse [20:30:09] YuviPanda: e.g.? [20:30:09] PROBLEM - check_fundraising_jobs on db1025 is CRITICAL: CRITICAL missing_thank_yous=2: recurring_gc_contribs_missed=1371 [critical =300]: recurring_gc_failures_missed=0: recurring_gc_jobs_required=1102: recurring_gc_schedule_sanity=0 [20:30:21] andrewbogott: find what user exactly the icinga daemon is running as? [20:30:32] Lydia_WMDE: If you think it best; but as far as I can tell the discussion is rational and civil (though I may be missing nuances in the German?) [20:30:47] it is the right time to stop before it gets worse [20:31:01] (03PS1) 10Matanya: mysql_wmf: db1001 is in eqiad not in pmtpa [puppet] - 10https://gerrit.wikimedia.org/r/165090 [20:31:02] YuviPanda: user icinga. [20:31:03] I'll trust your judgement in the matter. :-) [20:31:19] andrewbogott: ah, cool. any processess running as user nagios? [20:31:23] thanks for clarifying andrewbogott [20:31:44] YuviPanda: nagios 13977 1 0 19:54 ? 00:00:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d [20:31:51] that's the only one I see [20:31:52] ah, hmm [20:31:56] mutante: if we are still seeing tht isue I will want to know what causes it [20:32:05] I'm going to change everything owned by *root* to be owned by *icinga* now [20:32:12] I think that shouldn't cause anything to go wrong... [20:32:16] also tomorrow I'll look at the check (today, too wiped to do anything sensible now) [20:32:21] andrewbogott: hmm, anything running as root? [20:32:25] well, user mode programs... [20:32:43] err, icinga related ones, even.. [20:32:50] but if they are running as root (horror!) changing user to icinga [20:32:52] shouldn't matter [20:32:53] so nevermind [20:33:14] YuviPanda: any more misc servers in pmtpa? [20:33:20] matanya: no idea ;) [20:33:28] * YuviPanda isn't 'real ops' yet [20:33:47] YuviPanda: Nothing bad ever came out of someone asking "What could go wrong?" just before a change. :-) [20:33:47] YuviPanda: as root? Lots [20:34:05] Coren: heh :D I'm only *writing* the change now :) [20:34:15] andrewbogott: yeah, shouldn't matter tho. I'm not changing file perms... [20:34:39] moving things from root to icinga would make it *more* accessible, not less, so it should be ok [20:34:39] (03PS1) 10Matanya: icinga: no more misc in pmtpa [puppet] - 10https://gerrit.wikimedia.org/r/165091 [20:34:54] andrewbogott: BTW, I took your suggestion and added a nl after every cert when building the chain. [20:35:10] PROBLEM - check_fundraising_jobs on db1025 is CRITICAL: CRITICAL missing_thank_yous=2: recurring_gc_contribs_missed=1371 [critical =300]: recurring_gc_failures_missed=0: recurring_gc_jobs_required=1102: recurring_gc_schedule_sanity=0 [20:35:27] (03CR) 10Dzahn: "Matanya,i think this is needed to close #8512 , maybe you get to ask Ariel during EU hours" [puppet] - 10https://gerrit.wikimedia.org/r/164233 (owner: 10Dzahn) [20:35:40] (03Abandoned) 10coren: Tool Labs: apply /etc/iptables.conf on boot [puppet] - 10https://gerrit.wikimedia.org/r/156599 (https://bugzilla.wikimedia.org/53181) (owner: 10coren) [20:35:58] Coren: so I saw -- I'm trusting godog that extra newlines won't break anything [20:37:01] apergos: yea, so far just on 2 servers out of all of them, osmium and searchidx1001 [20:38:33] yoooo gwicke! 
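The new salt-minion check that just flagged osmium and searchidx1001 is an NRPE process-count check; roughly what it reduces to is below. The regex comes straight from the alert output; the 1:1 range (exactly one matching process) is an assumption about the configured thresholds:

    /usr/lib/nagios/plugins/check_procs -c 1:1 \
        --ereg-argument-array='^/usr/bin/python /usr/bin/salt-minion'
    # on a host with a duplicate daemon this prints:
    #   PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion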
[20:38:45] (hallo from some conference room in the same building that you are in!) [20:39:20] q for you when you are around [20:39:47] (03CR) 10Dzahn: "it found 2 cases where salt-minion is running twice: osmium and searchidx1001, all others have exactly 1 process, none with 0 found so far" [puppet] - 10https://gerrit.wikimedia.org/r/164651 (owner: 10Dzahn) [20:40:15] PROBLEM - check_fundraising_jobs on db1025 is CRITICAL: CRITICAL missing_thank_yous=2: recurring_gc_contribs_missed=1371 [critical =300]: recurring_gc_failures_missed=0: recurring_gc_jobs_required=1102: recurring_gc_schedule_sanity=0 [20:40:52] ^^ ack. [20:43:09] no page (should it?) [20:44:07] it pages fundraisers and me [20:44:35] they tell me to adjust the threshold....should resolve soon [20:45:05] k [20:45:15] PROBLEM - check_fundraising_jobs on db1025 is CRITICAL: CRITICAL missing_thank_yous=2: recurring_gc_contribs_missed=1371 [critical =300]: recurring_gc_failures_missed=0: recurring_gc_jobs_required=1102: recurring_gc_schedule_sanity=0 [20:46:39] ottomata: pong [20:46:56] what's up? [20:48:06] i'm thinking about how to import wiki text (revisions? diffs?) into hadoop [20:48:22] i know there are xml dumps, but I think those might be less ideal than importing from external storage(?) [20:48:27] (i'm reading wikitech pages about this now) [20:48:59] is that possible? Can I query some es machine to get all wiki text? [20:49:33] i see there are filesystem backups of the mysql dbs, if I can't query the live machines directly, I could probably spawn up a mysql db from a backup and hammer it myself? [20:49:41] dunno how to best do this, just starting to think about [20:49:42] it [20:49:46] (i was told that you might have some insight) [20:50:07] PROBLEM - check_fundraising_jobs on db1025 is CRITICAL: CRITICAL missing_thank_yous=2: recurring_gc_contribs_missed=1371 [critical =300]: recurring_gc_failures_missed=0: recurring_gc_jobs_required=1102: recurring_gc_schedule_sanity=0 [20:50:31] (03CR) 10Dzahn: Add a ferm service for ssh on all bastionhosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/164542 (owner: 10Alexandros Kosiaris) [20:51:41] (03CR) 10Andrew Bogott: [C: 031] Sync custom repository with what's on Bugzilla production server already. [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/165085 (owner: 10Aklapper) [20:52:17] ottomata: so are you really looking for wikitext, or are you looking for content to run some analysis on? [20:52:47] halfak: is looking for wikitext, but maybe content in general is fine? he def wants revision history [20:53:38] (03CR) 10Dzahn: [C: 031] "i'll take this since i did the patch, just a minute" [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/165085 (owner: 10Aklapper) [20:53:42] ottomata: there are wikitext dumps including history [20:54:05] yes, xmldumps, right? [20:54:08] yup [20:54:15] i think that is less than ideal for hadoop. [20:54:21] they are not splittable very easily [20:54:31] I have a script that imports them into cassandra [20:54:40] into what format? [20:54:43] you could probably adapt that to do the same for hadoop [20:55:03] it just does HTTP requests for each revision [20:55:05] PROBLEM - check_fundraising_jobs on db1025 is CRITICAL: CRITICAL missing_thank_yous=2: recurring_gc_contribs_missed=1371 [critical =300]: recurring_gc_failures_missed=0: recurring_gc_jobs_required=1102: recurring_gc_schedule_sanity=0 [20:55:12] hm. 
i mean, i've got sqoop, which let's me just run a command to launch mappers that do parallel data load out of mysql [20:55:45] the dumps are already split, so you can run many imports in parallel [20:56:50] ottomata: https://github.com/gwicke/restbase-cassandra/blob/master/test/dump/RashomonDumpImporter.js [20:57:04] and https://github.com/gwicke/restbase-cassandra/blob/master/test/dump/dumpReader.js [20:57:50] andrewbogott: can you tell me if there are any extra sudo rules in neon? [20:58:31] andrew@neon:/etc/sudoers.d$ ls [20:58:32] 50_ops nagios README [20:58:46] andrewbogott: can you cat nagios? [20:59:04] https://dpaste.de/6eNM [20:59:45] hmm, that is fine, I think [21:00:14] PROBLEM - check_fundraising_jobs on db1025 is CRITICAL: CRITICAL missing_thank_yous=2: recurring_gc_contribs_missed=1371 [critical =300]: recurring_gc_failures_missed=0: recurring_gc_jobs_required=1102: recurring_gc_schedule_sanity=0 [21:02:09] gwicke: parsing your scripts, what's http://localhost:8000/enwiki/page/? [21:03:20] andrewbogott, i can get into bastion(.wmflabs.org) but cannot get into deployment-bastion.eqiad.wmflabs from there. [21:03:44] (03CR) 10Ottomata: [C: 032 V: 032] 0.8.1.1-3 release [debs/kafka] - 10https://gerrit.wikimedia.org/r/162458 (owner: 10Plucas) [21:04:02] (03PS1) 10Dzahn: make boron an 'official bastion host' [puppet] - 10https://gerrit.wikimedia.org/r/165098 [21:04:57] (03CR) 10Dzahn: "make boron an actual bastion host first:" [puppet] - 10https://gerrit.wikimedia.org/r/164542 (owner: 10Alexandros Kosiaris) [21:04:59] subbu: deployment-bastion is a bastion, you can connect to it directly. [21:05:13] RECOVERY - check_fundraising_jobs on db1025 is OK: OK missing_thank_yous=2: recurring_gc_contribs_missed=1293: recurring_gc_failures_missed=0: recurring_gc_jobs_required=1102: recurring_gc_schedule_sanity=0 [21:05:14] ottomata: I used this to import those dumps into rashomon (now restbase), which had an API at that location [21:05:43] you'd need to adapt this for your purposes [21:05:47] andrewbogott: what's its DNS name, then? [21:05:54] welll [21:06:02] `ssh -A deployment-bastion.wmflabs.org` doesn't work.. [21:06:09] hmph, its dns name seems to be 'udplog.wmflabs.org' which doesn't make a tone of sense. [21:06:20] you'll have to take that up with hashar or someone involved in that project. [21:06:33] and subbu is trying to become an OCG deployer, so eventually he'll need to ssh into the pdf instances, so i suspect we'll have to fix the root cause in any case. [21:06:45] ottomata: overall it's pretty simple SAX-style XML handling really [21:07:34] cscott: subbu, you do need to be a member of a project in order to ssh to instances. [21:07:35] andrewbogott: `ssh -A udplog.wmflabs.org` doesn't work for me either [21:07:37] Which it seems you aren't? [21:07:40] maybe they need additional project membership to get on deployment-bastion [21:08:22] it's in the deployment project. So, yeah, you need to be a member of deployment-prep. [21:08:23] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 107, down: 0, dormant: 0, excluded: 0, unused: 0 [21:08:23] [[User:Subramanya Sastry]] is listed at https://wikitech.wikimedia.org/wiki/Nova_Resource:Bastion [21:08:32] oh, :8000 is your cassandra url? gwicke? [21:08:59] cscott: yes, which is consistent with the fact that he can log into things in the bastion project... 
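On "the dumps are already split, so you can run many imports in parallel": the full-history XML dumps ship as many numbered bz2 parts, so a crude fan-out is one worker per part. A sketch in which the dump path is illustrative and import_part.sh stands in for whatever per-part loader gets written (a gwicke-style HTTP pusher, an HDFS writer, etc.):

    ls /public/dumps/enwiki-*-pages-meta-history*.xml*.bz2 \
        | xargs -n1 -P8 ./import_part.sh   # 8 parts importing concurrently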
[21:09:05] ah, yes, subbu's not on https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep [21:09:33] ottomata: no, it's the REST API port [21:09:37] no, oh [21:09:56] bwerr, ok, but that is where you are sending the data after you get it [21:09:59] andrewbogott: how do we add him to deployment-prep? [21:10:00] k [21:10:12] cscott: you are a project admin, you can do it. [21:10:12] to gt onto deployment-bstion you should be a member of the deployment-prep project I think [21:10:15] ottomata: how do you normally import data into hdfs? [21:10:26] * cscott relishes his powers [21:10:32] (03CR) 10Dzahn: [C: 032] Sync custom repository with what's on Bugzilla production server already. [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/165085 (owner: 10Aklapper) [21:10:34] cscott: https://wikitech.wikimedia.org/wiki/Special:NovaProject [21:10:43] you may need to log out and in again, sessions have been buggy this week [21:10:50] heh indeed [21:10:58] (03CR) 10Dzahn: [V: 032] Sync custom repository with what's on Bugzilla production server already. [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/165085 (owner: 10Aklapper) [21:10:59] gwicke: depends on where the data comes from [21:11:07] ottomata: http://stackoverflow.com/questions/20929000/which-nodejs-library-should-i-use-to-write-into-hdfs [21:11:09] so far, we've only got stuff from kafka, which is webrequest logs [21:11:12] if you'd like to use node [21:11:15] but, we are also starting to do mysql data [21:11:16] so, sqoop [21:11:25] http://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html#_introduction [21:11:50] the problem is that you don't have a relational db that has all the revision info [21:12:07] afaik at least [21:12:08] that's fine, i'd just be doing an import of everything i think [21:12:12] it is a mysql db, no? [21:12:14] i'd get [21:12:17] ok subbu, you are now a deployment-prepper. [21:12:18] rev_id, text [21:12:20] right? [21:12:21] that's it [21:12:25] subbu: try to login again? [21:12:26] ottomata: not in production [21:12:27] alright .. let me retry [21:12:30] gwicke: no? [21:12:40] andrewbogott: is there a delay for the novarole changes to propagate? [21:12:42] not sure if there is a research db somewhere that has all revisions [21:12:50] nope [21:12:52] ottomata: production uses external store [21:12:56] yes, i know [21:12:58] cscott, in now .. will continue. [21:12:58] that's ok [21:13:08] gwicke: i'm bringing in all that data into one place, hadoop (hive) [21:13:10] it's separate db clusters [21:13:29] so, i will import relevant tables from regular MW databases (page, revision, text) [21:13:41] so probably not not straightforward to use with sqoop [21:13:42] and then also the blobs table from the es servers [21:13:50] i'm not sure why...? [21:13:50] there is no text table [21:14:01] that's in external store [21:14:05] https://wikitech.wikimedia.org/wiki/External_storage#Database_Schema [21:14:15] blobs [21:14:36] where are you btw? [21:14:45] might be easier to chat about this IRL [21:15:29] i'm in a room with a bunch of analytics folks, but i'm not listening to what they are talking about [21:15:31] i will come find you [21:15:44] 3rd floor techy side, ja? [21:16:39] chasemp: modules/mw-rc-irc/files/upstart/ircecho.conf calls pmtpa, where should it point now ? 
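The sqoop route ottomata mentions would look roughly like this for the external-store side: parallel mappers pulling the blobs table straight out of MySQL into HDFS. The table and split column come from the external-storage schema page linked above; the host, database name, credentials path and target directory are all illustrative:

    sqoop import \
        --connect jdbc:mysql://es1001.eqiad.wmnet/cluster25 \
        --username research --password-file /user/otto/.sqoop-pass \
        --table blobs --split-by blob_id \
        --num-mappers 8 \
        --target-dir /wmf/data/raw/externalstore/cluster25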
[21:17:11] it's just a label, name of the bot I think [21:17:27] some mediawiki peeps decided to leave it as the new stuff wasn't far behind [21:17:38] and it would force a lot of ppl to change the bot name and it was more hassle than value [21:17:51] so it can stay for all it care, thanks [21:18:37] how does one become a member of the wikidev group? it seems i am one, but subbu maybe is not? [21:18:53] you are welcome to track it down man and make it consistent :) I would like that but it seemed way more work than worth [21:19:54] matanya: https://gerrit.wikimedia.org/r/#/c/136965/ [21:20:12] gwicke, subbu: backend latency spike coinciding with spike on the parsoid servers: [21:20:13] http://tinyurl.com/kznjpe3 [21:20:18] https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Parsoid+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [21:20:36] (03PS1) 10Yurik: Enable graph ext on outreachwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165099 [21:20:50] 17:19 <+ subbu> so, we reduced a timeout request on varnish failures .. so, whoever was hitting parsoid to crawl stuff is hitting parsoid faster => we are hitting api.php in turn. [21:20:54] 17:20 <+ subbu> cscott, greg-g will be back online in 5 mins or lesser. [21:21:01] continue conversation here when you're back, subbu :) [21:21:08] k [21:23:58] (03PS1) 10Dzahn: Revert "Sync custom repository with what's on Bugzilla production server already." [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/165100 [21:24:00] (03PS2) 10Yurik: Enable graph ext on outreachwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165099 [21:24:35] (03CR) 10Dzahn: [C: 032] "a problem showed up during sanity check before deployment. this was merged but not deployed yet and needs a follow-up change" [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/165100 (owner: 10Dzahn) [21:25:10] (03CR) 10Dzahn: [V: 032] Revert "Sync custom repository with what's on Bugzilla production server already." [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/165100 (owner: 10Dzahn) [21:26:27] ori, greg-g gwicke back. [21:26:57] how does one become a member of the wikidev group? it seems i am one, but subbu maybe is not? [21:27:02] ^ deployers are wikidev [21:30:03] (03CR) 10Alexandros Kosiaris: "Mark's -2 can be removed now and the change merged. I already have +2ed so merging" [puppet] - 10https://gerrit.wikimedia.org/r/159442 (owner: 10Dzahn) [21:30:20] ori: wikidev has become a perversion. Basically it was used as a catchall pre-me [21:30:36] anyone who was anyone who needed a group was thrown in, in some places it is still used as a meaningful thing [21:30:56] but by and large any meaning it was once had was diluted by it becoming the "I can't think of another group"-group [21:30:58] anyone know how a reader of wikipedia gets cookies set? Like, people that never hit apache [21:31:09] so now it's the default group more or less for people, and should be renamed and the whole thing needs some love [21:31:11] but they get GeoIP cookies, etc. [21:31:15] milimetric: Our beloved gadgets do that sometimes [21:31:18] but if you see GID 500 in admin.yaml [21:31:18] milimetric: GeoIP is set by varnish [21:31:21] that's wikidev [21:31:25] ori: danke [21:31:28] and it's most everyones GID [21:31:32] chasemp: sigh. thanks for raising that. we should probably clean that up. [21:32:00] (03PS1) 10Aklapper: Sync custom repository with what's on Bugzilla production server already. 
[wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/165107 [21:33:27] it looks like the PHP API is overloaded [21:33:49] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=API+application+servers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [21:34:54] gwicke, that is because of the parsoid deploy .. that is what ori and greg-g flagged. [21:35:20] yes, the old problem of having to throttle parsoid to avoid overloading the PHP API [21:35:31] anything we should do right now? [21:35:35] in turn because of increased traffic to parsoid after we reduced varnish failure timout from 60s -> 10s .. so those reqs finish faster. [21:35:37] slow down parsoid [21:35:57] it should also slow down once the job queue has emptied [21:35:58] i presume it is kiwix that is crawling? [21:36:32] subbu: do we know that it's crawling? [21:36:37] i don't. [21:37:05] how do we slow down parsoid? [21:37:14] http://ganglia.wikimedia.org/latest/graph.php?c=API%20application%20servers%20eqiad&m=cpu_report&r=day&s=by%20name&hc=4&mc=2&st=1412631397&g=load_report&z=medium&r=day is looking pretty dire [21:37:22] any wikidata people online? [21:37:36] hoo: ^ [21:37:41] a few of them in #wikidata [21:37:44] by looking at varnishncsa, there are a lot of enwiktionary & frwiktionary requests [21:39:56] (03PS1) 10Alexandros Kosiaris: Remove nfs1 from dns [dns] - 10https://gerrit.wikimedia.org/r/165111 [21:39:58] and there are 15k low-prio jobs in the enwiki queue [21:41:59] I think this is faster job queue processing [21:42:34] greg-g: Ok, duh [21:42:41] <_joe_> we need an easy, coordinated way to throttle the jobs [21:43:21] this is a cyclical thing [21:43:33] sometimes the job queue is too large, and we increase the speed [21:43:43] then the API cluster becomes overloaded, and we reduce it again [21:43:47] so it goes.. [21:44:09] <_joe_> meaning that we cyclically realize that we need to coordinate our SOA so that subsystems don't DOS one another? [21:44:39] we know that the API cluster was running at 50+% capacity for a while now [21:44:51] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 10.128.128.1, interfaces up: 35, down: 0, dormant: 0, excluded: 1, unused: 0 [21:45:05] https://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&c=API+application+servers+eqiad&m=cpu_report&s=by+name&mc=2&g=cpu_report [21:45:22] <_joe_> so maybe the problem is this amount of jobs that have appeared just after the parsoid release [21:45:31] <_joe_> why has that happened? [21:45:35] greg-g, ori gwicke, for now, temporarily, i can increase the timeout back to 60 sec .. but, once we debug the issue that is causing the timeouts and fix it, this problem will resurface. [21:45:51] we can reduce the number of parsoid job runners [21:46:01] or, that as well. [21:46:05] probably better. [21:47:35] !log updated OCG to version bbdf4c6400cfbbc6030114ad16e1a6f7025eab2c [21:47:40] <_joe_> so, sorry but I need to understand: did the deploy caused a larger queue? [21:47:41] Logged the message, Master [21:48:04] <_joe_> or the queue was there, and we're just funneling it faster than before? [21:48:21] <_joe_> or both? [21:48:31] _joe_, faster funneling because the jobs are finishing faster [21:48:46] ...because of the deploy, right? [21:48:47] <_joe_> ? [21:49:03] whereas some would take 60s+ earlier because of timeouts on varnish cache. [21:49:26] <_joe_> oh so you did remove the varnish cache? 
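Before touching the runner count it helps to see what is actually queued; the per-type counts behind "15k low-prio jobs in the enwiki queue" can be pulled on a maintenance host with showJobs.php. A sketch, with an illustrative output line:

    mwscript showJobs.php --wiki=enwiki --group
    # e.g.
    #   refreshLinks: 15310 queued; 120 claimed (115 active, 5 abandoned); 0 delayed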
[21:49:34] we reduced the timeout from 60s -> 10s since we realized we don't need a 60s timeout. [21:49:53] so, those jobs that were timing out on varnish now finish faster. [21:49:56] <_joe_> but maybe the upstream servce could use that [21:50:14] <_joe_> you didn't realize that apparently [21:50:17] (03PS1) 10GWicke: Decrease the number of parsoid job runners [puppet] - 10https://gerrit.wikimedia.org/r/165113 [21:50:25] there you go ^^ [21:50:25] <_joe_> ok the solution is simple - rollback. [21:50:36] subbu: did we ever get a confirmation that varnish is in fact the problem there? I still haven't seen such a thing I think, but I don't remember where we left off with it Friday. [21:50:57] bblack, we don't know what is happening yet .. but we figured we don't need a 60s timeout in any case. [21:51:04] right, ok [21:51:21] <_joe_> gwicke: you are basically disconnecting from long-running jobs on api that probably keep churning data while you retry [21:51:31] no [21:51:46] the timeout is only lowered for pure varnish requests [21:51:48] <_joe_> what do you do when you reach the timeout? [21:51:49] it would be nice if we could get a network trace of the traffic between parsoid<->varnish in one of those timeout cases, and confirm the timeout happening [21:52:15] bblack, but, what we know is that some reqs. to varnish never get a response back .. and our next task is to figure out what is going on there ... since you showed on friday that hitting varnish with curl does get responses back and it is probably nota problem on the varnish end. [21:52:29] chasemp: there are several users in data.yaml who are members of the 'absent' group but still have keys in their account stanza... [21:52:32] that's useless, right? [21:52:39] I mean, the key -- harmless but useless? [21:52:53] yes other than I'm not sure if somewhere even a blank key value is required [21:53:04] subbu: well, my test queries are good evidence that it's probably not varnish, but technically there could be a difference in parsoid's query and mine (if nothing else, temporally) [21:53:31] <_joe_> it's very late in my timezone and I'm going to bed. My understanding is that a correct path in rolling back that timeout to 60 seconds [21:53:38] andrewbogott: "absent" group is just a dummy group for running accounts that have "esnure => absent" [21:53:45] so their profiles should have that as well [21:53:49] should write a linter to verify [21:53:52] (I should) [21:53:54] _joe_: normally there shouldn't be any timeout there in any case [21:54:10] the timeout slowed down normal job processing [21:54:12] _joe_, gwicke reduced the # of job runners, and as i indicaed, once we fix the timeout problem, this problem will come back. [21:54:22] the prudent thing to do here is to lower the job parallelism [21:54:45] chasemp: https://wikitech.wikimedia.org/wiki/Offboard#Revoke_shell_access_in_production [21:54:56] once the timeout is fixed, we might need to lower the job parallelism even further [21:54:58] the timeout problem has been preset for a long time, though, it's not new. What was new was the lack of retry countdown before "0 retries" [21:55:00] chasemp: please correct as needed [21:55:12] (and the percentage of long timeouts didn't even change much) [21:55:25] there was never a retry for only-if-cached requests [21:55:30] andrewbogott: looks good man! 
[21:55:36] as it doesn't make sense to retry those [21:55:49] (03CR) 10Giuseppe Lavagetto: "While this can help, as I stated on irc, the correct mitigation from what I understood there is rolling back the timeout to 60 seconds." [puppet] - 10https://gerrit.wikimedia.org/r/165113 (owner: 10GWicke) [21:56:02] it looks though as if those timeouts have been going on for a while now [21:56:07] bblack, yes, we were debating whether the timeout ratio has been this high for a long time and the cache flush exposed it .. or if it something that is new. [21:57:07] e.g. in a log on wtp1009 from back on Sept 14th (parsoid.log.14.gz): [info][eswiki/Altos_del_Rosario?oldid=68783131] completed parsing in 61811 ms [21:57:40] I compared the ratio of those 60+s timeouts to lines in the files on one host friday, and the ratio hadn't changed a lot in general. [21:59:29] I vaguely remember looking into those 60s parses a while ago [21:59:32] (03CR) 10Jgreen: [C: 04-1] "boron is a frack host configured by frack puppet, there's a note to that effect at the bottom of site.pp. Just remove the whole node block" [puppet] - 10https://gerrit.wikimedia.org/r/165098 (owner: 10Dzahn) [21:59:45] bblack: but didn't find the reason back then [21:59:56] !log Reverted wd:Q17939676 to 157541810 and edit=sysop [22:00:02] we now know that it's connected to only-if-cached requests timing out [22:00:03] Logged the message, Master [22:01:06] Before anyone asks: No need for superprotection here, the case is slightly less dangerous (as the serialization format didn't change) [22:01:22] (03CR) 10Dzahn: "Alex: Jeff says boron is not a bastion host, but on https://gerrit.wikimedia.org/r/#/c/164542/2/manifests/site.pp line 1353 says " # TODO" [puppet] - 10https://gerrit.wikimedia.org/r/165098 (owner: 10Dzahn) [22:01:51] (03PS1) 10Andrew Bogott: Offboard swalling [puppet] - 10https://gerrit.wikimedia.org/r/165114 [22:02:40] (03CR) 10GWicke: "The correct thing is to not rely on a timeout to slow down job processing. Once the reason for the timeout is fixed, parsoid will pick up " [puppet] - 10https://gerrit.wikimedia.org/r/165113 (owner: 10GWicke) [22:04:29] chasemp: https://gerrit.wikimedia.org/r/#/c/165114/ [22:04:39] gwicke: just to make sure I'm looking at things right: in these cases with only-if-cached the parsoid software running on wtp100X should be connecting over http to parsoid-lb.eqiad.wikimedia.org right? [22:04:49] bblack: yes [22:05:10] andrewbogott: what's the rt ticket? [22:05:34] (03PS1) 10Dzahn: remove boron node from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/165117 [22:05:45] (03PS2) 10Andrew Bogott: Offboard swalling [puppet] - 10https://gerrit.wikimedia.org/r/165114 [22:05:46] 8507 [22:06:13] bblack: subbu's patch only lowered the timeout for only-if-cached requests, so the fact that this caused parsoid to speed up processing means that at least a large portion of those timeouts are only-if-cached requests [22:06:25] (03CR) 10Dzahn: "Jeff, thanks! well then -> https://gerrit.wikimedia.org/r/#/c/165117/" [puppet] - 10https://gerrit.wikimedia.org/r/165098 (owner: 10Dzahn) [22:06:54] https://gerrit.wikimedia.org/r/#/c/164717/ is the patch in qn. 
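The ratio comparison bblack describes can be redone from the rotated parsoid logs using the "completed parsing in N ms" line quoted above; the log path is an assumption:

    zcat /var/log/parsoid/parsoid.log.14.gz \
        | grep -o 'completed parsing in [0-9]* ms' \
        | awk '$4 > 60000 { slow++ } END { printf "%d of %d parses took >60s\n", slow, NR }'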
[22:08:07] (03CR) 10Rush: [C: 032] "nice" [puppet] - 10https://gerrit.wikimedia.org/r/165114 (owner: 10Andrew Bogott) [22:09:34] gwicke: I've been sitting on wtp1009 in a tcpdump command for a while now, and seeing zero traffic at all, of any kind, between wtp1009 <-> parsoid-lb.eqiad.wikimedia.org (except when I generate it via a test curl command on that host) [22:10:19] let me look at config file for where we point to (for caches) [22:10:25] (also, is it remotely possible that there's some IPv6 problem going on here? the resolver library is very likely giving you a v6 lookup on that name, perhaps the software doesn't deal with that well) [22:10:28] bblack: just double-checked the parsoid config.. it looks like there's an IP in there [22:10:43] parsoidConfig.parsoidCacheURI = 'http://10.2.2.29/'; [22:10:44] parsoidConfig.parsoidCacheURI = 'http://10.2.2.29/'; [22:11:05] that's parsoidcache, not parsoid-lb [22:12:29] ah, you mean parsoid-lb as in the backend [22:12:38] (03CR) 10Dzahn: [C: 032] " @zirconium:/home/dzahn/modifications#" [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/165107 (owner: 10Aklapper) [22:12:44] I thought you meant the private version of parsoid-lb.wikimedia.org [22:12:59] I just meant "whatever hostname the software is connecting to" :) [22:13:11] so my curl tests friday were invalid, they're not testing against the same thing you're connecting to [22:13:14] I'm looking at https://wikitech.wikimedia.org/wiki/Parsoid#Caching_and_load_balancing [22:13:25] 10.2.2.29 is the LVS in front of the Varnishes [22:13:41] parsoidcache.svc.eqiad.wmnet [22:14:01] gwicke, bblack perhaps the job-runner patch should be pushed before this cache problem is fixed .. just in case. [22:14:03] so that looks correct to me [22:14:39] (03CR) 10Dzahn: [V: 032] Sync custom repository with what's on Bugzilla production server already. [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/165107 (owner: 10Aklapper) [22:15:55] I'm sure it is correct, it's just my testing was incorrect [22:16:09] I got the idea that it was parsoid-lb.eqiad.wikimedia.org somehow on Friday [22:16:23] that should map to the same LVS [22:16:37] well, yes, it should [22:16:56] a bit confusing naming though [22:17:05] and I do seem to get the same results from both on various test queries [22:17:12] parsoid-lb.eqiad.wikimedia.org mapping to parsoidcache.svc.eqiad.wmnet internally [22:17:18] (03PS1) 10Andrew Bogott: Revoke access for bsitu. [puppet] - 10https://gerrit.wikimedia.org/r/165122 [22:17:52] (03CR) 10Rush: [C: 031] Revoke access for bsitu. [puppet] - 10https://gerrit.wikimedia.org/r/165122 (owner: 10Andrew Bogott) [22:18:07] (03CR) 10Andrew Bogott: [C: 032] Revoke access for bsitu. [puppet] - 10https://gerrit.wikimedia.org/r/165122 (owner: 10Andrew Bogott) [22:18:14] bblack: same here [22:18:35] we'll need to double-check those cache requests [22:18:44] on the parsoid side [22:21:39] I'm trying to dig through the puppetization of it now. the parsoid/parsoidcache there is a bit confusing in places/ [22:21:59] bblack: could you look at the job runner patch, so that we stop DOSing the API cluster? [22:22:06] hoping maybe somewhere something is misconfigured in a backend list [22:22:08] (03PS3) 10Alexandros Kosiaris: decom nfs1 [puppet] - 10https://gerrit.wikimedia.org/r/159442 (owner: 10Dzahn) [22:22:32] gwicke: oh I was just chiming in earlier when I saw a familiar topic, I figured someone else was already on any patch [22:22:45] what patch? 
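Given the config above, the traffic to watch on a wtp host is whatever goes to the parsoidCacheURI address, not the public parsoid-lb name that Friday's curl tests used. A capture sketch (the interface name is an assumption):

    sudo tcpdump -nn -s0 -i eth0 host 10.2.2.29 and tcp port 80
    # add -w /tmp/parsoidcache.pcap to keep a trace of one of the stalled
    # cache-only requests for later inspection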
[22:22:46] I think _joe_ went to bed [22:23:02] bblack: https://gerrit.wikimedia.org/r/#/c/165113/ [22:23:03] oh right [22:23:36] thanks! [22:23:52] (03PS2) 10BBlack: Decrease the number of parsoid job runners [puppet] - 10https://gerrit.wikimedia.org/r/165113 (owner: 10GWicke) [22:23:57] (03CR) 10BBlack: [C: 032 V: 032] Decrease the number of parsoid job runners [puppet] - 10https://gerrit.wikimedia.org/r/165113 (owner: 10GWicke) [22:24:49] (03CR) 10Dzahn: "eh, i should have said "make hooft an actual bastion host first"" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/164542 (owner: 10Alexandros Kosiaris) [22:25:33] bblack: I suspect that something about those requests was broken on the parsoid side a while ago [22:25:53] (03PS1) 10Yuvipanda: icinga: Make all config files belong to icinga:icinga [puppet] - 10https://gerrit.wikimedia.org/r/165123 [22:27:06] (03PS2) 10Yuvipanda: icinga: Make all config files belong to icinga:icinga [puppet] - 10https://gerrit.wikimedia.org/r/165123 [22:27:32] (03CR) 10Yuvipanda: [C: 04-1] "I don't have access to neon, so would need thorough review by someone else." [puppet] - 10https://gerrit.wikimedia.org/r/165123 (owner: 10Yuvipanda) [22:28:30] !log on mw1088 debugging crash bug 71542 [22:28:39] Logged the message, Master [22:29:02] andrewbogott: can you run ls -LR on : 1. /etc/nagios, 2. /etc/icinga, 3. /var/lib/icinga, 4. /var/lib/nagios for me? [22:29:42] YuviPanda: yep, stay tuned... [22:29:46] on neon? [22:29:47] cool, thanks [22:30:03] andrewbogott: ya [22:30:06] gwicke: well now that I'm looking at the right traffic, I have more to go on and I'll look deeper later tonight and see what I can find. Just in some quick sampling, though, I have what (at least appears, at first?) to be evidence of one of these wtp->parsoidcache requests basically pulling down 4GB of data and then resetting the connection at that point? I have no wtf is going on there... [22:30:32] (03PS4) 10Dzahn: decom tarin (pmtpa poolcounter) [puppet] - 10https://gerrit.wikimedia.org/r/152154 [22:31:10] bblack: that sounds very fishy indeed [22:32:39] (03PS5) 10Dzahn: decom tarin (formerly pmtpa ganglia) [puppet] - 10https://gerrit.wikimedia.org/r/152154 [22:44:33] andrewbogott, this is the cassandra update rt ticket: https://rt.wikimedia.org/Ticket/Display.html?id=8530 [22:44:43] eh, sorry [22:44:53] ottomata: [22:44:55] ^^ [22:47:42] greg-g, ori gwicke looks like load is coming back down after the # job runners were reduced. [22:48:31] gwicke: what time is that meeting? I kinda wanna join [22:49:12] subbu: good deal [22:49:22] subbu: now, how do we prevent it from happening in the future? :) [22:49:40] greg-g: The deployments schedule - is there a template for a week's schedule that you could use instead of copying the last week's? [22:49:44] (I wasn't following your previous discussion, and dont' care about the details, just let me know what you need from me ;) ) [22:49:58] marktraceur: I copy from last week's [22:50:00] ottomata: tomorrow at 9 [22:50:03] I added you [22:50:15] greg-g: 'cause stuff like https://wikitech.wikimedia.org/w/index.php?diff=prev&oldid=130110 happens [22:50:17] marktraceur: if you want a fun 20% project..... [22:50:20] When we take notes [22:50:27] We don't do 20% time anymore I thought? 
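One way to spot-check the cache-only path parsoid uses is to hand-time a request with the same Cache-control header against the parsoidcache service; only-if-cached asks the cache to answer from cache or fail fast instead of contacting a backend. The URL path and page title are illustrative:

    curl -s -o /dev/null --max-time 15 \
        -w 'status=%{http_code} time=%{time_total}s bytes=%{size_download}\n' \
        -H 'Cache-control: only-if-cached' \
        http://parsoidcache.svc.eqiad.wmnet/enwiki/Main_Page
    # a healthy cache answers in milliseconds; a multi-gigabyte body or a hang
    # up to --max-time reproduces the behaviour described above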
[22:50:37] haha [22:50:40] Just slavish devotion to the Holy Priorities of The Foundation [22:50:43] haha re diff [22:50:46] greg-g: ideally, have more headroom on the API cluster [22:50:48] we have 10% research time [22:50:55] and yes, all effort towards top 5ish priorities only [22:50:58] greg-g, i guess it depends on api capacity, rate at which jobs get queued into the job-queue ... and how fast we want it drained. [22:51:05] PRAISE BE TO THE TOP FIVE. [22:51:11] :) [22:51:20] a while ago we were shooting for 20% utilization IIRC [22:51:44] gwicke: given we don't yet have an SLA for anything, should the requesting service self-throttle when load is too high on the api? [22:52:17] gwicke, but if the varnish issue is fixed, some of the reqs. to the api cluster (which woudl otherwise have been served by varnish hit) would also reduce. [22:52:19] marktraceur: but seriously.... give me a calendar that isn't google calendar that works [22:52:27] greg-g: I proposed to incorporate such throttling into the job runner in the past [22:52:31] (03CR) 10Alexandros Kosiaris: [C: 032] decom nfs1 [puppet] - 10https://gerrit.wikimedia.org/r/159442 (owner: 10Dzahn) [22:53:14] gwicke: where ever I guess, I just think we might have to be aware of this kind of issue as we do more SOA (where a solution is usually: put some more logic somewhere to prevent stampedes) [22:53:25] (03CR) 10Alexandros Kosiaris: [C: 032] Remove nfs1 from dns [dns] - 10https://gerrit.wikimedia.org/r/165111 (owner: 10Alexandros Kosiaris) [22:53:38] why the fuck is there a /var/lib/icinga/icinga.log?!!?! [22:53:48] greg-g: You'd prefer off-wiki? [22:53:58] greg-g: <5332F7B3.9010900@wikimedia.org> [22:54:14] marktraceur: not inherently. I'd prefer "useful and easy to use" [22:54:19] Hm. [22:54:40] gwicke: what is that? [22:54:41] thread 'Job queue length alert', March 26 2014 [22:54:43] oh [22:54:43] Obviously the answer is a giant org-mode file on tin. [22:54:53] greg-g: it's a message id [22:54:57] yeah, brain fart [22:55:59] (03PS3) 10Yuvipanda: icinga: Make all config files belong to icinga:icinga [puppet] - 10https://gerrit.wikimedia.org/r/165123 [22:56:14] gwicke: tl;dr ;) [22:56:21] (the thread) [22:56:35] (though mutt says I read it before...) [22:58:07] hah, to search for message id in gmail you have to prepend with " rfc822msgid:", not just oh I don't know "id:" [22:59:06] greg-g, and the other thought is that once the api cluster moves to hhvm, capacity will also go up likely. [22:59:28] sure, but that only delays the issue again [22:59:42] "Load increases over time" - Someone Smart [23:00:05] RoanKattouw, ^d, marktraceur, MaxSem, yurik: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141006T2300). [23:00:12] I'll do it [23:00:15] Ohai jouncebot. [23:00:19] Yurik has a patch in anyway? 
[23:00:26] yep [23:00:28] 2 [23:01:19] (03CR) 10MaxSem: [C: 032] Added 4 ppl to notifyOnAllChanges for zero portal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164902 (owner: 10Yurik) [23:01:26] (03Merged) 10jenkins-bot: Added 4 ppl to notifyOnAllChanges for zero portal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164902 (owner: 10Yurik) [23:01:41] (03CR) 10MaxSem: [C: 032] Enable graph ext on outreachwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165099 (owner: 10Yurik) [23:01:49] (03Merged) 10jenkins-bot: Enable graph ext on outreachwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165099 (owner: 10Yurik) [23:01:54] yurikR, poor man's code review? :P [23:02:01] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [23:03:07] eh, whose comment-only change wasn't merged on tin? :P [23:04:00] MaxSem, eh? [23:04:07] oh [23:04:09] lol [23:04:33] mostly its for the config changes - i want our team to be notified whenever partner makes a change [23:04:37] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [23:05:15] !log maxsem Synchronized wmf-config: https://gerrit.wikimedia.org/r/#/c/165099/ https://gerrit.wikimedia.org/r/#/c/164902/ (duration: 00m 04s) [23:05:22] Logged the message, Master [23:06:33] !log maxsem Synchronized php-1.25wmf1/extensions/MobileFrontend/: (no message) (duration: 00m 04s) [23:06:38] Logged the message, Master [23:06:44] !log maxsem Synchronized php-1.25wmf2/extensions/MobileFrontend/: (no message) (duration: 00m 04s) [23:06:50] Logged the message, Master [23:11:05] (03PS6) 10Dzahn: decom tarin (formerly pmtpa ganglia) [puppet] - 10https://gerrit.wikimedia.org/r/152154 [23:12:30] (03CR) 10GWicke: "API server load dropped back to ~69% with this patch applied. Still overloaded, but no longer critical." [puppet] - 10https://gerrit.wikimedia.org/r/165113 (owner: 10GWicke) [23:12:36] (03PS7) 10Dzahn: decom tarin (formerly pmtpa ganglia) [puppet] - 10https://gerrit.wikimedia.org/r/152154 [23:14:01] (03CR) 10Dzahn: [C: 032] "checked with cmjohnson - already disconnected physically and being wiped" [puppet] - 10https://gerrit.wikimedia.org/r/152154 (owner: 10Dzahn) [23:15:33] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [23:17:14] ACKNOWLEDGEMENT - Host tarin is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT #6265 - Tampa decom [23:32:28] (03PS2) 10Dzahn: remove tarin [dns] - 10https://gerrit.wikimedia.org/r/164128 [23:37:09] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [500.0] [23:43:29] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [23:49:14] !log tarin, nfs-1 - revoked salt key,puppet cert, stored configs [23:49:26] Logged the message, Master
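The final decom !log above ("revoked salt key, puppet cert, stored configs") typically maps to something like the following on the salt and puppet masters; the FQDNs are illustrative, and the stored-configs cleanup is assumed to be handled by the Puppet 3 `node clean` face:

    salt-key -d tarin.pmtpa.wmnet -y      # revoke/remove the minion's key on the salt master
    puppet cert clean tarin.pmtpa.wmnet   # revoke and remove the puppet certificate
    puppet node clean tarin.pmtpa.wmnet   # drop the node's cached facts / stored configs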