[00:51:26] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 320 seconds [00:51:36] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 333 seconds [00:52:36] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds [00:53:27] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -1 seconds [00:59:37] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:21:55] springle: around? [01:22:54] Or any other op? [01:35:40] hoo: what's up [01:35:57] springle: Could you quickly restart gitblit on antimony? [01:36:03] Seems it hung up again [01:36:09] has already been done today (see SAL) [01:40:40] restarting the service doesn't seem to do it [01:40:44] * springle pokes around [01:40:59] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 54239 bytes in 0.520 second response time [01:41:41] hoo: should http://gitblit.wiimedia.org/ come back too? 
[01:41:53] no, git.wikimedia.org is the domain [01:41:56] works again AFAIS [01:42:03] oh right [01:42:05] cool [01:42:15] Thank you :) [01:42:24] !log restarted gitblit [01:42:37] Logged the message, Master [02:15:19] !log LocalisationUpdate completed (1.25wmf2) at 2014-10-14 02:15:19+00:00 [02:15:28] Logged the message, Master [02:27:23] !log LocalisationUpdate completed (1.25wmf3) at 2014-10-14 02:27:23+00:00 [02:27:32] Logged the message, Master [03:34:53] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Oct 14 03:34:53 UTC 2014 (duration 34m 52s) [03:35:01] Logged the message, Master [04:13:23] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [04:20:30] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:31:22] PROBLEM - puppet last run on lvs4003 is CRITICAL: CRITICAL: puppet fail [04:49:53] RECOVERY - puppet last run on lvs4003 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [05:14:43] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [05:47:43] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 54239 bytes in 4.202 second response time [06:28:23] PROBLEM - puppet last run on amslvs1 is CRITICAL: CRITICAL: puppet fail [06:29:32] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:34] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:15] (CR) Krinkle: "FIXME: Broke integration-slave1006.eqiad.wmflabs" [puppet] - https://gerrit.wikimedia.org/r/163945 (owner: Chad) [06:36:19] (CR) Krinkle: "https://bugzilla.wikimedia.org/show_bug.cgi?id=72014" [puppet] - https://gerrit.wikimedia.org/r/163945 (owner: Chad) [06:46:23] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 19 seconds ago 
with 0 failures [06:46:23] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:47:22] RECOVERY - puppet last run on amslvs1 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:54:42] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:54:49] * _joe_ looking at ocg [06:57:33] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 54239 bytes in 0.770 second response time [07:05:52] It now cleans disk but has a problem with lingering processes? https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=proc_run&s=by+name&c=PDF+servers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [07:08:11] <_joe_> Nemo_bis: ocg is working fine [07:08:27] <_joe_> now [07:08:42] <_joe_> it had a problem of hanging latexer processes [07:09:03] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:09:24] <_joe_> which cscott should address when he's back from bermudas :) [07:25:33] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 54239 bytes in 1.254 second response time [07:32:53] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:33:29] :) [07:44:03] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 54239 bytes in 6.095 second response time [07:49:35] (03CR) 10Hashar: "The puppetmaster init script on integration-puppetmaster.eqiad.wmflabs does clear out the pid file. 
Looking at the script it uses start-" [puppet] - 10https://gerrit.wikimedia.org/r/166516 (owner: 10Alexandros Kosiaris) [08:20:12] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 226139 msg: ocg_render_job_queue 1014 msg (=500 critical) [08:20:23] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 226430 msg: ocg_render_job_queue 1119 msg (=500 critical) [08:20:47] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 226658 msg: ocg_render_job_queue 1190 msg (=500 critical) [08:22:44] <_joe_> and here we go! [08:29:13] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 229567 msg: ocg_render_job_queue 29 msg [08:29:42] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 229592 msg: ocg_render_job_queue 0 msg [08:29:52] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 229616 msg: ocg_render_job_queue 0 msg [08:40:41] I wish those notifications messages to point to some graphite graphs [08:41:16] <_joe_> hashar: you have ganglia graphs [08:41:24] <_joe_> because everyone is on the same page here [08:46:25] yeah I guess I would know where to find the information if I was tasked in handling those errors [08:46:25] :D [08:47:15] Hi, there is a user in #wikipedia-nl who received an e-mail from Wikimedia with a link to the donate page of the WMF (links.email.donate.wikimedia.org). But it seems there is a problem with the SSL certificate: "you reached a server what identifies as *.links.mkt41.net" [08:47:15] Does anyone know something about this? 
[08:48:42] indeed [08:51:28] Jurgen|Cloud: seems the DNS entry has been set like 2 years ago , maybe ssl got enabled recently [08:51:44] Jurgen|Cloud: probably want to fill a bug for ops by e-mailing ops-requests@rt.wikimedia.org [08:53:32] <_joe_> hashar: I strongly doubt that [08:53:55] <_joe_> also, I don't think ops have a lot to do with that [08:54:06] <_joe_> but yeah we can start investigating maybe [08:54:35] (03PS1) 10KartikMistry: Beta: Add missing link to init with upstart-job [puppet] - 10https://gerrit.wikimedia.org/r/166535 [08:55:44] can probably be triaged to Jeff Green / Donate team [08:57:19] okay, I'll send an e-mail [08:58:20] (03CR) 10Hashar: [C: 04-1] Beta: Add missing link to init with upstart-job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/166535 (owner: 10KartikMistry) [09:07:43] (03PS2) 10KartikMistry: Beta: Add missing link to init with upstart-job [puppet] - 10https://gerrit.wikimedia.org/r/166535 [09:12:48] (03CR) 10Hashar: [C: 031] Beta: Add missing link to init with upstart-job [puppet] - 10https://gerrit.wikimedia.org/r/166535 (owner: 10KartikMistry) [09:14:16] (03PS8) 10KartikMistry: Apertium service configuration for Beta [puppet] - 10https://gerrit.wikimedia.org/r/165485 [09:28:47] (03CR) 10Hashar: "That fixed the issue we had on the beta cluster instance deployment-cxserver03.eqiad.wmflabs \O/" [puppet] - 10https://gerrit.wikimedia.org/r/166535 (owner: 10KartikMistry) [09:29:13] (03PS3) 10Christopher Johnson (WMDE): Change phab_update_tag script to remove library lock file [puppet] - 10https://gerrit.wikimedia.org/r/166406 [09:29:37] (03PS1) 10Giuseppe Lavagetto: apache: fix diamond stats collections. [puppet] - 10https://gerrit.wikimedia.org/r/166538 [09:30:13] Thanks hashar! 
[09:34:17] (03CR) 10Christopher Johnson (WMDE): Change phab_update_tag script to remove library lock file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/166406 (owner: 10Christopher Johnson (WMDE)) [09:35:24] * aude wonders why we have log entries for mw1163 on wmf/1.25wmf1 [09:35:26] (03CR) 10Christopher Johnson (WMDE): Change phab_update_tag script to remove library lock file (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/166406 (owner: 10Christopher Johnson (WMDE)) [09:35:33] what is running that version? [09:35:47] it's the job runner, right? [09:36:17] (03PS4) 10Christopher Johnson (WMDE): Change phab_update_tag script to remove library lock file [puppet] - 10https://gerrit.wikimedia.org/r/166406 [09:36:41] it has stuff on wmf1 [09:36:49] _joe_: do you know? [09:37:28] <_joe_> aude: it's not a jobrunner, it's an appserver [09:37:37] ah ok [09:37:45] i think it was a jobrunner [09:37:48] or i am confused [09:37:55] <_joe_> mw1163? no [09:37:58] ok [09:39:02] <_joe_> aude: mw1053 was once [09:39:09] <_joe_> aude: but, lemme check something [09:39:15] it's not in dsh [09:39:17] i can add it [09:39:31] <_joe_> aude: I added it! [09:39:36] <_joe_> I'm sure I did [09:39:42] <_joe_> 1 sec [09:40:16] <_joe_> aude: https://gerrit.wikimedia.org/r/#/c/165466/ [09:40:38] <_joe_> did I do something wrong? [09:40:49] not in apaches ? 
[09:41:04] looks reorganized since i added one last time [09:41:10] <_joe_> I read on wikitech the dsh group was mediawiki-install [09:41:18] <_joe_> but lemme check again [09:41:20] i think so [09:42:30] <_joe_> aude: awww I got what I did wrong [09:42:34] patch coming [09:42:37] <_joe_> lemme run sync-common first [09:42:43] ok [09:42:44] <_joe_> thanks [09:42:54] <_joe_> it's been online for 5 days [09:43:14] <_joe_> thanks a lot for spotting this as well [09:43:15] (03PS1) 10Aude: Add mw1163 to mediawiki-installation dsh group [puppet] - 10https://gerrit.wikimedia.org/r/166541 [09:43:19] just was confused to see one of our bugs not fixed [09:43:30] according to the logs [09:44:29] (03PS2) 10Aude: Add mw1163 to mediawiki-installation dsh group [puppet] - 10https://gerrit.wikimedia.org/r/166541 [09:44:30] <_joe_> !log running sync-common on mw1163 [09:44:41] Logged the message, Master [09:44:49] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Add mw1163 to mediawiki-installation dsh group [puppet] - 10https://gerrit.wikimedia.org/r/166541 (owner: 10Aude) [09:44:55] thanks [09:46:25] !log upload python-elasticsearch to trusty-wikimedia [09:46:29] Logged the message, Master [09:49:05] (03PS1) 10KartikMistry: Beta: Define cxserver port [puppet] - 10https://gerrit.wikimedia.org/r/166542 [09:57:55] (03PS1) 10Filippo Giunchedi: eqiad-prod: reduce weight on ms-be1013/1014/1015 to help shed some load [software/swift-ring] - 10https://gerrit.wikimedia.org/r/166544 [09:58:07] (03CR) 10Hashar: "Previously ferm::service would generates:" [puppet] - 10https://gerrit.wikimedia.org/r/166542 (owner: 10KartikMistry) [09:58:07] RECOVERY - puppet last run on elastic1016 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [09:58:14] (03CR) 10Hashar: [C: 031] Beta: Define cxserver port [puppet] - 10https://gerrit.wikimedia.org/r/166542 (owner: 10KartikMistry) [10:04:08] paravoid: https://gerrit.wikimedia.org/r/166544 should help at least a bit to shed some 
load, did you come across a particularly busy file/partition? [10:38:50] !log enable container sync on non-sharded originals containers [10:38:57] Logged the message, Master [10:50:03] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 6988.01305874 [10:59:33] (CR) JanZerebecki: [C: 1] Change phab_update_tag script to remove library lock file [puppet] - https://gerrit.wikimedia.org/r/166406 (owner: Christopher Johnson (WMDE)) [11:03:07] (PS1) Giuseppe Lavagetto: beta: use hiera for nutcracker host list [puppet] - https://gerrit.wikimedia.org/r/166551 [11:06:34] (CR) Giuseppe Lavagetto: [C: 2] beta: use hiera for nutcracker host list [puppet] - https://gerrit.wikimedia.org/r/166551 (owner: Giuseppe Lavagetto) [11:16:34] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [11:16:34] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [11:21:05] (PS1) Giuseppe Lavagetto: puppet::self::master : fix hiera dir name [puppet] - https://gerrit.wikimedia.org/r/166552 [11:21:12] <_joe_> ^^ that's me and it's ok [11:21:26] (PS2) Giuseppe Lavagetto: puppet::self::master : fix hiera dir name [puppet] - https://gerrit.wikimedia.org/r/166552 [11:22:00] (CR) Giuseppe Lavagetto: [C: 2 V: 2] puppet::self::master : fix hiera dir name [puppet] - https://gerrit.wikimedia.org/r/166552 (owner: Giuseppe Lavagetto) [11:24:52] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [11:24:54] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[12:18:14] (PS1) Springle: script for gently (re)building an eventlogging slave [software] - https://gerrit.wikimedia.org/r/166558 [12:21:20] (CR) Springle: [C: 2] script for gently (re)building an eventlogging slave [software] - https://gerrit.wikimedia.org/r/166558 (owner: Springle) [12:25:33] !log Jenkins: upgrading Gearman plugin to fix jobs registrations ( cherry picked https://review.openstack.org/#/c/125755/ and compiled it via maven ). [12:25:38] Logged the message, Master [12:27:43] !log Jenkins restarting [12:27:47] Logged the message, Master [13:03:39] (CR) Alexandros Kosiaris: "Minor typo, otherwise +2. I already built a test package out of it and all seems fine" (1 comment) [debs/contenttranslation/lttoolbox] - https://gerrit.wikimedia.org/r/163548 (owner: KartikMistry) [13:03:51] (CR) Alexandros Kosiaris: [C: -1] Added initial Debian packaging [debs/contenttranslation/lttoolbox] - https://gerrit.wikimedia.org/r/163548 (owner: KartikMistry) [13:55:59] !log enable container sync on non-commons sharded containers [13:56:05] Logged the message, Master [14:00:34] (PS2) Reedy: Always use debian packaged texvc [mediawiki-config] - https://gerrit.wikimedia.org/r/165493 [14:33:46] zend_mm_heap corrupted [14:33:56] I give up using PHP / pear / composer etc [14:35:20] / computers [14:38:30] (CR) Andrew Bogott: [C: 2] Labs: allow for growing volumes [puppet] - https://gerrit.wikimedia.org/r/166351 (owner: coren) [14:56:12] Computers are the worst. [14:56:18] And on that note, almost time for SWAT! [14:56:26] Sausages are the wurst. 
[14:56:55] Looks like anomie managed to stick in two patches despite being at the offsite thing [14:57:00] (03PS3) 10KartikMistry: Added initial Debian packaging [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/163548 [14:57:14] And James_F|Away has a patch in but is still |Away [14:57:28] akosiaris: ^ [14:57:40] marktraceur: Did it on Friday, had to wait until now. Moving phase2 to Wednesday will make things a bit more convenient. [14:57:53] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: puppet fail [14:58:11] anomie: Which patch, the wmf2 one? [14:58:30] marktraceur: They're the same patch, both branches [14:58:41] (03CR) 10KartikMistry: Added initial Debian packaging (031 comment) [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/163548 (owner: 10KartikMistry) [14:58:45] Right, but phase2 is Wikipedias, which have wmf2 I think [14:58:53] Unless you mean something else [14:59:37] I meant that people seem to have a habit of finding bugs once it hits enwiki, and by then it's often after I've ended for the day. [14:59:45] And no SWAT on Friday, so... [15:00:08] Right right [15:00:29] anomie: So only push the wmf3 patch today, do wmf2 tomorrow? [15:00:40] marktraceur: Huh? No, do both today [15:00:44] Oh, OK [15:00:59] Moving phase2 to Wednesday will make things a bit more convenient. [15:01:06] While I do it you can explain what you meant [15:01:11] James_F: You're ready too? [15:01:16] marktraceur: Did you not see Greg's email last week? [15:01:34] Which one? [15:01:49] (03PS2) 10Giuseppe Lavagetto: apache: fix diamond stats collections. [puppet] - 10https://gerrit.wikimedia.org/r/166538 [15:03:01] marktraceur: Yes. [15:03:36] (03CR) 10Filippo Giunchedi: [C: 04-1] Always use debian packaged texvc (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165493 (owner: 10Reedy) [15:03:38] James_F: Sweet, thanks :) [15:03:44] I'm doing anomie's first. [15:03:49] * James_F nods. 
[15:03:59] marktraceur: Oh, 3 weeks ago. Time flies. [15:04:28] Oh, yeah, good idea [15:04:52] * marktraceur twiddles thumbs waiting for Jenkins [15:08:18] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Added initial Debian packaging [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/163548 (owner: 10KartikMistry) [15:09:18] Jenkins is worse than normal today [15:09:33] wtf is vendor-integration anyway [15:09:45] That's bd808S's amazing test-all-the-things pipeline. [15:09:46] (03CR) 10Reedy: Always use debian packaged texvc (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165493 (owner: 10Reedy) [15:09:52] (03PS3) 10Giuseppe Lavagetto: apache: fix diamond stats collections. [puppet] - 10https://gerrit.wikimedia.org/r/166538 [15:09:58] Checks MW-core plus all extensions, or something. [15:10:05] Huh. [15:10:08] Well, wmf2 one is ready [15:11:00] (03CR) 10Filippo Giunchedi: Always use debian packaged texvc (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165493 (owner: 10Reedy) [15:11:01] (03CR) 10Giuseppe Lavagetto: [C: 032] apache: fix diamond stats collections. [puppet] - 10https://gerrit.wikimedia.org/r/166538 (owner: 10Giuseppe Lavagetto) [15:11:36] !log marktraceur Synchronized php-1.25wmf2/includes/api/ApiQueryBacklinks.php: [SWAT] [wmf2] API: Fix ApiQueryBacklinks redirlinks (duration: 00m 06s) [15:11:38] anomie: Testy test? 
[15:11:42] Logged the message, Master [15:11:50] marktraceur: Confirmed [15:11:57] Sweet [15:12:00] Doing wmf3 [15:12:36] !log marktraceur Synchronized php-1.25wmf3/includes/api/ApiQueryBacklinks.php: [SWAT] [wmf3] API: Fix ApiQueryBacklinks redirlinks (duration: 00m 05s) [15:12:37] anomie: Test again :) [15:12:41] Logged the message, Master [15:12:47] marktraceur: Also confirmed [15:12:50] (03CR) 10GWicke: [C: 031] Allow the default rendering modes on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166410 (owner: 10Physikerwelt) [15:12:53] Awesome, you are free [15:13:16] James_F: OK, breaking changes to ooui? :) [15:13:25] marktraceur: Unbreaking. :-) [15:13:27] I assume the mailing list was informed [15:14:02] PROBLEM - RAID on db1050 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [15:14:24] marktraceur: 100% of subscribers to the mailing list were told, yes. [15:14:57] Wonderful [15:15:07] Someone has to keep you honest around here. ;) [15:15:16] marktraceur: :-P [15:15:29] 100 %? Did you check bounces? [15:15:52] PROBLEM - RAID on db1051 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [15:16:06] Nemo_bis: Pfft [15:16:21] hissssssss [15:16:25] Nemo_bis: 100% of 0 is 0. [15:16:40] * Reedy high fives James_F [15:18:23] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [15:19:08] Urghhhh testing [15:20:32] who needs tests? [15:21:03] greg-g: Agreed, they're stupid [15:21:55] marktraceur: Merged now! [15:21:57] Syncing oojs-ui [15:22:04] Only to wmf3 [15:22:11] !log marktraceur Synchronized php-1.25wmf3/resources/lib/oojs-ui/: [SWAT] [wmf3] OOjs UI: New pull-through to 837b2f733e to fix a missed dependency (duration: 00m 06s) [15:22:17] Logged the message, Master [15:23:10] Thanks. [15:23:34] James_F: Tested? [15:23:44] marktraceur: Waiting for debug=true to load. Yay bits. [15:24:49] KK [15:25:12] marktraceur: Success. Thanks! [15:25:23] marktraceur: Time to mark as {{done}}. 
:-) [15:26:02] Sweet. [15:26:09] Doing now. :) [15:27:50] <_joe_> !log reimaging mw1114 with HHVM - first server in the API pool; depooling and reinstalling now. [15:27:56] Logged the message, Master [15:30:32] Anyway, I declare SWAT closed [15:30:36] Best of luck Reedy [15:38:07] PROBLEM - puppet last run on mw1155 is CRITICAL: CRITICAL: Puppet has 1 failures [15:42:09] PROBLEM - puppet last run on zirconium is CRITICAL: CRITICAL: Puppet has 20 failures [15:50:09] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.005 second response time [15:56:00] RECOVERY - puppet last run on mw1155 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [15:59:39] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.003 second response time [16:00:00] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [500.0] [16:00:02] bugzilla is buggy [16:00:29] periodic cannot connect to mysql errors [16:03:36] PROBLEM - RAID on mw1114 is CRITICAL: Connection refused by host [16:03:49] PROBLEM - check configured eth on mw1114 is CRITICAL: Connection refused by host [16:03:58] PROBLEM - check if dhclient is running on mw1114 is CRITICAL: Connection refused by host [16:04:17] PROBLEM - check if salt-minion is running on mw1114 is CRITICAL: Connection refused by host [16:04:36] PROBLEM - Apache HTTP on mw1114 is CRITICAL: Connection refused [16:04:37] PROBLEM - DPKG on mw1114 is CRITICAL: Connection refused by host [16:04:37] PROBLEM - nutcracker port on mw1114 is CRITICAL: Connection refused by host [16:04:46] PROBLEM - nutcracker process on mw1114 is CRITICAL: Connection refused by host [16:04:49] PROBLEM - Disk space on mw1114 is CRITICAL: Connection refused by host [16:04:57] PROBLEM - puppet last run on mw1114 is CRITICAL: Connection refused by host [16:05:37] RECOVERY - Apache HTTP on mw1114 is OK: HTTP 
OK: HTTP/1.1 200 OK - 11783 bytes in 0.003 second response time [16:08:15] (03PS1) 10Andrew Bogott: Move some nova scripts into a new class, openstack::adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/166587 [16:08:17] (03PS1) 10Andrew Bogott: Add some dumb but useful scripts for querying nova [puppet] - 10https://gerrit.wikimedia.org/r/166588 [16:08:27] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [16:09:01] (03CR) 10jenkins-bot: [V: 04-1] Move some nova scripts into a new class, openstack::adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/166587 (owner: 10Andrew Bogott) [16:09:05] (03CR) 10jenkins-bot: [V: 04-1] Add some dumb but useful scripts for querying nova [puppet] - 10https://gerrit.wikimedia.org/r/166588 (owner: 10Andrew Bogott) [16:10:36] RECOVERY - RAID on db1050 is OK: OK: optimal, 1 logical, 2 physical [16:10:48] (03PS2) 10Andrew Bogott: Add some dumb but useful scripts for querying nova [puppet] - 10https://gerrit.wikimedia.org/r/166588 [16:10:50] (03PS2) 10Andrew Bogott: Move some nova scripts into a new class, openstack::adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/166587 [16:11:05] (03PS1) 10Giuseppe Lavagetto: mediawiki: do hhvm rendering checks on any webserver [puppet] - 10https://gerrit.wikimedia.org/r/166592 [16:11:32] (03CR) 10jenkins-bot: [V: 04-1] Add some dumb but useful scripts for querying nova [puppet] - 10https://gerrit.wikimedia.org/r/166588 (owner: 10Andrew Bogott) [16:11:45] (03PS2) 10Giuseppe Lavagetto: mediawiki: do hhvm rendering checks on any webserver [puppet] - 10https://gerrit.wikimedia.org/r/166592 [16:13:02] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: do hhvm rendering checks on any webserver [puppet] - 10https://gerrit.wikimedia.org/r/166592 (owner: 10Giuseppe Lavagetto) [16:15:56] PROBLEM - NTP on mw1114 is CRITICAL: NTP CRITICAL: Offset unknown [16:16:54] (03CR) 10Rush: "A caveat of this approach is it is linking the update for the main 
repo's and the updates for sprint. If there is a tag change in puppet " [puppet] - 10https://gerrit.wikimedia.org/r/166406 (owner: 10Christopher Johnson (WMDE)) [16:18:16] RECOVERY - RAID on mw1114 is OK: OK: no RAID installed [16:18:17] RECOVERY - nutcracker process on mw1114 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [16:18:17] RECOVERY - Disk space on mw1114 is OK: DISK OK [16:18:23] RECOVERY - check configured eth on mw1114 is OK: NRPE: Unable to read output [16:18:37] RECOVERY - check if dhclient is running on mw1114 is OK: PROCS OK: 0 processes with command name dhclient [16:18:48] RECOVERY - check if salt-minion is running on mw1114 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:19:16] RECOVERY - nutcracker port on mw1114 is OK: TCP OK - 0.000 second response time on port 11212 [16:19:16] RECOVERY - DPKG on mw1114 is OK: All packages OK [16:19:37] PROBLEM - puppet last run on mw1114 is CRITICAL: CRITICAL: Puppet has 1 failures [16:20:06] RECOVERY - puppet last run on zirconium is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:26:47] (03PS3) 10Andrew Bogott: Add some dumb but useful scripts for querying nova [puppet] - 10https://gerrit.wikimedia.org/r/166588 [16:26:49] (03PS3) 10Andrew Bogott: Move some nova scripts into a new class, openstack::adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/166587 [16:31:26] RECOVERY - NTP on mw1114 is OK: NTP OK: Offset -0.01499176025 secs [16:31:57] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:32:46] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 222968 msg: ocg_render_job_queue 1111 msg (=500 critical) [16:32:56] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 222977 msg: ocg_render_job_queue 1047 msg (=500 critical) [16:32:57] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: 
ocg_job_status 222983 msg: ocg_render_job_queue 1020 msg (=500 critical) [16:33:52] <_joe_> !log repooling mw1114 [16:33:59] Logged the message, Master [16:34:57] PROBLEM - puppet last run on mw1001 is CRITICAL: CRITICAL: Puppet has 1 failures [16:35:46] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 223161 msg: ocg_render_job_queue 0 msg [16:35:56] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 223179 msg: ocg_render_job_queue 0 msg [16:36:09] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 223183 msg: ocg_render_job_queue 0 msg [16:41:42] PROBLEM - puppet last run on zirconium is CRITICAL: CRITICAL: Puppet has 10 failures [16:52:31] (03PS3) 10Reedy: Always use debian packaged texvc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165493 [16:52:53] RECOVERY - puppet last run on mw1001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [16:55:01] (03PS4) 10Reedy: Always use debian packaged texvc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165493 [16:58:02] (03CR) 10Filippo Giunchedi: [C: 031] Always use debian packaged texvc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165493 (owner: 10Reedy) [16:59:51] RECOVERY - puppet last run on zirconium is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [17:00:10] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 654 [17:05:09] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 904 [17:06:44] Hmm [17:06:44] Error: Unknown MySQL server host 'db1001.eqiad.wmnet' (111) [17:06:48] Just got that from bz [17:10:10] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1148 [17:15:12] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1378 [17:20:09] 
PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1606 [17:22:55] (03CR) 10Andrew Bogott: [C: 032] Move some nova scripts into a new class, openstack::adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/166587 (owner: 10Andrew Bogott) [17:25:14] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1492 [17:25:36] (03CR) 10Andrew Bogott: [C: 032] Add some dumb but useful scripts for querying nova [puppet] - 10https://gerrit.wikimedia.org/r/166588 (owner: 10Andrew Bogott) [17:30:09] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1352 [17:35:14] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1305 [17:38:00] ^demon|away: given there's no otto and nik and your idle time, I'm assuming we're skipping the meeting :) [17:40:18] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1552 [17:45:14] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1696 [17:50:14] RECOVERY - check_mysql on lutetium is OK: Uptime: 1799260 Threads: 3 Questions: 18891694 Slow queries: 10545 Opens: 23130 Flush tables: 2 Open tables: 64 Queries per second avg: 10.499 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [17:52:27] !log conntrack full on virt1000 and zirconium, suspected diamond collector runaway [17:52:36] Logged the message, Master [17:57:41] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 524.099976 [17:59:31] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 455.966675 [18:01:37] * Reedy kicks jouncebot [18:02:34] PROBLEM - puppet last 
run on zirconium is CRITICAL: CRITICAL: Puppet has 6 failures [18:03:59] jouncebot: reload [18:09:52] (03PS1) 10Subramanya Sastry: Get betalabs localsettings.js file from deploy repo (just like prod) [puppet] - 10https://gerrit.wikimedia.org/r/166610 [18:09:56] (03CR) 10JanZerebecki: "I don't think it is a good idea to merge changes in operations/puppet.git and then not deploy them. If you never do that linking the deplo" [puppet] - 10https://gerrit.wikimedia.org/r/166406 (owner: 10Christopher Johnson (WMDE)) [18:11:22] (03CR) 10Subramanya Sastry: "Not to be merged till https://gerrit.wikimedia.org/r/#/c/166608 has been reviewed, merged, and deployed to betalabs." [puppet] - 10https://gerrit.wikimedia.org/r/166610 (owner: 10Subramanya Sastry) [18:15:10] !log stop diamond on virt1000 and zirconium to test [18:15:16] Logged the message, Master [18:17:32] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:18:32] (03CR) 10Jgreen: "The diff itself is OK, but we need to bump the package's version ID and make sure that it installs over the earlier package version." 
[software/otrs] - 10https://gerrit.wikimedia.org/r/165472 (https://bugzilla.wikimedia.org/59950) (owner: 10Alex Monk) [18:18:51] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:19:58] (03PS1) 10Reedy: Non wikipedias to 1.25wmf3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166615 [18:20:32] RECOVERY - puppet last run on zirconium is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [18:20:46] !log reedy Purged l10n cache for 1.25wmf1 [18:20:53] Logged the message, Master [18:25:13] (03PS5) 10Spage: Set group for /srv/mediawiki on singlenode mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/79955 (https://bugzilla.wikimedia.org/72046) (owner: 10Mattflaschen) [18:25:20] (03CR) 10jenkins-bot: [V: 04-1] Set group for /srv/mediawiki on singlenode mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/79955 (https://bugzilla.wikimedia.org/72046) (owner: 10Mattflaschen) [18:25:52] uh, is bugzilla dead? [18:25:59] hmm, [18:26:01] works on refresh [18:26:55] (03PS1) 10Giuseppe Lavagetto: apache: disable diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/166617 [18:27:12] haha that's puppet coming around, starting diamond and blowing conntrack on zirconium [18:27:16] (03CR) 10Yuvipanda: "I think we should kill mediawiki singlenode with fire and just use labs-vagrant..." 
[puppet] - 10https://gerrit.wikimedia.org/r/79955 (https://bugzilla.wikimedia.org/72046) (owner: 10Mattflaschen) [18:27:35] (03CR) 10Giuseppe Lavagetto: [C: 032] apache: disable diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/166617 (owner: 10Giuseppe Lavagetto) [18:27:52] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 339.033325 [18:29:11] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 683.533325 [18:29:51] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 491.166656 [18:29:51] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 0 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [18:41:45] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.25wmf3 [18:41:52] Logged the message, Master [18:42:59] (03PS1) 10Reedy: Remove 1.24wmf18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166619 [18:43:31] (03CR) 10Reedy: [C: 032] Non wikipedias to 1.25wmf3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166615 (owner: 10Reedy) [18:43:41] (03Merged) 10jenkins-bot: Non wikipedias to 1.25wmf3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166615 (owner: 10Reedy) [18:44:05] (03CR) 10Reedy: [C: 032] Remove 1.24wmf18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166619 (owner: 10Reedy) [18:44:11] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:44:14] (03Merged) 10jenkins-bot: Remove 1.24wmf18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166619 (owner: 10Reedy) [18:44:19] Reedy: hmm, does wikitech ride the train? 
[18:44:22] Nope [18:44:25] Well, yes [18:44:31] But it doesn't take affect till sync-common [18:44:32] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [18:44:34] hmm [18:45:37] (03PS4) 10Reedy: Allow the default rendering modes on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166410 (owner: 10Physikerwelt) [18:45:41] (03CR) 10Reedy: [C: 032] Allow the default rendering modes on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166410 (owner: 10Physikerwelt) [18:45:49] (03Merged) 10jenkins-bot: Allow the default rendering modes on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166410 (owner: 10Physikerwelt) [18:46:01] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:46:39] (03PS1) 10Andrew Bogott: Another attempt to fix mw-xml.sh for the new wiki setup [puppet] - 10https://gerrit.wikimedia.org/r/166621 [18:47:07] (03PS5) 10Reedy: Always use debian packaged texvc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165493 [18:47:16] (03CR) 10Reedy: [C: 032] Always use debian packaged texvc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165493 (owner: 10Reedy) [18:47:23] (03Merged) 10jenkins-bot: Always use debian packaged texvc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165493 (owner: 10Reedy) [18:47:52] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:48:12] (03PS2) 10Reedy: Add HiDPI PNG variants for 'A Wikimedia Project' logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166417 (https://bugzilla.wikimedia.org/63872) (owner: 10Brion VIBBER) [18:49:20] (03CR) 10Andrew Bogott: [V: 032] Another attempt to fix mw-xml.sh for the new wiki setup [puppet] - 10https://gerrit.wikimedia.org/r/166621 (owner: 10Andrew Bogott) [18:49:26] (03CR) 10Reedy: [C: 032] Add HiDPI PNG variants for 'A Wikimedia Project' logo [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/166417 (https://bugzilla.wikimedia.org/63872) (owner: 10Brion VIBBER) [18:49:33] (03Merged) 10jenkins-bot: Add HiDPI PNG variants for 'A Wikimedia Project' logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166417 (https://bugzilla.wikimedia.org/63872) (owner: 10Brion VIBBER) [18:49:49] (03CR) 10Andrew Bogott: [C: 032] Another attempt to fix mw-xml.sh for the new wiki setup [puppet] - 10https://gerrit.wikimedia.org/r/166621 (owner: 10Andrew Bogott) [18:52:14] !log Purged php-1.24wmf18 from mediawiki-appservers [18:52:19] Logged the message, Master [18:52:54] (03CR) 10Reedy: "Do the DNS scripts still use this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166281 (https://bugzilla.wikimedia.org/43697) (owner: 10TTO) [18:53:16] (03Abandoned) 10Reedy: Update interwiki.cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164723 (owner: 10Hoo man) [18:53:27] oh [18:53:27] doh [18:53:49] (03PS3) 10Reedy: Add several domains to wgCopyUploadsDomains for commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166176 (https://bugzilla.wikimedia.org/71195) (owner: 10Glaisher) [18:53:53] (03CR) 10GWicke: [C: 031] "Looks good to me. This is complemented by https://gerrit.wikimedia.org/r/#/c/166137/." 
[puppet] - 10https://gerrit.wikimedia.org/r/166610 (owner: 10Subramanya Sastry) [18:53:55] (03CR) 10Reedy: [C: 032] Add several domains to wgCopyUploadsDomains for commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166176 (https://bugzilla.wikimedia.org/71195) (owner: 10Glaisher) [18:54:01] (03CR) 10Rush: "I don't disagree, it's more of a public service announcement to the sprint peoples that if they are wanting to push and there is a snafu w" [puppet] - 10https://gerrit.wikimedia.org/r/166406 (owner: 10Christopher Johnson (WMDE)) [18:54:03] (03Merged) 10jenkins-bot: Add several domains to wgCopyUploadsDomains for commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166176 (https://bugzilla.wikimedia.org/71195) (owner: 10Glaisher) [18:55:01] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 747.5 [18:55:30] (03PS2) 10Reedy: Remove $wgCategoryTreeDynamicTag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164015 (owner: 10PleaseStand) [18:55:37] (03CR) 10Reedy: [C: 032] Remove $wgCategoryTreeDynamicTag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164015 (owner: 10PleaseStand) [18:55:42] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 4 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). 
[18:55:44] (03Merged) 10jenkins-bot: Remove $wgCategoryTreeDynamicTag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164015 (owner: 10PleaseStand) [18:56:12] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 668.866638 [18:57:15] (03PS2) 10Reedy: Remove various settings removed in mediawiki/core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164014 (owner: 10PleaseStand) [18:57:22] (03CR) 10Reedy: [C: 032] Remove various settings removed in mediawiki/core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164014 (owner: 10PleaseStand) [18:57:29] (03Merged) 10jenkins-bot: Remove various settings removed in mediawiki/core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164014 (owner: 10PleaseStand) [18:58:46] (03CR) 10Reedy: [C: 04-1] "-1'ing as scheduled for next month" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157478 (https://bugzilla.wikimedia.org/60158) (owner: 10Jforrester) [19:00:01] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 392.5 [19:00:27] (03PS2) 10Reedy: Remove $wgUploadWizardConfig['disableResourceLoader'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164017 (owner: 10PleaseStand) [19:00:38] (03CR) 10Reedy: [C: 032] Remove $wgUploadWizardConfig['disableResourceLoader'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164017 (owner: 10PleaseStand) [19:00:45] (03Merged) 10jenkins-bot: Remove $wgUploadWizardConfig['disableResourceLoader'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164017 (owner: 10PleaseStand) [19:01:27] (03Restored) 10Hoo man: Update interwiki.cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164723 (owner: 10Hoo man) [19:01:55] (03PS2) 10Alex Monk: Znuny4OTRS-WikimediaDTL: Unbreak read-only customer ID [software/otrs] - 10https://gerrit.wikimedia.org/r/165472 (https://bugzilla.wikimedia.org/59950) [19:02:01] (03PS2) 10Hoo man: Update 
interwiki.cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164723 [19:03:10] Reedy: ^ feel free to push out now [19:03:17] can also do it myself once your done [19:03:17] hoo: what's the change? [19:03:20] * you're [19:03:29] https://meta.wikimedia.org/w/index.php?title=Interwiki_map&action=history [19:03:33] last 2 entries [19:03:45] exactly 18 bytes, same size change for the cdb [19:03:50] didn't bother to compare them further [19:04:18] aha, ok [19:04:34] (03PS3) 10Reedy: Update interwiki.cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164723 (owner: 10Hoo man) [19:04:39] (03CR) 10Reedy: [C: 032] Update interwiki.cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164723 (owner: 10Hoo man) [19:04:48] (03Merged) 10jenkins-bot: Update interwiki.cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164723 (owner: 10Hoo man) [19:05:24] Would be nice if gerrit could show size change for binary files [19:05:26] (03PS2) 10Reedy: Remove $wgNoticeRunMessageIndexRebuildJobImmediately [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164016 (owner: 10PleaseStand) [19:05:34] (03CR) 10Reedy: [C: 032] Remove $wgNoticeRunMessageIndexRebuildJobImmediately [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164016 (owner: 10PleaseStand) [19:05:41] (03Merged) 10jenkins-bot: Remove $wgNoticeRunMessageIndexRebuildJobImmediately [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164016 (owner: 10PleaseStand) [19:07:11] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [19:07:38] !log reedy Synchronized images/: (no message) (duration: 00m 14s) [19:07:44] Logged the message, Master [19:07:45] (03PS1) 10Mattflaschen: Add dedicated Flow sandbox for English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166624 [19:08:13] (03CR) 10Mattflaschen: [C: 04-1] "Do not merge yet." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/166624 (owner: 10Mattflaschen) [19:09:33] (03PS4) 10Reedy: Enhanced recent changes: explicitly disable by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/124292 (https://bugzilla.wikimedia.org/35785) (owner: 10Nemo bis) [19:09:37] (03CR) 10Reedy: [C: 032] Enhanced recent changes: explicitly disable by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/124292 (https://bugzilla.wikimedia.org/35785) (owner: 10Nemo bis) [19:09:46] (03Merged) 10jenkins-bot: Enhanced recent changes: explicitly disable by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/124292 (https://bugzilla.wikimedia.org/35785) (owner: 10Nemo bis) [19:11:23] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [19:11:57] (03CR) 10Reedy: "At a quick glance, this needs to wait another week or so" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164013 (owner: 10PleaseStand) [19:13:12] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [19:14:42] (03PS2) 10Reedy: Add "recommended article" and "featured list" badge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166144 (https://bugzilla.wikimedia.org/70268) (owner: 10Bene) [19:15:01] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [19:15:27] (03CR) 10Reedy: [C: 032] Add "recommended article" and "featured list" badge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166144 (https://bugzilla.wikimedia.org/70268) (owner: 10Bene) [19:15:34] (03Merged) 10jenkins-bot: Add "recommended article" and "featured list" badge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166144 (https://bugzilla.wikimedia.org/70268) (owner: 10Bene) [19:16:12] woot [19:16:14] :P [19:16:26] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 17s) [19:16:32] Logged the message, Master 
[19:20:33] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 654.833313 [19:22:03] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 309.366669 [19:24:11] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 825.366638 [19:25:12] (03CR) 10Jgreen: [C: 032 V: 031] Znuny4OTRS-WikimediaDTL: Unbreak read-only customer ID [software/otrs] - 10https://gerrit.wikimedia.org/r/165472 (https://bugzilla.wikimedia.org/59950) (owner: 10Alex Monk) [19:25:26] ^^ varnishkafka: known issue, i'm investigating [19:26:27] (03CR) 10Jgreen: [V: 032] Znuny4OTRS-WikimediaDTL: Unbreak read-only customer ID [software/otrs] - 10https://gerrit.wikimedia.org/r/165472 (https://bugzilla.wikimedia.org/59950) (owner: 10Alex Monk) [19:35:42] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [19:37:14] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [19:39:11] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [19:44:41] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 681.033325 [19:46:11] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 441.799988 [19:48:13] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 610.866638 [19:50:33] ACKNOWLEDGEMENT - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 610.866638 Jeff Gage under investigation [19:50:33] ACKNOWLEDGEMENT - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1624.5 Jeff Gage under 
investigation [19:50:33] ACKNOWLEDGEMENT - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 847.400024 Jeff Gage under investigation [19:53:41] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [19:57:11] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [20:01:12] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [20:11:43] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 647.099976 [20:13:57] ^^ still watching varnishkafka. corresponds with daily traffic peak on cr1-esams. i've shut up the alerts for now. [20:31:27] gwicke: poke [20:33:25] JohnLewis: hi [20:35:11] gwicke: Yuvi said you're the guy to poke about the sca01 instance in beta labs, right? (mathiod) [20:35:24] mathoid *and* citoid [20:35:31] ^ that then :) [20:35:49] I set up the sca01 instance [20:35:56] But ideally gwicke's team would own it longer term [20:36:40] RoanKattouw: it needs to be killed and recreated in short so if gwicke is best to handle it/advise or you want to - either works :) [20:36:40] RoanKattouw: ah, right. so instance is dead, and icinga is complaining, and it doesn't respond to ping or ssh... [20:37:01] JohnLewis: yes, currently Roan is doing all the citoid work & I do the mathoid half; in the longer term the services team will do both [20:37:19] what happened to it? 
[20:37:50] haven't paid attention over the weekend, on Friday it seemed to be healthy [20:38:05] gwicke: We don't know neither does andrewbogott (who advised killing/recreating is best here) [20:38:25] though Andrew would have more knowledge about it than us :) [20:38:34] The puppetization should just work [20:38:40] So you should be able to kill it and recreate it [20:38:57] RoanKattouw: which roles need to be enabled and I'll do it now [20:39:00] modulo perhaps some trebuchet hiccups [20:39:24] JohnLewis: Same ones as the old instance [20:39:32] RoanKattouw: alright :) [20:40:42] did we try a simple reboot yet? [20:41:10] am a bit worried that it could be a hw issue [20:41:27] gwicke: Yeah [20:41:56] produced the same issue which seems rooted in the instance itself as opposed to the processes on it [20:42:12] okay [20:43:19] let's hope a reinstall resolves it & there is no deeper software or hw issue [20:43:32] PROBLEM - puppet last run on amssq59 is CRITICAL: CRITICAL: puppet fail [20:43:35] gwicke: other instances seem ok though [20:44:00] could still be some race condition [21:00:47] gwicke: icinga seems happy with deployment-sca01 now - but unsure if it is actually doing what it should be doing :) [21:01:02] JohnLewis: let me try [21:01:23] gwicke: this is a new sca01 the old one was killed :) [21:02:24] wait, deployment-sca01? 
[21:02:31] I just tested sca1001 [21:02:42] RECOVERY - puppet last run on amssq59 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [21:03:14] afk for a bit, will check back later [21:06:08] gwicke: yes, deployment-sca01, this is on betalabs [21:08:00] k; nm then about everything I said about hw [21:08:37] gwicke: ah, heh, ok [21:12:40] (03PS1) 10ArielGlenn: labs beta: make a class that explicitly has test repo plus ferm ssh rule [puppet] - 10https://gerrit.wikimedia.org/r/166672 [21:15:25] RoanKattouw: gwicke so, on deployment-sca01, zotero fails to start, mathoid user fails to be added, citoid fails to start, etc, and things cascade from there [21:15:50] (03PS2) 10ArielGlenn: labs beta: make a class that explicitly has test repo plus ferm ssh rule [puppet] - 10https://gerrit.wikimedia.org/r/166672 [21:16:39] really thought I would be able to avoid the rebase [21:17:35] YuviPanda: Yeah it needs an initial git-deploy [21:17:40] Welcome to Trebuchet [21:17:41] (03CR) 10ArielGlenn: [C: 032] labs beta: make a class that explicitly has test repo plus ferm ssh rule [puppet] - 10https://gerrit.wikimedia.org/r/166672 (owner: 10ArielGlenn) [21:17:45] Where things don't work unless you poke them five times [21:17:55] RoanKattouw: aaah, that. I've managed to keep myself out of it so far... [21:18:06] I wonder if using trebuchet as the 'provider' in puppet would alleviate that [21:18:14] (actually it's a lot better nowadays, where you only need one git deploy run and it all works perfectly; but still, just running puppet isn't enough) [21:18:38] * apergos lurks [21:18:52] I am interested in hearing about trebuchet issues as they come up [21:19:07] YuviPanda: What you can try is: on deployment-bastion, do cd /srv/deployment/citoid/deploy ; git deploy start ; git deploy sync ; then check that /srv/deployment/citoid has come into existence on sca01, if it has, sudo service citoid restart [21:19:21] JohnLewis: ^ do you want to do that? 
:D I'm juggling two other things as well... [21:19:25] Then check tail /var/log/citoid/main.log to see if it started without an error [21:19:51] YuviPanda: one of my two things is this so sure. [21:20:05] JohnLewis: :D [21:20:07] JohnLewis: thanks! [21:20:23] RoanKattouw: fwiw, vague amounts of alerting at https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=labmon (and are spewed into #wikimedia-qa [21:20:26] ) [21:24:56] RoanKattouw: keep getting '0/1 minions completed checkout' with the d c y n r options [21:25:05] Ugh [21:25:12] Oh, wait, this sca01 instance is new, rihgt? [21:25:18] Then you need to accept its salt keys on the salt master [21:25:35] Urg [21:25:45] JohnLewis: See step 7 of https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/How_code_is_updated#Converting_a_host_to_use_local_puppetmaster_and_salt_master [21:32:33] hm there's three unaccepted keys over there [21:32:40] some people have been forgetting to do that [21:33:08] (I"m on there tweaking the trebuchet test setup, will go to bed shortly though) [21:33:21] apergos: three? I saw two a few moments ago [21:33:38] Unaccepted Keys: [21:33:38] i-0000068c.eqiad.wmflabs [21:33:38] i-0000068d.eqiad.wmflabs [21:33:38] i-000006a7.eqiad.wmflabs [21:33:42] when I just looked [21:33:49] * JohnLewis checks [21:33:59] The latter one should have been accepted [21:34:04] maybe so [21:34:21] nope [21:34:33] apergos: accepted now :) [21:34:47] yup! [21:40:20] YuviPanda / RoanKattouw / gwicke: Citoid started on deployment-sca01! :) [21:40:26] yay [21:40:30] JohnLewis: yay! :) force a puppet run? [21:40:51] YuviPanda: on which host? [21:40:57] JohnLewis: deployment-sca01? [21:41:03] also what does sca expand to, RoanKattouw? [21:41:05] already doing now [21:41:13] cool [21:41:26] YuviPanda: Super Citoid Arena ;) [21:41:33] haha :) [21:41:44] YuviPanda: Services Cluster A or something? I don't know. 
It was named by gwicke and/or akosiaris [21:41:49] aaah [21:41:57] yeah, I remember seeing the raw machines go by [21:42:08] SCA is a Dutch toiletpaper company. ;) [21:42:24] Ehm, Swedisch [21:42:30] But a lot of factories here. :D [21:44:19] ok giving up for the night, it is almost 1 am... [21:44:26] see folks tomorrow [21:47:53] HHVM question at https://www.mediawiki.org/wiki/Topic:S331i65ytjixdtuf#flow-post-s40ct3rq4lvjygd9 that I don't understand well enough to answer. They ask: "Will there, however, perhaps still be a different way to see aggregate statistics collected from all users of instances of HHVM included as part of installations of MediaWiki in the software's final implementation of its use of HHVM?" - Anyone? [21:48:31] (03PS1) 10Yuvipanda: dynamicproxy: Compress text/javascript as well [puppet] - 10https://gerrit.wikimedia.org/r/166676 [21:48:40] andrewbogott: can you merge ^? trivial [21:50:46] (03CR) 10Andrew Bogott: [C: 032] dynamicproxy: Compress text/javascript as well [puppet] - 10https://gerrit.wikimedia.org/r/166676 (owner: 10Yuvipanda) [21:50:54] (03Abandoned) 10Mattflaschen: Add dedicated Flow sandbox for English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166624 (owner: 10Mattflaschen) [21:50:58] andrewbogott: ty! [21:53:29] (03CR) 10Krinkle: "Fixes https://bugzilla.wikimedia.org/show_bug.cgi?id=71995" [puppet] - 10https://gerrit.wikimedia.org/r/166676 (owner: 10Yuvipanda) [22:19:10] paravoid: did I do that right? https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=131129&oldid=131096 [22:23:16] can someone graceful apache on mw1075 ? [22:23:50] aude: APC? [22:24:01] yep [22:24:05] https://bugzilla.wikimedia.org/show_bug.cgi?id=72053 [22:24:12] and the logs [22:24:22] !log Gracefulled apache on mw1075 [22:24:31] Logged the message, Master [22:24:34] what, we can do it? 
[22:24:40] Yep [22:24:44] ok :) [22:24:50] I studied the sudo rules and we have full access to apache2ctl [22:25:13] nice [22:25:36] RoanKattouw, marktraceur, MaxSem: is it ok if I just leave this here? https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=131129&oldid=131096 [22:26:08] arlolra: Yes, as long as you're on line during the SWAT window [22:26:30] yup [22:26:30] The person doing the deployment will ping you around 4pm PDT, and if you don't respond, your patch gets bumped to the next window [22:26:43] I'll be unable to deploy today, btw [22:26:52] hmm, ok [22:27:46] andrewbogott: btw, why can't wikidev do sync common on wikitech? [22:27:57] is that intentional or have you just not gotten around to it... [22:29:08] YuviPanda|zzz: Just because wikitech is fragile and I'm a control freak. [22:29:16] Someday it'll be on its own host... [22:29:17] andrewbogott: hahaha :D ok. [22:29:26] or we'll not be using it... [22:29:34] YuviPanda|zzz, RoanKattouw, sjoerddebruin: SCA is indeed not for the toilet paper, but for for Services Cluster A; main reason for the A bit is that there is already a services cluster for stuff like bugzilla [22:29:36] hmm, I should sign up for helping with horizon when we start doing it [22:29:52] :) [22:30:49] * gwicke follows team sca in the volvo ocean race [22:31:02] http://www.teamsca.com/ [22:38:03] what a shit team ;) [22:47:56] (03PS1) 10John F. Lewis: mailman: add more template translations [puppet] - 10https://gerrit.wikimedia.org/r/166686 [22:50:29] (03CR) 10Dzahn: "this would likely also help with RT #8539 which requests renaming "fiu-vro" to just "vro"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/134962 (owner: 10Reedy) [22:51:33] (03CR) 10Dzahn: [C: 031] "now i agree with "Anything that allows us to easily rename wikis (i.e. 
move subdomains) makes me happy, even if it is slightly evil!"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/134962 (owner: 10Reedy) [22:52:27] (03PS1) 10Calak: Lots of rights changes for huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166687 (https://bugzilla.wikimedia.org/72055) [22:54:02] PROBLEM - puppet last run on rcs1002 is CRITICAL: CRITICAL: Puppet has 1 failures [23:03:54] jouncebot: die [23:04:15] The bot has failed to announce SWAT again [23:05:38] Hm.. webproxy.pmtpa.wmnet / carbon is still up? [23:05:40] Interesting [23:06:19] RoanKattouw: it listens to die? [23:06:25] RoanKattouw: does it restart? [23:06:35] ;D [23:07:52] (03PS1) 10Chad: Configure Elasticsearch for statsd [puppet] - 10https://gerrit.wikimedia.org/r/166690 [23:08:09] (03PS2) 10Calak: Lots of rights changes for huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166687 (https://bugzilla.wikimedia.org/72055) [23:08:12] It doesn't self-restart :( I can start it again I think... [23:09:59] jouncebot: next [23:09:59] In 15 hour(s) and 50 minute(s): SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141015T1500) [23:11:03] OK, so who's doing SWAT [23:11:12] I'm a bit busy right now so I'd prefer not to do it [23:12:31] RECOVERY - puppet last run on rcs1002 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [23:13:52] OK I guess I'm doing it then [23:16:10] arlolra: You here for your SWAT deploy? [23:16:19] yup [23:16:28] Cool [23:18:03] RoanKattouw: I don't know much about this process ... I just want to get these three changes deployed https://git.wikimedia.org/history/mediawiki%2Fextensions%2FCollection.git/HEAD [23:18:16] arlolra: Yeah I'm already on it [23:18:29] You linked one that depended on two others so I'm doing all three [23:18:55] ok, great [23:19:21] is wmf2 right? will wmf3 be rebased with these cahnges? 
[23:19:43] No wmf3 will have to be updated separately [23:19:55] ok, should I be doing that now? [23:20:01] Yes please [23:20:10] Good catch [23:20:17] I assumed these things had been merged before Thursday [23:20:35] no, from today [23:21:14] Right, OK, then yes it does need to go into wmf3 as well [23:23:31] https://gerrit.wikimedia.org/r/#/c/166696/ [23:29:50] Thanks [23:33:43] sorry, crashed [23:39:38] arlolra: OK sorry for the delay, I blame my PM for distracting me with spreadsheets [23:39:44] Now going to actually deploy these things [23:39:56] RoanKattouw: np [23:42:16] !log catrope Synchronized php-1.25wmf2/extensions/Collection: SWAT (duration: 00m 04s) [23:42:21] Logged the message, Master [23:42:24] !log catrope Synchronized php-1.25wmf3/extensions/Collection: SWAT (duration: 00m 08s) [23:42:30] Logged the message, Master [23:42:58] arlolra: ---^^ [23:43:03] Please confirm that that worked [23:44:49] RoanKattouw: looks good [23:44:52] thanks :) [23:44:55] Thanks