[00:38:32] PROBLEM - HHVM rendering on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:39:05] PROBLEM - Apache HTTP on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:42:12] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 70137 bytes in 0.112 second response time
[00:42:43] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.043 second response time
[01:46:49] Operations, Labs, Patch-For-Review: Setting up bulk proxies pointing to a multiwiki mediawiki-vagrant setup running on a labs vm - https://phabricator.wikimedia.org/T132216#2274128 (bd808)
[01:51:59] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: puppet fail
[02:20:38] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:24:17] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.22) (duration: 09m 53s)
[02:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:42:40] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.23) (duration: 08m 40s)
[02:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:47:43] Operations, DBA, Phabricator, Phabricator-Upstream: Project icon files are missing - https://phabricator.wikimedia.org/T128160#2274140 (Danny_B) Perhaps https://secure.phabricator.com/T10907
[02:51:58] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon May 9 02:51:58 UTC 2016 (duration 9m 18s)
[02:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:25:49] Operations, Ops-Access-Requests: Root access for user madhuvishy for servers notebook1001 and 1002 - https://phabricator.wikimedia.org/T134716#2274165 (madhuvishy)
[03:27:27] Operations, Ops-Access-Requests: Grant root access for user madhuvishy for servers notebook1001 and 1002 - https://phabricator.wikimedia.org/T134716#2274178 (madhuvishy)
[03:37:30] Operations, Ops-Access-Requests: Grant root access for user madhuvishy for servers notebook1001 and 1002 - https://phabricator.wikimedia.org/T134716#2274179 (yuvipanda) +1 as the person who requisitioned the notebook servers.
[03:38:52] grrrit-wm: welcome back
[03:42:29] twentyafterfour: thx for post at phab's phab - i was not sure how much is allowed to forward it
[03:43:13] (PS1) Yuvipanda: k8s: Bump up docker version [puppet] - https://gerrit.wikimedia.org/r/287562
[03:43:15] (PS1) Yuvipanda: tools: Use our own k8s pod container [puppet] - https://gerrit.wikimedia.org/r/287563 (https://phabricator.wikimedia.org/T133873)
[03:44:46] (CR) Yuvipanda: [C: 2] k8s: Bump up docker version [puppet] - https://gerrit.wikimedia.org/r/287562 (owner: Yuvipanda)
[03:45:01] (CR) Yuvipanda: [C: 2] tools: Use our own k8s pod container [puppet] - https://gerrit.wikimedia.org/r/287563 (https://phabricator.wikimedia.org/T133873) (owner: Yuvipanda)
[03:52:11] Operations, Ops-Access-Requests: Grant root access for user madhuvishy for servers notebook1001 and 1002 - https://phabricator.wikimedia.org/T134716#2274182 (Nuria) Approved
[03:58:40] (PS1) Yuvipanda: k8s: Update docker systemd unit for latest version [puppet] - https://gerrit.wikimedia.org/r/287564
[04:00:16] (CR) Yuvipanda: [C: 2] k8s: Update docker systemd unit for latest version [puppet] - https://gerrit.wikimedia.org/r/287564 (owner: Yuvipanda)
[04:47:33] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:49:23] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy
[04:56:45] PROBLEM - HHVM rendering on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:57:44] PROBLEM - Apache HTTP on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:59:32] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.043 second response time
[05:00:32] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 70243 bytes in 0.111 second response time
[05:14:12] PROBLEM - HHVM rendering on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:15:04] PROBLEM - Apache HTTP on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:16:53] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.661 second response time
[05:17:43] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 70243 bytes in 0.132 second response time
[05:23:51] <_joe_> what's up with mw1017?
[05:35:24] Operations, Parsoid, RESTBase, Services-next, and 4 others: Make RB ?redirect=false cache-efficient - https://phabricator.wikimedia.org/T134464#2274270 (bearND)
[05:40:53] (PS1) Ladsgroup: Enable CORS for ORES regardless of response code [puppet] - https://gerrit.wikimedia.org/r/287566
[05:41:30] (PS2) Ladsgroup: Enable CORS for ORES regardless of response code [puppet] - https://gerrit.wikimedia.org/r/287566 (https://phabricator.wikimedia.org/T119325)
[06:11:21] !log restarting elasticsearch server elastic1017.eqiad.wmnet (T110236)
[06:11:22] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236
[06:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:28:35] (PS2) Muehlenhoff: Add amire80 to analytics-privatedata-users group [puppet] - https://gerrit.wikimedia.org/r/287179 (https://phabricator.wikimedia.org/T122524)
[06:30:43] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:52] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: puppet fail
[06:31:13] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: puppet fail
[06:32:02] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:22] PROBLEM - puppet last run on db1045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:40] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:51] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:11] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:32] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:50] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:34:19] Operations, Commons, MediaWiki-Page-deletion, media-storage, and 2 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2214186 (Riley_Huntley) https://commons.wikimedia.org/w/index.php?title=File:2_Ke...
[06:34:31] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:51] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:10] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:36:21] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: puppet fail
[06:38:23] (PS1) Ladsgroup: [WIP] wikilabels: add nginx proxy, enable CORS, support caching [puppet] - https://gerrit.wikimedia.org/r/287570
[06:44:03] someone fix https://phabricator.wikimedia.org/T132921
[06:44:09] or face an angry ~riley
[06:48:32] !log powercycling pc2006 (crashed)
[06:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:53:31] RECOVERY - Host pc2006 is UP: PING OK - Packet loss = 0%, RTA = 37.03 ms
[06:55:51] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:56:10] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[06:56:31] RECOVERY - puppet last run on db1045 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:56:31] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:56:40] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:56:51] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:21] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:22] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:22] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:50] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:50] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:51] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:58:21] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:21] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:59:32] Operations, Mail, MediaWiki-Email: Wiki-Mail sent but never delivered - https://phabricator.wikimedia.org/T134674#2274494 (Syum90) Hi, I can also share my email details if needed.
[07:04:55] (CR) Muehlenhoff: [C: 1] "Waiting peroid has passed." [puppet] - https://gerrit.wikimedia.org/r/287179 (https://phabricator.wikimedia.org/T122524) (owner: Muehlenhoff)
[07:06:21] (CR) Muehlenhoff: [C: 2 V: 2] Add amire80 to analytics-privatedata-users group [puppet] - https://gerrit.wikimedia.org/r/287179 (https://phabricator.wikimedia.org/T122524) (owner: Muehlenhoff)
[07:14:38] RECOVERY - Disk space on elastic1016 is OK: DISK OK
[07:16:17] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[07:17:48] (PS3) Yuvipanda: ores: Enable CORS regardless of response code [puppet] - https://gerrit.wikimedia.org/r/287566 (https://phabricator.wikimedia.org/T119325) (owner: Ladsgroup)
[07:18:07] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5984767 keys - replication_delay is 0
[07:18:10] Amir1: around? I can merge it now if you can 'babysit' (aka watch puppet run and make sure nothing breaks)
[07:18:40] hey YuviPanda
[07:18:41] sure
[07:18:49] (PS4) Yuvipanda: ores: Enable CORS regardless of response code [puppet] - https://gerrit.wikimedia.org/r/287566 (https://phabricator.wikimedia.org/T119325) (owner: Ladsgroup)
[07:18:57] thanks :)
[07:18:57] (CR) Yuvipanda: [C: 2 V: 2] ores: Enable CORS regardless of response code [puppet] - https://gerrit.wikimedia.org/r/287566 (https://phabricator.wikimedia.org/T119325) (owner: Ladsgroup)
[07:19:33] YuviPanda: I want to enable icigna irc reports for ores
[07:19:39] it would be great if we can do it
[07:19:47] Operations, Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation, and 4 others: schedule a daily run of ContentTranslation analytics scripts - https://phabricator.wikimedia.org/T122479#2274511 (MoritzMuehlenhoff)
[07:19:50] Operations, Ops-Access-Requests, Analytics, ContentTranslation-Analytics, and 2 others: Add amire80 to analytics-privatedata-users group - https://phabricator.wikimedia.org/T122524#2274508 (MoritzMuehlenhoff) Open>Resolved @Amire80 : I've merged the patch, let me know if you run into any...
[07:19:55] Amir1: what kind of ones?
[07:20:04] Amir1: we already have some icinga pages for ORES
[07:20:07] (in #wikimedia-ai obviously, not here)
[07:20:16] " RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5984767 keys - replication_delay is 0"
[07:20:17] Amir1: they hit this channel, and halfak and my mobile number
[07:20:32] hmm
[07:20:48] YuviPanda: is it possible that these hit #wikimedia-ai?
[07:21:01] Amir1: we can probably have it output in -ai. File a bug? usually mutante knows most about those things
[07:21:17] yeah sure
[07:21:19] thanks :)
[07:22:14] Operations, Ops-Access-Requests: Grant root access for user madhuvishy for servers notebook1001 and 1002 - https://phabricator.wikimedia.org/T134716#2274512 (MoritzMuehlenhoff) p:Triage>Normal
[07:25:01] Operations, Continuous-Integration-Infrastructure, Patch-For-Review: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#1177707 (MoritzMuehlenhoff) @hashar: This bug can be closed, right? It seems T131749 fixed the remaining dependency problems.
[07:35:10] Operations, Commons, MediaWiki-Page-deletion, media-storage, and 2 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2214186 (jcrespo) I wanted to discard data drift between servers (an issue I foun...
[07:38:36] !log restarting elasticsearch server elastic1018.eqiad.wmnet (T110236)
[07:38:37] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236
[07:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:40:17] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:42:07] RECOVERY - cxserver endpoints health on scb1002 is OK: All endpoints are healthy
[07:44:18] YuviPanda: everything works okay
[07:44:27] Amir1: \o/ cool
[07:44:28] CORS is enabled for 400ish errors too
[07:44:46] Amir1: remember this won't translate to production (no nginx there) so you'll have to redo these in some form
[07:45:31] hmm
[07:45:34] thanks for the tip
[07:45:39] I will look into this
[07:45:53] or maybe we don't need since domain is the same in prod
[07:47:55] Operations, Commons, MediaWiki-Page-deletion, media-storage, and 2 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2274551 (jcrespo) From the infrastructure point of view, it looks as if the file...
[07:49:08] PROBLEM - puppet last run on ms-be2004 is CRITICAL: CRITICAL: puppet fail
[07:51:46] Amir1: maybe, yeah. we'll see I guess
[07:55:20] (PS1) Jcrespo: Repool db1070, increase weight of previously depooled servers [mediawiki-config] - https://gerrit.wikimedia.org/r/287571
[07:57:59] (PS2) Jcrespo: Repool db1070, increase weight of previously depooled servers [mediawiki-config] - https://gerrit.wikimedia.org/r/287571
[07:58:56] (CR) Jcrespo: [C: 2] Repool db1070, increase weight of previously depooled servers [mediawiki-config] - https://gerrit.wikimedia.org/r/287571 (owner: Jcrespo)
[07:59:08] (PS2) Jcrespo: Retire db1058 from the service group [mediawiki-config] - https://gerrit.wikimedia.org/r/287224 (https://phabricator.wikimedia.org/T134360)
[07:59:57] (CR) Jcrespo: [C: 2] Retire db1058 from the service group [mediawiki-config] - https://gerrit.wikimedia.org/r/287224 (https://phabricator.wikimedia.org/T134360) (owner: Jcrespo)
[08:07:18] (PS1) Yuvipanda: Add a registry enforcer + tests [software/kubernetes] - https://gerrit.wikimedia.org/r/287572
[08:13:52] !log jynus@tin Synchronized wmf-config/db-codfw.php: Retire db1058 server (duration: 00m 39s)
[08:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:16:02] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1070 with low weight, weight increases, retire db1058 (duration: 00m 30s)
[08:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:16:32] RECOVERY - puppet last run on ms-be2004 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[08:17:14] (PS2) Elukey: Restore basic memcached settings to mc1009 as part of a performance test. [puppet] - https://gerrit.wikimedia.org/r/287237 (https://phabricator.wikimedia.org/T129963)
[08:18:22] !log restarting elasticsearch server elastic1019.eqiad.wmnet (T110236)
[08:18:22] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236
[08:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:20:55] Krenair: I have not gotten another message since we talked. I have however encountered https://phabricator.wikimedia.org/T134729 but that has (most likely) nothing to do with that
[08:35:27] Operations, Datasets-General-or-Unknown, WMDE-Analytics-Engineering, Wikidata: Push dumps.wm.o logs files to stat1002 - https://phabricator.wikimedia.org/T118739#2274666 (Addshore)
[08:37:40] (CR) Addshore: WIP DRAFT WMDE_Analytics module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/269467 (owner: Addshore)
[08:40:31] Operations, Patch-For-Review: Boot time race condition when assembling root raid device on cp1052 - https://phabricator.wikimedia.org/T131961#2274674 (ema) We've chosen `local-top` as the boot stage to sleep at. From initramfs-tools(8): > local-top OR nfs-top After these scripts have been executed, the...
[08:40:52] y'all got to be fking kidding me
[08:40:56] I have pages to delete
[08:41:41] Operations, Performance-Team, Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2274676 (elukey) Updated stats after the weekend: [[ https://phab.wmfusercontent.org/file/data/ewqwcuuog2eqi6qy3qlh/PHID-FILE-n4u3x5rghk5pln5hpcft/mc1009_sta...
[08:41:45] (CR) Addshore: "By the looks of things this should be live but I don't see the logs in the location!" [puppet] - https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) (owner: ArielGlenn)
[08:41:51] Operations, Datasets-General-or-Unknown, WMDE-Analytics-Engineering, Wikidata: Push dumps.wm.o logs files to stat1002 - https://phabricator.wikimedia.org/T118739#2274678 (Addshore) By the looks of things this should be live but I don't see the logs in the location!
[08:42:14] Operations, Commons, MediaWiki-Page-deletion, media-storage, and 2 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2274679 (Riley_Huntley) p:High>Unbreak! https://commons.wikimedia.org/wik...
[08:59:01] Operations, MediaWiki-extensions-CheckUser: Cron job to purge cu_changes - https://phabricator.wikimedia.org/T33454#2274819 (MarcoAurelio)
[09:05:33] Operations, MediaWiki-Cache, MediaWiki-JobQueue, MediaWiki-JobRunner, and 3 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#2275051 (Lydia_Pintscher)
[09:06:06] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 646 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5992535 keys - replication_delay is 646
[09:08:04] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5982901 keys - replication_delay is 5
[09:09:45] (CR) Elukey: "https://puppet- scompiler.wmflabs.org/2703/" [puppet] - https://gerrit.wikimedia.org/r/287237 (https://phabricator.wikimedia.org/T129963) (owner: Elukey)
[09:13:46] Operations, Commons, MediaWiki-Page-deletion, media-storage, and 2 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2275217 (Aklapper) p:Unbreak!>High >>! In T132921#2274679, @Riley_Huntley...
[09:15:08] !log bootstrap restbase2007-a T132976
[09:15:09] T132976: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976
[09:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:27:58] !log restarting elasticsearch server elastic1020.eqiad.wmnet (T110236), includes JDK upgrade
[09:27:59] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236
[09:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:32:59] Operations, DBA, Performance, RfC, codfw-rollout: [RFC] improve parsercache replication and sharding handling - https://phabricator.wikimedia.org/T133523#2275302 (jcrespo) All of this is good, but realistically I would be happy with a couple of fixes for now: 1) Server connections failures s...
[09:38:25] (PS1) Elukey: Remove duplicate of 'lru_crawler' in the mc[12]009 memcached configs. [puppet] - https://gerrit.wikimedia.org/r/287582 (https://phabricator.wikimedia.org/T129963)
[09:39:45] (PS8) Filippo Giunchedi: graphite: port to jessie/systemd [puppet] - https://gerrit.wikimedia.org/r/211685 (https://phabricator.wikimedia.org/T132717)
[09:39:51] (CR) Filippo Giunchedi: [C: 2 V: 2] graphite: port to jessie/systemd [puppet] - https://gerrit.wikimedia.org/r/211685 (https://phabricator.wikimedia.org/T132717) (owner: Filippo Giunchedi)
[09:39:57] (PS1) Muehlenhoff: Extend the imagemagick blacklist [puppet] - https://gerrit.wikimedia.org/r/287584
[09:40:19] (PS2) Filippo Giunchedi: statsite: port to jessie/systemd [puppet] - https://gerrit.wikimedia.org/r/284871
[09:40:29] (CR) Filippo Giunchedi: [C: 2 V: 2] statsite: port to jessie/systemd [puppet] - https://gerrit.wikimedia.org/r/284871 (owner: Filippo Giunchedi)
[09:40:51] (PS1) Jcrespo: Increase db1070 weight after repooling [mediawiki-config] - https://gerrit.wikimedia.org/r/287586
[09:44:04] (CR) Jcrespo: [C: 2] Increase db1070 weight after repooling [mediawiki-config] - https://gerrit.wikimedia.org/r/287586 (owner: Jcrespo)
[09:45:23] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Increase db1070 weight after repooling (duration: 00m 38s)
[09:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:45:48] (CR) Giuseppe Lavagetto: [C: 1] Remove duplicate of 'lru_crawler' in the mc[12]009 memcached configs. [puppet] - https://gerrit.wikimedia.org/r/287582 (https://phabricator.wikimedia.org/T129963) (owner: Elukey)
[09:48:12] (PS2) Elukey: Remove duplicate of 'lru_crawler' in the mc[12]009 memcached configs. [puppet] - https://gerrit.wikimedia.org/r/287582 (https://phabricator.wikimedia.org/T129963)
[09:51:57] Operations, Continuous-Integration-Infrastructure, Patch-For-Review: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#2275316 (hashar) We still ship `php5-fss` on Jessie (and Trusty) via `mediawiki::packages::php5`. Apparently it is not nee...
[09:53:19] (CR) Elukey: [C: 2] Remove duplicate of 'lru_crawler' in the mc[12]009 memcached configs. [puppet] - https://gerrit.wikimedia.org/r/287582 (https://phabricator.wikimedia.org/T129963) (owner: Elukey)
[09:55:09] (PS2) Muehlenhoff: Extend the imagemagick blacklist [puppet] - https://gerrit.wikimedia.org/r/287584
[09:56:06] !log memcached restarted on mc1009, now running with slab_reassign,maxconns_fast,hash_algorithm=murmur3,slab_automove,lru_crawler,lru_maintainer (T129963, performance experiment)
[09:56:06] T129963: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963
[09:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:57:40] (PS2) Yuvipanda: Add a registry enforcer + tests [software/kubernetes] - https://gerrit.wikimedia.org/r/287572
[09:58:50] Operations, Commons, MediaWiki-Page-deletion, media-storage, and 2 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2275325 (jcrespo) >>! In T132921#2274551, @jcrespo wrote: > From the infrastructu...
[09:59:56] Operations, ops-eqiad, Patch-For-Review: Decommission broken db1058 - https://phabricator.wikimedia.org/T134360#2275326 (jcrespo) [x] Confirm out of cluster/service group
[10:03:49] Blocked-on-Operations, Continuous-Integration-Infrastructure, Patch-For-Review: Disable HHVM fcgi server on CI slaves - https://phabricator.wikimedia.org/T126594#2275327 (hashar) Open>Resolved a:hashar Thank you @akosiaris . I have confirmed HHVM is gone from permanent CI slaves and from...
[10:03:50] (PS1) Ladsgroup: ores: Send icigna report to IRC [puppet] - https://gerrit.wikimedia.org/r/287590
[10:04:52] (CR) Muehlenhoff: [C: 2 V: 2] Extend the imagemagick blacklist [puppet] - https://gerrit.wikimedia.org/r/287584 (owner: Muehlenhoff)
[10:05:14] (PS3) Yuvipanda: Add a registry enforcer + tests [software/kubernetes] - https://gerrit.wikimedia.org/r/287572 (https://phabricator.wikimedia.org/T133515)
[10:05:19] (PS2) Ladsgroup: ores: Send icigna report to IRC [puppet] - https://gerrit.wikimedia.org/r/287590 (https://phabricator.wikimedia.org/T134726)
[10:07:39] PROBLEM - graphite.wikimedia.org on graphite1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 104 bytes in 0.005 second response time
[10:08:12] (CR) Giuseppe Lavagetto: [C: -1] "A few minor comments, I'll reserve to do a better review once I read about admission controllers a bit better." [software/kubernetes] - https://gerrit.wikimedia.org/r/287572 (https://phabricator.wikimedia.org/T133515) (owner: Yuvipanda)
[10:08:39] (CR) Giuseppe Lavagetto: "Comments are here!" (3 comments) [software/kubernetes] - https://gerrit.wikimedia.org/r/287572 (https://phabricator.wikimedia.org/T133515) (owner: Yuvipanda)
[10:11:38] (CR) Yuvipanda: Add a registry enforcer + tests (3 comments) [software/kubernetes] - https://gerrit.wikimedia.org/r/287572 (https://phabricator.wikimedia.org/T133515) (owner: Yuvipanda)
[10:11:40] (PS1) Jcrespo: Remove (almost) all references to db1058 on puppet [puppet] - https://gerrit.wikimedia.org/r/287591 (https://phabricator.wikimedia.org/T134360)
[10:12:27] (PS2) Jcrespo: Remove (almost) all references to db1058 on puppet [puppet] - https://gerrit.wikimedia.org/r/287591 (https://phabricator.wikimedia.org/T134360)
[10:13:29] (CR) Jcrespo: [C: 2] Remove (almost) all references to db1058 on puppet [puppet] - https://gerrit.wikimedia.org/r/287591 (https://phabricator.wikimedia.org/T134360) (owner: Jcrespo)
[10:16:51] (PS1) Jcrespo: Remove db1058 entries [dns] - https://gerrit.wikimedia.org/r/287593 (https://phabricator.wikimedia.org/T134360)
[10:18:21] ACKNOWLEDGEMENT - Restbase root url on restbase2007 is CRITICAL: Connection refused Filippo Giunchedi bootstrap
[10:18:23] ACKNOWLEDGEMENT - cassandra-a CQL 10.192.16.176:9042 on restbase2007 is CRITICAL: Connection refused Filippo Giunchedi bootstrap
[10:18:23] ACKNOWLEDGEMENT - restbase endpoints health on restbase2007 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.16.175, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Filippo Giunchedi bootstrap
[10:19:11] ACKNOWLEDGEMENT - graphite.wikimedia.org on graphite1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 104 bytes in 0.002 second response time Filippo Giunchedi jessie conversion
[10:21:55] !log restarted eventlogging on eventlogging1001 for security upgrades
[10:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:22:24] (PS1) DCausse: Revert "Revert "Merge "Bump CirrusSearchRequestSet rev to 121456865906""" [mediawiki-config] - https://gerrit.wikimedia.org/r/287594
[10:22:43] (CR) jenkins-bot: [V: -1] Revert "Revert "Merge "Bump CirrusSearchRequestSet rev to 121456865906""" [mediawiki-config] - https://gerrit.wikimedia.org/r/287594 (owner: DCausse)
[10:23:26] (Abandoned) DCausse: Revert "Revert "Merge "Bump CirrusSearchRequestSet rev to 121456865906""" [mediawiki-config] - https://gerrit.wikimedia.org/r/287594 (owner: DCausse)
[10:28:49] !log general decommission of db1058 (puppet, salt, etc.)
[10:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:29:15] (PS1) ArielGlenn: length max sleep time between jobs [dumps] - https://gerrit.wikimedia.org/r/287595
[10:31:38] https://phabricator.wikimedia.org/P3019
[10:36:28] Operations, ops-eqiad, Patch-For-Review: Decommission broken db1058 - https://phabricator.wikimedia.org/T134360#2275400 (jcrespo) @Cmjohnson I have removed it from "mediawiki" and "puppet", dhcp, salt, puppet certs, neon. I have not removed it from netboot/preseed as a range is used and name should n...
[10:40:58] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:41:34] (PS2) ArielGlenn: lengthen max sleep time between jobs and randomize [dumps] - https://gerrit.wikimedia.org/r/287595
[10:42:48] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy
[10:51:20] (CR) ArielGlenn: [C: 2] lengthen max sleep time between jobs and randomize [dumps] - https://gerrit.wikimedia.org/r/287595 (owner: ArielGlenn)
[10:54:51] !log restarting elasticsearch server elastic1021.eqiad.wmnet (T110236), includes JDK upgrade
[10:54:51] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236
[10:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:58:57] PROBLEM - puppet last run on mw2157 is CRITICAL: CRITICAL: puppet fail
[11:02:15] !log restarting gitblit for java update
[11:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:12:32] !log restarting archiva on titanium for java update
[11:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:17:47] !log restarting elasticsearch server elastic1022.eqiad.wmnet (T110236), includes JDK upgrade
[11:17:47] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236
[11:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:20:06] (PS4) Yuvipanda: Add a registry enforcer + tests [software/kubernetes] - https://gerrit.wikimedia.org/r/287572
[11:20:23] _joe_: ^ additional test case
[11:21:39] fixing the wording on error now
[11:21:54] (PS5) Yuvipanda: Add a registry enforcer + tests [software/kubernetes] - https://gerrit.wikimedia.org/r/287572
[11:24:38] RECOVERY - puppet last run on mw2157 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[11:27:26] !log restarting Jenkins
[11:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:28:08] is there an easy way to see the changes for a given tag across all deployed extensions? i.e. I'm trying to see what changed in wmf.23 across the board
[11:30:09] RECOVERY - graphite.wikimedia.org on graphite1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.107 second response time
[11:32:39] !log rolling restart of ocg for openssl update
[11:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:38:47] (PS1) Gehel: WIP - Upgrade osm2pgsql to 0.90.0 [puppet] - https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423)
[11:39:05] (PS1) Filippo Giunchedi: graphite: add graphite2002.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/287601
[11:46:11] Operations, Discovery, Discovery-Search-Sprint, Elasticsearch, Patch-For-Review: Publish "pending_tasks" count from Elastic search cluster to graphite - https://phabricator.wikimedia.org/T134240#2275647 (Gehel) I wanted to have a look at the graph before thinking about alerts. Looking at it n...
[11:48:30] Operations, Parsing-Team, Services, Mobile-Content-Service: ChangeProp / RESTBase / Parsoid outage 2016-05-05 - https://phabricator.wikimedia.org/T134537#2275661 (mobrovac) The full incident report is available at [wikitech:20160505-ChangeProp_RESTBase_Parsoid](https://wikitech.wikimedia.org/wiki...
[11:51:55] Operations, Discovery, Discovery-Search-Backlog, Elasticsearch: Improve Elasticsearch icinga alerting - https://phabricator.wikimedia.org/T133844#2275662 (Gehel) I'd say this is fairly low priority. Elasticsearch has good monitoring, it is not urgent to improve it significantly. I'd like keep tha...
[11:52:07] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 07Elasticsearch: Improve Elasticsearch icinga alerting - https://phabricator.wikimedia.org/T133844#2275663 (10Gehel) p:05Normal>03Low [11:52:23] (03PS1) 10Faidon Liambotis: mediawiki: remove all php5-fss references [puppet] - 10https://gerrit.wikimedia.org/r/287603 (https://phabricator.wikimedia.org/T95002) [11:53:50] Krinkle: ping [12:00:31] paravoid: pong [12:00:41] * Krinkle is catching up with last 7 days of activity [12:01:12] hey :) [12:01:20] https://phabricator.wikimedia.org/T133330 is UBN [12:01:24] and shows no progress [12:03:22] !log mwscript initSiteStats.php --wiki iawiki --update (T134749) [12:03:22] T134749: Fix statistics count on iawiki - https://phabricator.wikimedia.org/T134749 [12:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:06:22] paravoid: hello! Last wednesday you were looking to catch hostnames typo. I have a patch to change the typo matcher to use "grep -E" , should let one add some extended regular expressions to catch invalid hostnames [12:06:36] paravoid: OK. checking. I was on vacation for 12 (7+2+5) of those 14 days. [12:06:46] gotta escape one entry in /typos though https://gerrit.wikimedia.org/r/#/c/286938/2/typos,cm [12:07:19] Krinkle: yeah, I figured :) [12:07:57] hashar: cool, thank you so much! are you adding those hostname typos as well? [12:08:05] not yet [12:08:21] made that a very simple change which is easy to review merge [12:08:25] (03PS1) 10Elukey: Reserve extra IP addresses for the new AQS hosts. 
[dns] - 10https://gerrit.wikimedia.org/r/287605 (https://phabricator.wikimedia.org/T133785) [12:08:33] so I can update the Jenkins job to switch from fgrep to grep -E ( https://gerrit.wikimedia.org/r/#/c/286937/2/jjb/misc.yaml,cm ) [12:08:40] then can look at adding moaaar regex [12:09:17] I tested a few you have suggested on the task [12:09:19] looks legit [12:09:28] eg https://phabricator.wikimedia.org/T133047#2264567 [12:14:46] so essentially we can merge in https://gerrit.wikimedia.org/r/#/c/286938/2/typos,cm then I will update the CI job [12:14:51] and from there one can amend /typos as needed [12:16:22] great :) [12:32:19] !log restarting elasticsearch server elastic1023.eqiad.wmnet (T110236), includes JDK upgrade [12:32:20] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [12:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:37:38] paravoid: could you land the puppet part? [12:41:09] (03PS1) 10Elukey: Add aqs100[345] DCHP configuration. 
[puppet] - 10https://gerrit.wikimedia.org/r/287607 (https://phabricator.wikimedia.org/T133785) [12:41:51] (03PS3) 10Faidon Liambotis: Adjust /typos to use extended regular expressions [puppet] - 10https://gerrit.wikimedia.org/r/286938 (https://phabricator.wikimedia.org/T133047) (owner: 10Hashar) [12:41:58] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Adjust /typos to use extended regular expressions [puppet] - 10https://gerrit.wikimedia.org/r/286938 (https://phabricator.wikimedia.org/T133047) (owner: 10Hashar) [12:42:06] near [12:42:09] neat [12:43:28] hashar: (not related to typos) https://gerrit.wikimedia.org/r/#/c/287603/ [12:43:44] !log Updating Jenkins job operations-puppet-typos to use extended regular expressions when reading /typos ( T133047 ) [12:43:45] T133047: Write a test to check for clearly bogus hostnames - https://phabricator.wikimedia.org/T133047 [12:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:45:18] 06Operations, 10Continuous-Integration-Config, 06Release-Engineering-Team, 13Patch-For-Review: Write a test to check for clearly bogus hostnames - https://phabricator.wikimedia.org/T133047#2275809 (10hashar) One can add more extended regular expression to the `/typos` file. Eg the ones Faidon proposed such... [12:48:35] (03CR) 10Hashar: "In CI we still have Precise slaves which are used to run Zend 5.3, and we most probably want to keep php-fss there :( So I would keep it " [puppet] - 10https://gerrit.wikimedia.org/r/287603 (https://phabricator.wikimedia.org/T95002) (owner: 10Faidon Liambotis) [12:49:15] paravoid: CI still has Zend 5.3 on Precise :( [12:49:38] why? [12:49:59] precise is 4 years old now, what's the point of testing against that? [12:50:06] that is to cover MediaWiki LTS 1.23 which is still claiming Zend 5.3 [12:50:20] and thus to prevent backport of patches that are not valid 5.3 syntax [12:50:44] supported until May 2017? 
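[editor's note] The switch from `fgrep` to `grep -E` discussed above is why one existing `/typos` entry had to be escaped: a fixed-string matcher treats `.` literally, while a regular-expression matcher treats it as "any character". The following is a minimal sketch of that difference, using Python's `re` module as a stand-in for `grep` (its syntax is close to, but not identical to, POSIX extended regular expressions). The entry and the sample lines are hypothetical and are not taken from the real `/typos` file.

```python
import re


def fixed_match(pattern, line):
    """fgrep-style check: the pattern is a literal substring."""
    return pattern in line


def regex_match(pattern, line):
    """grep -E-style check: the pattern is a regex searched anywhere in the line."""
    return re.search(pattern, line) is not None


# Hypothetical entry and input lines, for illustration only.
entry = "eqiad.wmflabs.org"
typo_line = "host: foo.eqiad.wmflabs.org"
clean_line = "host: foo.eqiad-wmflabs-org"

# As a fixed string the dots are literal, so only typo_line matches.
print(fixed_match(entry, typo_line), fixed_match(entry, clean_line))

# As a regex an unescaped "." matches any character, so both lines match.
print(regex_match(entry, typo_line), regex_match(entry, clean_line))

# Escaping the entry restores the literal behaviour under regex matching.
escaped = re.escape(entry)
print(regex_match(escaped, typo_line), regex_match(escaped, clean_line))
```

Once the literal entries are escaped, new entries can use alternation, character classes and repetition, which is what the proposed hostname checks rely on.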
[12:50:46] that is till we get Zend 5.3 on Jessie somehow or 1.23 is EOL (in May 2017) [12:51:10] ok [12:51:15] Mw 1.25 is gone in June 2016. 1.26 till November 2016 [12:51:19] that is really a pity [12:51:20] in any case, fss is just is just an optimization [12:51:29] and I doubt anyone but us ever used it [12:51:49] yeah maybe it is not a big deal to have it gone [12:52:01] I am not sure it has much impact on the phpunit runtime anyway [12:53:09] paravoid: lets drop it [12:53:22] !log restarting elasticsearch server elastic1024.eqiad.wmnet (T110236), includes JDK upgrade [12:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:53:48] (03CR) 10Hashar: [C: 031] "Lets drop it. Even if we still run some jobs with Zend 5.3, lack of php-fss is not going to cause the PHPUnit job to fail." [puppet] - 10https://gerrit.wikimedia.org/r/287603 (https://phabricator.wikimedia.org/T95002) (owner: 10Faidon Liambotis) [12:53:55] one less tech debt to handle this way [12:54:20] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [12:54:30] (03PS1) 10Filippo Giunchedi: graphite: export /var/lib/carbon via rsync [puppet] - 10https://gerrit.wikimedia.org/r/287608 [12:54:45] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2275859 (10elukey) @Cmjohnson: tried to file a code review for the DHCP config, not sure if correct though! 
[12:54:54] hashar: nod :) [12:55:00] !log Enabled a setting in Jenkins for T132895 [12:55:07] security bug [12:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:55:30] hashar: also note that php5-fss is in a strange state legally/license wise [12:57:53] yeah reading the original README that is a GPL vs PHP License incompatibility [12:57:55] neat [12:58:36] so you could not build php with fss embedded but could still link to it maybe [13:00:28] PROBLEM - NTP on ganeti1003 is CRITICAL: NTP CRITICAL: No response from NTP server [13:00:57] (03PS2) 10Filippo Giunchedi: graphite: add graphite2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/287601 [13:01:04] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: add graphite2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/287601 (owner: 10Filippo Giunchedi) [13:07:04] !log restarting nginx on carbon/apt.wikimedia.org to pick up openssl update [13:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:09:16] (03PS2) 10Gehel: WIP - Upgrade osm2pgsql to 0.90.0 [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) [13:11:38] 06Operations, 10ops-codfw, 10DBA: Decommission es2001-es2010 - https://phabricator.wikimedia.org/T134755#2275897 (10jcrespo) [13:11:59] 06Operations, 10ops-codfw, 10DBA: Decommission es2001-es2010 - https://phabricator.wikimedia.org/T134755#2275912 (10jcrespo) a:03jcrespo [13:12:19] !log restarting hadoop java daemons for Java upgrades on analytics102X and analytics 103X hosts [13:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:13:18] (03PS1) 10Jcrespo: [WIP]Remove es2001-es2010 from production puppet [puppet] - 10https://gerrit.wikimedia.org/r/287612 (https://phabricator.wikimedia.org/T134755) [13:13:38] (03PS2) 10Filippo Giunchedi: graphite: export /var/lib/carbon via rsync [puppet] - 
10https://gerrit.wikimedia.org/r/287608 [13:13:55] (03PS1) 10Alexandros Kosiaris: servermon: Inform puppet gunicorn has no status command [puppet] - 10https://gerrit.wikimedia.org/r/287613 [13:14:27] !log restarting elasticsearch server elastic1025.eqiad.wmnet (T110236), includes JDK upgrade [13:14:28] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [13:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:20:23] (03PS2) 10Jcrespo: [WIP]Remove es2001-es2010 from production puppet [puppet] - 10https://gerrit.wikimedia.org/r/287612 (https://phabricator.wikimedia.org/T134755) [13:24:19] (03PS1) 10Hashar: typos: validate hostname number based on DC [puppet] - 10https://gerrit.wikimedia.org/r/287614 (https://phabricator.wikimedia.org/T133047) [13:25:14] (03PS1) 10Filippo Giunchedi: install_server: install graphite2002 with jessie [puppet] - 10https://gerrit.wikimedia.org/r/287615 [13:25:48] (03CR) 10jenkins-bot: [V: 04-1] typos: validate hostname number based on DC [puppet] - 10https://gerrit.wikimedia.org/r/287614 (https://phabricator.wikimedia.org/T133047) (owner: 10Hashar) [13:27:11] (03PS1) 10Muehlenhoff: Add salt grain for install1001 (for debdeploy) [puppet] - 10https://gerrit.wikimedia.org/r/287616 [13:27:28] (03CR) 10Ottomata: [C: 031] Add aqs100[345] DCHP configuration. 
[puppet] - 10https://gerrit.wikimedia.org/r/287607 (https://phabricator.wikimedia.org/T133785) (owner: 10Elukey) [13:27:52] !log restarting salt-master on neodymium to pick up openssl update [13:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:28:32] 06Operations, 10Traffic, 10Wikidata: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2275960 (10BBlack) [13:29:15] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] install_server: install graphite2002 with jessie [puppet] - 10https://gerrit.wikimedia.org/r/287615 (owner: 10Filippo Giunchedi) [13:29:36] 06Operations, 10Phabricator: Set up Yubikey support in Phabricator - https://phabricator.wikimedia.org/T134672#2275962 (10MoritzMuehlenhoff) p:05Triage>03Normal [13:35:05] (03CR) 10Alexandros Kosiaris: [C: 031] "Premise looks good. As for testing... we can use the osm labsdbs and see if the synchronization keeps working. Reimporting 450+GBs however" [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) (owner: 10Gehel) [13:35:14] (03CR) 10Hashar: "The failure is in the kafka README file." 
[puppet] - 10https://gerrit.wikimedia.org/r/287614 (https://phabricator.wikimedia.org/T133047) (owner: 10Hashar) [13:37:39] !log restarting elasticsearch server elastic1026.eqiad.wmnet (T110236), includes JDK upgrade [13:37:39] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [13:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:38:34] (03PS2) 10Alexandros Kosiaris: servermon: Inform puppet gunicorn has no status command [puppet] - 10https://gerrit.wikimedia.org/r/287613 [13:38:40] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] servermon: Inform puppet gunicorn has no status command [puppet] - 10https://gerrit.wikimedia.org/r/287613 (owner: 10Alexandros Kosiaris) [13:44:10] 06Operations, 10Beta-Cluster-Infrastructure, 10Deployment-Systems, 13Patch-For-Review, 03Scap3: Automate the generation deployment keys (keyholder-managed ssh keys) - https://phabricator.wikimedia.org/T133211#2275973 (10faidon) I think having a flat service deployment model at the moment is fine, both be... [13:46:29] 06Operations, 10Traffic, 10Wikidata: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2275985 (10BBlack) So, as it turns out, this is a general varnishd bug in our specific varnishd build. For purposes of this bug, our varnishd code is essentially 3.0.7... 
[13:46:54] 06Operations, 10Traffic, 13Patch-For-Review: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501#2275988 (10BBlack) [13:46:58] 06Operations, 06Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Wikidata Query Service REST endpoint returns truncated results - https://phabricator.wikimedia.org/T133490#2233245 (10BBlack) [13:47:48] 06Operations, 06Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Wikidata Query Service REST endpoint returns truncated results - https://phabricator.wikimedia.org/T133490#2233245 (10BBlack) We now have some understanding of the mechanism of this bug ( T133866#2275985 ). It should go away in... [13:54:18] 06Operations, 10Traffic, 10Wikidata: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2276000 (10Anomie) [13:55:10] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2276001 (10faidon) For what it's worth, when trying to switch our mirror on Launchpad to https: > The URI scheme "https" is not allowed. Only URIs with the f... [13:55:43] (03PS1) 10Mobrovac: Change-prop: no automatic restarts [puppet] - 10https://gerrit.wikimedia.org/r/287620 [13:59:03] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2276008 (10elukey) Stats from mc1009 after the restart, the odd behavior seems reproducible: [[ https://phab.wmfusercontent.org/file/data/g427qypgh3rxfxbiuwaa/... [14:00:56] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2276015 (10mobrovac) [14:01:37] (03PS1) 10Elukey: Remove testing parameters/settings from mc1009's memcached. 
[puppet] - 10https://gerrit.wikimedia.org/r/287621 (https://phabricator.wikimedia.org/T129963) [14:02:48] (03CR) 10Elukey: [C: 032] Remove testing parameters/settings from mc1009's memcached. [puppet] - 10https://gerrit.wikimedia.org/r/287621 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [14:02:55] (03PS2) 10Elukey: Remove testing parameters/settings from mc1009's memcached. [puppet] - 10https://gerrit.wikimedia.org/r/287621 (https://phabricator.wikimedia.org/T129963) [14:03:55] heya akosiaris, yt? q about apt updates for confluent [14:05:50] (03PS6) 10Addshore: WIP DRAFT WMDE_Analytics module [puppet] - 10https://gerrit.wikimedia.org/r/269467 [14:06:56] !log restarting elasticsearch server elastic1027.eqiad.wmnet (T110236), includes JDK upgrade [14:06:56] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [14:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:07:46] (03CR) 10Mobrovac: Text VCL: RB ?redirect=false optimization (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/287104 (https://phabricator.wikimedia.org/T134464) (owner: 10BBlack) [14:09:02] !log memcached restarted on mc1009 with only slab_reassign set (T129963) [14:09:03] T129963: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963 [14:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:10:43] ottomata: yes I am around [14:11:30] <_joe_> elukey: let's see how this goes [14:11:53] (03PS3) 10Ladsgroup: ores: Send icigna report to IRC [puppet] - 10https://gerrit.wikimedia.org/r/287590 (https://phabricator.wikimedia.org/T134726) [14:13:53] (03CR) 10Filippo Giunchedi: [C: 031] Reserve extra IP addresses for the new AQS hosts. 
[dns] - 10https://gerrit.wikimedia.org/r/287605 (https://phabricator.wikimedia.org/T133785) (owner: 10Elukey) [14:14:07] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 (or 0.10?) - https://phabricator.wikimedia.org/T121562#2276061 (10Ottomata) [14:14:49] (03PS1) 10BBlack: Remove X-Pass-Stream support from cache_misc size+stream VCL [puppet] - 10https://gerrit.wikimedia.org/r/287623 (https://phabricator.wikimedia.org/T133866) [14:14:51] (03PS1) 10BBlack: Revert "text VCL: do_stream when creating hit-for-pass" [puppet] - 10https://gerrit.wikimedia.org/r/287624 (https://phabricator.wikimedia.org/T133866) [14:14:53] (03PS1) 10BBlack: Revert "Varnish: protect against external streampass header setting" [puppet] - 10https://gerrit.wikimedia.org/r/287625 (https://phabricator.wikimedia.org/T133866) [14:14:55] (03PS1) 10BBlack: Revert "Varnish: stream all pass traffic" [puppet] - 10https://gerrit.wikimedia.org/r/287626 (https://phabricator.wikimedia.org/T133866) [14:15:02] (03CR) 10Alexandros Kosiaris: "Not against this per se, but we need somehow to point out why we want these things. Like a comment/linked task (or both)" [puppet] - 10https://gerrit.wikimedia.org/r/287620 (owner: 10Mobrovac) [14:15:36] (03PS1) 10Ottomata: Add confluent mirror to get Kafka 0.9 in apt [puppet] - 10https://gerrit.wikimedia.org/r/287627 (https://phabricator.wikimedia.org/T121562) [14:15:45] (03CR) 10BBlack: [C: 032 V: 032] Remove X-Pass-Stream support from cache_misc size+stream VCL [puppet] - 10https://gerrit.wikimedia.org/r/287623 (https://phabricator.wikimedia.org/T133866) (owner: 10BBlack) [14:15:53] akosiaris: https://gerrit.wikimedia.org/r/287627 [14:15:58] (03CR) 10Hashar: [C: 031] "Cherry picked on CI puppet master. Confirmed File[/srv/localhost-worker] is still managed by puppet." 
[puppet] - 10https://gerrit.wikimedia.org/r/286869 (owner: 10Hashar) [14:16:00] (03CR) 10BBlack: [C: 032 V: 032] Revert "text VCL: do_stream when creating hit-for-pass" [puppet] - 10https://gerrit.wikimedia.org/r/287624 (https://phabricator.wikimedia.org/T133866) (owner: 10BBlack) [14:16:02] was going to ask about VerifyRelease, but i *think* i figured that out [14:16:06] does that look right? [14:16:14] (03CR) 10BBlack: [C: 032 V: 032] Revert "Varnish: protect against external streampass header setting" [puppet] - 10https://gerrit.wikimedia.org/r/287625 (https://phabricator.wikimedia.org/T133866) (owner: 10BBlack) [14:16:26] (03CR) 10BBlack: [C: 032 V: 032] Revert "Varnish: stream all pass traffic" [puppet] - 10https://gerrit.wikimedia.org/r/287626 (https://phabricator.wikimedia.org/T133866) (owner: 10BBlack) [14:19:39] _joe_ I am already seeing evictions and 58 slabs as before [14:22:44] (03PS1) 10Muehlenhoff: Update to 4.4.9 [debs/linux44] - 10https://gerrit.wikimedia.org/r/287628 [14:23:55] !log restarting hadoop java daemons for Java upgrades on analytics104X and analytics105X hosts [14:24:01] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 620 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5996050 keys - replication_delay is 620 [14:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:26:02] (03PS1) 10BBlack: cache_misc: remove all do_stream=true [puppet] - 10https://gerrit.wikimedia.org/r/287633 (https://phabricator.wikimedia.org/T133490) [14:26:57] (03CR) 10BBlack: [C: 032 V: 032] cache_misc: remove all do_stream=true [puppet] - 10https://gerrit.wikimedia.org/r/287633 (https://phabricator.wikimedia.org/T133490) (owner: 10BBlack) [14:29:27] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Inline comments. 
You also need to list confluent in" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/287627 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [14:29:29] !log restarting cassandra on maps-test200[1234] (T134514) [14:29:30] T134514: Maps Cassandra Cluster: Restart cassandra-metrics-collector - https://phabricator.wikimedia.org/T134514 [14:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:29:37] (03PS2) 10Muehlenhoff: Update to 4.4.9 [debs/linux44] - 10https://gerrit.wikimedia.org/r/287628 [14:29:56] ottomata: posted comments on https://gerrit.wikimedia.org/r/#/c/287627/1 [14:29:59] ah ok akosiaris, also, i'm not so sure about that url, since I can't actually browse it [14:30:01] (03PS4) 10Krinkle: Lowered $wgMaxUserDBWriteDuration to 5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275734 (https://phabricator.wikimedia.org/T95501) (owner: 10Aaron Schulz) [14:30:02] !log restarting cassandra.metrics-collector on maps-test200[1234] (T134514) - correction [14:30:03] T134514: Maps Cassandra Cluster: Restart cassandra-metrics-collector - https://phabricator.wikimedia.org/T134514 [14:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:30:10] (03CR) 10Krinkle: [C: 031] "Nothing in the last 30 days. Yay" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275734 (https://phabricator.wikimedia.org/T95501) (owner: 10Aaron Schulz) [14:30:11] ottomata: I 've had the exact same problem [14:30:11] http://packages.confluent.io/deb/2.0 [14:30:27] and that XML output only infuriates me [14:30:27] but, if it works in a sources list, it should be right? [14:30:30] ahha [14:30:30] yeah [14:31:00] oh the url is correct [14:31:17] hmm, akosiaris, even if i'm not puppetizing for trusty, i think i should probably add it to trusty too, so we can use it in mw-vagrant [14:31:22] more easily, that ok? 
[14:31:45] if you know the packages works just as well in trusty, sure [14:31:45] do I have to make a separate updates entry? [14:32:00] (03PS2) 10Mobrovac: Change-prop: no automatic restarts [puppet] - 10https://gerrit.wikimedia.org/r/287620 [14:32:02] yeah, akosiaris it doesn't have any init scripts or anything :/ just jars and shell scripts [14:32:30] !log restarting elasticsearch server elastic1028.eqiad.wmnet (T110236), includes JDK upgrade [14:32:31] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [14:32:33] i see that there are two hp-mcp-* [14:32:37] akosiaris: amended commit msg ^ [14:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:32:42] one for jessie and one for trusty, i guess i do the same? [14:32:58] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Wikidata Query Service REST endpoint returns truncated results - https://phabricator.wikimedia.org/T133490#2276138 (10BBlack) 05Open>03Resolved a:03BBlack This works now. There's a significant pause at the start of the transfer from the... [14:33:00] ottomata: probably safer indeed [14:33:07] hmm, or i could just make Suite: stable? [14:33:14] elasticsearch has that [14:33:29] Suite: just refers to the suite of the upstream [14:33:35] not ours [14:33:36] oh [14:33:38] uh [14:33:43] 06Operations, 10Traffic, 10Wikidata, 13Patch-For-Review: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2276144 (10BBlack) 05Open>03Resolved My test cases on cache_text work now, should be resolved! [14:34:26] not sure what that should be then... [14:35:01] in trusty's case ? 
I think "stable" as well [14:35:41] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/287608 (owner: 10Filippo Giunchedi) [14:36:15] 06Operations, 10DBA: Decomission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2276168 (10jcrespo) [14:36:26] 06Operations, 10DBA: Decomission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2266762 (10jcrespo) [14:37:03] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Change-prop: no automatic restarts [puppet] - 10https://gerrit.wikimedia.org/r/287620 (owner: 10Mobrovac) [14:37:19] PROBLEM - graphite.wikimedia.org on graphite2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 104 bytes in 0.075 second response time [14:37:21] thnx akosiaris! [14:37:28] :-) [14:37:47] akosiaris: ? [14:37:53] i need two? [14:38:00] one that says jessie and another that says stable? [14:38:04] OH [14:38:06] SORRY [14:38:10] just saw your comments inline [14:38:11] doh [14:38:12] looking [14:41:46] akosiaris: how did you see Architectures? [14:41:53] the Architecture on the kafka package is [14:41:53] Architecture: all [14:42:12] oh in the release file [14:42:12] AH [14:42:16] i am learning something! [14:42:33] ok [14:42:36] :-) [14:45:27] 06Operations, 10Traffic, 10Wikidata, 13Patch-For-Review: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2276263 (10MZMcBride) This looks fixed to me. Thank you, @BBlack! [14:45:30] (03CR) 10Alexandros Kosiaris: [C: 031] "this looks ok, but first requires creating irc-ores in the private repo. DO NOT MERGE yet." 
[puppet] - 10https://gerrit.wikimedia.org/r/287590 (https://phabricator.wikimedia.org/T134726) (owner: 10Ladsgroup) [14:46:25] (03PS2) 10Ottomata: Add confluent mirror to get Kafka 0.9 in apt [puppet] - 10https://gerrit.wikimedia.org/r/287627 (https://phabricator.wikimedia.org/T121562) [14:47:14] (03CR) 10Hashar: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/155279 (https://bugzilla.wikimedia.org/69478) (owner: 10Hashar) [14:47:23] oh [14:47:24] zuul dead [14:47:29] akosiaris: pushed new patch https://gerrit.wikimedia.org/r/#/c/287627 [14:47:45] i didn't add two entries in updates, not sure what would have been different in the second one [14:47:50] will that work as is for both jessie and trusty? [14:48:39] ottomata1: I am unsure, I think so [14:49:07] ok might as well try! [14:49:07] ottomata1: oh, no it will [14:49:18] it will work ok... hwraid is also reused [14:49:37] ah because ah right, in distributions it is listed [14:49:38] ok [14:49:39] makes sense [14:49:43] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add confluent mirror to get Kafka 0.9 in apt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/287627 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [14:49:47] oops [14:49:49] left get in release [14:49:53] :-) [14:49:56] (03CR) 10Addshore: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/269467 (owner: 10Addshore) [14:50:25] (03PS3) 10Ottomata: Add confluent mirror to get Kafka 0.9 in apt [puppet] - 10https://gerrit.wikimedia.org/r/287627 (https://phabricator.wikimedia.org/T121562) [14:50:39] akosiaris: done [14:51:01] (03CR) 10Alexandros Kosiaris: [C: 031] Add confluent mirror to get Kafka 0.9 in apt [puppet] - 10https://gerrit.wikimedia.org/r/287627 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [14:51:07] !log Zuul went deadlock. 
Restarted [14:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:51:53] thanks akosiaris [14:51:57] (03PS4) 10Ottomata: Add confluent mirror to get Kafka 0.9 in apt [puppet] - 10https://gerrit.wikimedia.org/r/287627 (https://phabricator.wikimedia.org/T121562) [14:52:08] (03CR) 10Ottomata: [C: 032 V: 032] Add confluent mirror to get Kafka 0.9 in apt [puppet] - 10https://gerrit.wikimedia.org/r/287627 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [14:52:14] some of those patches are killing zuul entirely [14:53:22] (03PS8) 10Addshore: WIP DRAFT WMDE_Analytics module [puppet] - 10https://gerrit.wikimedia.org/r/269467 [14:54:11] hashar: it might have been mine? [14:54:38] <_joe_> elukey: yeah, same horror, I see [14:55:12] (03CR) 10jenkins-bot: [V: 04-1] WIP DRAFT WMDE_Analytics module [puppet] - 10https://gerrit.wikimedia.org/r/269467 (owner: 10Addshore) [14:56:31] Krenair: https://snag.gy/Nw91A4.jpg [14:56:52] Waiting for 7 lagged database(s) [14:57:38] PROBLEM - Hadoop NodeManager on analytics1051 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:00:05] anomie ostriches thcipriani marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160509T1500). [15:00:05] yurik: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [15:00:56] hm akosiaris looks like I got the key wrong [15:00:57] Error: unknown key '670540C841468433'! 
[15:01:04] to get it, i dled confluent's key [15:01:10] imported to my gpg [15:01:12] and then did [15:01:18] gpg --with-colons --list-keys Confluent [15:01:22] !log turning off es2001-es2010 [15:01:25] and grabbed it from the pub key [15:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:01:31] pub:-:4096:1:670540C841468433:1423250088:::-:::scESC: [15:01:55] (03CR) 10Krinkle: [C: 04-1] Switched to pt-heartbeat lag detection on s6 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: 10Aaron Schulz) [15:02:55] 06Operations: Some labvirt systems use qemu from "cloud archive" - https://phabricator.wikimedia.org/T127113#2276350 (10Andrew) p:05Normal>03High [15:03:03] 06Operations: Some labvirt systems use qemu from "cloud archive" - https://phabricator.wikimedia.org/T127113#2033017 (10Andrew) a:03Andrew [15:03:07] Hi. [15:03:29] Is someone swatting? [15:04:04] Dereckson: I can SWAT, I noticed yurik wasn't around, and I didn't see anything else on the Deployments page. [15:04:42] I confirm, there is only Yurik on the page. [15:07:01] (03CR) 10Giuseppe Lavagetto: [C: 031] "Given what hashar said, this is fine by me." 
[puppet] - 10https://gerrit.wikimedia.org/r/287603 (https://phabricator.wikimedia.org/T95002) (owner: 10Faidon Liambotis) [15:07:03] (03PS3) 10Jcrespo: Remove es2001-es2010 from production puppet [puppet] - 10https://gerrit.wikimedia.org/r/287612 (https://phabricator.wikimedia.org/T134755) [15:07:36] (03PS4) 10Jcrespo: Remove es2001-es2010 from production puppet [puppet] - 10https://gerrit.wikimedia.org/r/287612 (https://phabricator.wikimedia.org/T134755) [15:07:44] (03PS2) 10Faidon Liambotis: mediawiki: remove all php5-fss references [puppet] - 10https://gerrit.wikimedia.org/r/287603 (https://phabricator.wikimedia.org/T95002) [15:07:53] (03CR) 10Faidon Liambotis: [C: 032 V: 032] mediawiki: remove all php5-fss references [puppet] - 10https://gerrit.wikimedia.org/r/287603 (https://phabricator.wikimedia.org/T95002) (owner: 10Faidon Liambotis) [15:08:34] 06Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#2276389 (10faidon) 05Open>03Resolved a:03faidon [15:08:50] \O/ [15:09:13] (03PS5) 10Jcrespo: Remove es2001-es2010 from production puppet [puppet] - 10https://gerrit.wikimedia.org/r/287612 (https://phabricator.wikimedia.org/T134755) [15:09:17] <_joe_> paravoid: don't we need nutcracker? 
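[editor's note] The `pub` record quoted above comes from `gpg --with-colons --list-keys`, which prints one colon-delimited record per line; in a `pub` record the fifth field is the key ID. Below is a minimal sketch of extracting it, assuming the listing has already been captured as text. The helper name `key_ids_from_colons` is hypothetical; the sample record is the one pasted in the log.

```python
def key_ids_from_colons(listing):
    """Return the key ID (fifth field) of every `pub` record in a colon listing."""
    ids = []
    for line in listing.splitlines():
        fields = line.split(":")
        # Field 1 is the record type; field 5 (index 4) is the key ID.
        if fields[0] == "pub" and len(fields) > 4:
            ids.append(fields[4])
    return ids


# The record quoted in the log.
sample = "pub:-:4096:1:670540C841468433:1423250088:::-:::scESC:"
print(key_ids_from_colons(sample))
```

This only shows where the ID in the "unknown key" error comes from. As the follow-up in the log notes, the key itself also had to be imported into root's keyring before the error went away.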
[15:09:34] <_joe_> oh right, the puppet class only [15:09:58] RECOVERY - Hadoop NodeManager on analytics1051 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:10:07] I took hashar's word for it (that fss was the last blocker), didn't really audit the whole task's status [15:10:14] so if this was wrong, do reopen :) [15:10:15] it is ok [15:10:24] ottomata: yes, it also needs importing the key in the root's gpg repo [15:10:33] <_joe_> paravoid: I think fonts have a separate one [15:10:35] ottomata: I just did so, you should be good to go now [15:10:36] analytics1051 was me restarting hadoop for java upgrades [15:10:36] (03CR) 10Jcrespo: [C: 032] Remove es2001-es2010 from production puppet [puppet] - 10https://gerrit.wikimedia.org/r/287612 (https://phabricator.wikimedia.org/T134755) (owner: 10Jcrespo) [15:11:06] brb [15:11:09] (03CR) 10GWicke: Text VCL: RB ?redirect=false optimization (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/287104 (https://phabricator.wikimedia.org/T134464) (owner: 10BBlack) [15:11:39] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5979808 keys - replication_delay is 0 [15:12:36] oh my it is never ending [15:12:43] _joe_: yeah, but I think that's done, no? 
[15:13:07] https://phabricator.wikimedia.org/T102623 [15:13:07] <_joe_> paravoid: I didn't check if we merged the related patches [15:13:18] PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: Puppet has 1 failures [15:13:35] 06Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#2276419 (10hashar) Puppet looks fine on the CI slave integration-slave-jessie-1001 :-} [15:13:41] <_joe_> yes, it seems ok [15:13:48] https://phabricator.wikimedia.org/T129500 seems to be the only task pending, but I think it can be closed -- hashar? [15:13:59] PROBLEM - puppet last run on mw2165 is CRITICAL: CRITICAL: Puppet has 1 failures [15:14:07] uhoh [15:14:08] PROBLEM - puppet last run on mw1227 is CRITICAL: CRITICAL: Puppet has 1 failures [15:14:08] PROBLEM - puppet last run on mw1190 is CRITICAL: CRITICAL: Puppet has 1 failures [15:14:09] PROBLEM - puppet last run on mw2135 is CRITICAL: CRITICAL: Puppet has 1 failures [15:14:10] 06Operations, 06Discovery, 03Discovery-Search-Sprint, 07Elasticsearch, 13Patch-For-Review: Publish "pending_tasks" count from Elastic search cluster to graphite - https://phabricator.wikimedia.org/T134240#2276434 (10EBernhardson) The spikes look to mostly (but not completely) coincide with the daily rebu... [15:14:14] I have another issue with the ImageMagic policy though [15:14:34] paravoid: I havent verified the gujarati font yet :( [15:14:39] PROBLEM - puppet last run on mw1247 is CRITICAL: CRITICAL: Puppet has 1 failures [15:14:50] <_joe_> hashar: what's the issue with the policy? 
[15:14:51] and looks like php5-fss does not please puppet in prod :( [15:14:57] _joe_: no clue yet [15:14:59] PROBLEM - puppet last run on mw1111 is CRITICAL: CRITICAL: Puppet has 1 failures [15:15:01] (03PS1) 10Dereckson: Add National Digital Library of Brazil domains to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287635 [15:15:19] looking at the logs, I think it's just the usual manifest/fileserver race [15:15:29] so, very temporary [15:16:50] _joe_: Trusty and the puppet manifest install the policy under /etc/ImageMagick/ but on Jessie that is suffixed with -6 : /etc/ImageMagick-6 :-} [15:17:18] I would just symlink it [15:17:23] modules/imagemagick/manifests/install.pp:7: file { '/etc/ImageMagick/policy.xml': [15:17:24] modules/ocg/templates/usr.bin.nodejs.apparmor.erb:74: /etc/ImageMagick/** r, [15:17:25] eek [15:17:32] (03PS1) 10Dereckson: Add museudaimigracao.org.br to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287636 (https://phabricator.wikimedia.org/T134566) [15:17:39] <_joe_> hashar: no please, let's do it correctly [15:18:01] <_joe_> (I am currently busy but I can take a look in 10 mins) [15:22:24] (03CR) 10Hashar: [C: 04-1] "That no longer provisions the package building stuff on Trusty. Thus we have to migrate the debian-glue Jenkins job to Jessie which is T9554" [puppet] - 10https://gerrit.wikimedia.org/r/286873 (owner: 10Hashar) [15:23:11] thcipriani: I've noticed two wgCopyUploadsDomains changes were waiting and added them to https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160509T1500. Would you like to SWAT them when/if Yurik comes too? Alternatively, I can deploy them or postpone them to a later SWAT if you prefer. [15:23:49] (03PS1) 10Andrew Bogott: Spreadcheck should return 0 when everything is good. [puppet] - 10https://gerrit.wikimedia.org/r/287637 [15:24:23] Dereckson: I can SWAT them now.
[15:24:31] Thanks [15:24:43] 06Operations: imagemagick::install refers to directory /etc/ImageMagic which does not exist on Jessie - https://phabricator.wikimedia.org/T134773#2276503 (10hashar) [15:24:47] _joe_: filled as https://phabricator.wikimedia.org/T134773 [15:25:09] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287635 (owner: 10Dereckson) [15:25:42] (03Merged) 10jenkins-bot: Add National Digital Library of Brazil domains to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287635 (owner: 10Dereckson) [15:26:10] 06Operations, 07Performance, 05codfw-rollout: Package and deploy Mcrouter as a replacement for twemproxy - https://phabricator.wikimedia.org/T132317#2276518 (10Joe) I finallly got to a point where a debian package (at least a basic version) is not far down the line: - I included in the source folly, fbthrif... [15:26:39] 06Operations: imagemagick::install refers to directory /etc/ImageMagic which does not exist on Jessie - https://phabricator.wikimedia.org/T134773#2276521 (10Joe) a:03Joe [15:30:07] (03PS1) 10Ladsgroup: ores: set nginx timeout fail time 60s [puppet] - 10https://gerrit.wikimedia.org/r/287640 (https://phabricator.wikimedia.org/T111806) [15:30:12] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287636 (https://phabricator.wikimedia.org/T134566) (owner: 10Dereckson) [15:30:22] (03PS1) 10Faidon Liambotis: wmflib/os_version: add Ubuntu xenial/yakkety [puppet] - 10https://gerrit.wikimedia.org/r/287641 [15:30:24] (03PS1) 10Faidon Liambotis: imagemagick: fix policy.xml path for newer versions [puppet] - 10https://gerrit.wikimedia.org/r/287642 (https://phabricator.wikimedia.org/T134773) [15:30:43] (03Merged) 10jenkins-bot: Add museudaimigracao.org.br to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287636 (https://phabricator.wikimedia.org/T134566) (owner: 10Dereckson) [15:30:49] (03CR) 10Faidon 
Liambotis: [C: 032] wmflib/os_version: add Ubuntu xenial/yakkety [puppet] - 10https://gerrit.wikimedia.org/r/287641 (owner: 10Faidon Liambotis) [15:30:57] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Add National Digital Library of Brazil domains to wgCopyUploadsDomains [[gerrit:287635]] (duration: 00m 27s) [15:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:31:51] Testing. [15:31:57] (03CR) 10jenkins-bot: [V: 04-1] imagemagick: fix policy.xml path for newer versions [puppet] - 10https://gerrit.wikimedia.org/r/287642 (https://phabricator.wikimedia.org/T134773) (owner: 10Faidon Liambotis) [15:32:28] (03PS3) 10BBlack: Text VCL: RB ?redirect=false optimization [puppet] - 10https://gerrit.wikimedia.org/r/287104 (https://phabricator.wikimedia.org/T134464) [15:32:35] <_joe_> paravoid: oh you've already done both commits :P [15:32:38] <_joe_> cool [15:33:44] ok thanks akosiaris [15:34:01] (03PS2) 10Faidon Liambotis: imagemagick: fix policy.xml path for newer versions [puppet] - 10https://gerrit.wikimedia.org/r/287642 (https://phabricator.wikimedia.org/T134773) [15:34:18] (03PS2) 10Alexandros Kosiaris: ores: set nginx timeout fail time 60s [puppet] - 10https://gerrit.wikimedia.org/r/287640 (https://phabricator.wikimedia.org/T111806) (owner: 10Ladsgroup) [15:34:23] _joe_: that require => Class isn't really needed, is it? 
[15:34:25] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] ores: set nginx timeout fail time 60s [puppet] - 10https://gerrit.wikimedia.org/r/287640 (https://phabricator.wikimedia.org/T111806) (owner: 10Ladsgroup) [15:34:27] (03CR) 10BBlack: Text VCL: RB ?redirect=false optimization (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/287104 (https://phabricator.wikimedia.org/T134464) (owner: 10BBlack) [15:34:48] <_joe_> paravoid: actually, if the file is installed before the package is, puppet will fail [15:35:19] _joe_: yes, but the class has a require_package at the top [15:35:33] <_joe_> paravoid: which doesn't guarantee order of execution [15:35:43] <_joe_> given it includes a class via include [15:35:46] <_joe_> which is floating [15:35:48] RECOVERY - puppet last run on mw2165 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [15:36:14] <_joe_> if someone declares the same package in a different place with require_package, puppet might decide to apply it after everything else [15:36:15] thcipriani: works [15:36:15] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 4 others: Make RB ?redirect=false cache-efficient - https://phabricator.wikimedia.org/T134464#2276577 (10BBlack) Patch updated to include matching `redirect=0` (and some comments made clearer, and the deliver-time code was re-arranged to be slightly... [15:36:27] Dereckson: kk, continuing. [15:36:31] that's ensure_package, isn't it? [15:36:40] <_joe_> require_package [15:36:47] <_joe_> ensure_package is simply broken [15:36:53] <_joe_> which is what stdlib ships [15:36:57] # In other words, it ensures the package(s) are installed before [15:36:57] # evaluating any of the resources in the current scope. 
[15:37:24] say our docs of require_package [15:37:50] <_joe_> yeah let me check the code [15:37:56] <_joe_> my memory of it was different [15:37:57] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Add museudaimigracao.org.br to wgCopyUploadsDomains [[gerrit:287636]] (duration: 00m 32s) [15:37:59] ^ Dereckson check please [15:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:38:11] akosiaris: hm, i am running reprepro update on carbon [15:38:16] <_joe_> send Puppet::Parser::Functions.function(:require) [15:38:16] doesn't seem to pick this up [15:38:17] <_joe_> right [15:38:21] 06Operations, 10hardware-requests: reclaim restbase1001-1006 to spares - https://phabricator.wikimedia.org/T130752#2276583 (10fgiunchedi) update: this is complete on the cassandra/restbase side, pending hardware decommission [15:38:25] not that it matters that much for that single line [15:38:38] <_joe_> paravoid: I think we changed that from the first versions [15:38:39] but it'd be nice to clarify this detail :) [15:38:45] <_joe_> I definitely remember we used include [15:39:01] should I --noskipold?
[15:39:10] 06Operations, 10ops-codfw: rack/setup/deploy maps200[1-4] - https://phabricator.wikimedia.org/T134406#2276589 (10Papaul) [15:39:27] 06Operations, 10ops-codfw: rack/setup/deploy maps200[1-4] - https://phabricator.wikimedia.org/T134406#2264720 (10Papaul) a:05Papaul>03Gehel [15:39:35] _joe_: I think you might be confusing ensure_ with require_ [15:39:35] <_joe_> uhm memory failing me apparently [15:39:49] RECOVERY - puppet last run on mw1190 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [15:39:51] <_joe_> paravoid: ensure_package is in stdlib and is seriously broken [15:39:57] RECOVERY - puppet last run on mw1111 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [15:40:12] (03CR) 10Hashar: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/155279 (https://bugzilla.wikimedia.org/69478) (owner: 10Hashar) [15:40:17] RECOVERY - puppet last run on mw2135 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [15:40:38] RECOVERY - puppet last run on mw1182 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:40:58] RECOVERY - puppet last run on mw1227 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:41:07] (03CR) 10Giuseppe Lavagetto: [C: 031] imagemagick: fix policy.xml path for newer versions [puppet] - 10https://gerrit.wikimedia.org/r/287642 (https://phabricator.wikimedia.org/T134773) (owner: 10Faidon Liambotis) [15:41:17] RECOVERY - puppet last run on mw1247 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:41:18] ottomata: hmm perhaps... 
not sure [15:41:27] (03PS3) 10Faidon Liambotis: imagemagick: fix policy.xml path for newer versions [puppet] - 10https://gerrit.wikimedia.org/r/287642 (https://phabricator.wikimedia.org/T134773) [15:41:31] (03PS2) 10Jforrester: Enable VisualEditor by default in SET mode on the Japanese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285985 [15:41:49] (03CR) 10Faidon Liambotis: [C: 032] imagemagick: fix policy.xml path for newer versions [puppet] - 10https://gerrit.wikimedia.org/r/287642 (https://phabricator.wikimedia.org/T134773) (owner: 10Faidon Liambotis) [15:41:52] (03PS1) 10Jcrespo: Remove dns entries for es2001-es2010 [dns] - 10https://gerrit.wikimedia.org/r/287645 (https://phabricator.wikimedia.org/T134755) [15:41:56] !log restarting elasticsearch server elastic1029.eqiad.wmnet (T110236), includes JDK upgrade [15:41:57] thcipriani: tested, works [15:42:01] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [15:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:42:06] Dereckson: thank you for checking! [15:42:30] You're welcome. Thank you for deploying. 
[15:43:10] (03PS3) 10Jforrester: Enable VisualEditor by default in SET mode for logged-in users on the Japanese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285985 [15:43:14] (03PS1) 10Jforrester: Enable VisualEditor for IP users on the Japanese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287646 [15:43:27] James_F: oh wow :) [15:43:55] (03CR) 10GWicke: [C: 031] Text VCL: RB ?redirect=false optimization [puppet] - 10https://gerrit.wikimedia.org/r/287104 (https://phabricator.wikimedia.org/T134464) (owner: 10BBlack) [15:43:57] (03PS1) 10Hashar: contint: drop role::ci::slave::labs::light [puppet] - 10https://gerrit.wikimedia.org/r/287648 [15:44:04] hm, akosiaris --noskipold also no effect [15:44:27] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 621 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5983136 keys - replication_delay is 621 [15:45:12] 06Operations, 10ops-codfw, 06DC-Ops: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2276610 (10jcrespo) We are going to have lots of spares with T129452. I would retire es2005-es2010 and use its disks as spares, reuse es2001-4 for ES disaster recovery and archival. [15:47:22] (03CR) 10Hashar: [C: 04-1] "Please reuse the original commit message https://gerrit.wikimedia.org/r/#/c/274033/ :-)" [puppet] - 10https://gerrit.wikimedia.org/r/274788 (owner: 10Alex Monk) [15:48:08] (03CR) 10Hashar: "No more used on labs. Added to Tuesday, May 10 puppet swat" [puppet] - 10https://gerrit.wikimedia.org/r/287648 (owner: 10Hashar) [15:49:43] 06Operations, 10Datasets-General-or-Unknown, 06WMDE-Analytics-Engineering, 10Wikidata: Push dumps.wm.o logs files to stat1002 - https://phabricator.wikimedia.org/T118739#2276638 (10ArielGlenn) /a/log/webrequest/archive/dumps.wikimedia.org on stat1002 is full of them. Are you looking in the right place? 
[15:50:19] 06Operations, 10Datasets-General-or-Unknown, 06WMDE-Analytics-Engineering, 10Wikidata: Push dumps.wm.o logs files to stat1002 - https://phabricator.wikimedia.org/T118739#2276642 (10Addshore) Ahh there you go! I was looking in the wrong place!!!!!! [15:51:26] !log disabled camus+puppet on analytics1027 as prep step for maintenance on the cluster. [15:54:19] (03PS4) 10BBlack: Text VCL: RB ?redirect=false optimization [puppet] - 10https://gerrit.wikimedia.org/r/287104 (https://phabricator.wikimedia.org/T134464) [15:54:29] (03CR) 10BBlack: [C: 032 V: 032] Text VCL: RB ?redirect=false optimization [puppet] - 10https://gerrit.wikimedia.org/r/287104 (https://phabricator.wikimedia.org/T134464) (owner: 10BBlack) [15:54:37] 06Operations, 10Beta-Cluster-Infrastructure, 10Deployment-Systems, 13Patch-For-Review, 03Scap3: Automate the generation deployment keys (keyholder-managed ssh keys) - https://phabricator.wikimedia.org/T133211#2276655 (10mmodell) @faidon: Yes please join the meeting. [15:54:45] (03PS2) 10Hashar: contint: move package_builder setup to its own class [puppet] - 10https://gerrit.wikimedia.org/r/286873 [15:55:13] (03CR) 10Hashar: "CI jobs are migrated from Trusty to Jessie with https://gerrit.wikimedia.org/r/#/c/287649/ Gotta make sure they still work properly." 
[puppet] - 10https://gerrit.wikimedia.org/r/286873 (owner: 10Hashar) [15:55:14] twentyafterfour: weill do [15:55:15] will* [15:55:39] (03PS3) 10Hashar: contint: move package_builder setup to its own class [puppet] - 10https://gerrit.wikimedia.org/r/286873 (https://phabricator.wikimedia.org/T95545) [15:55:56] (03CR) 10Hashar: "And link to T95545 Migrate all debian-glue jobs to Jessie slaves" [puppet] - 10https://gerrit.wikimedia.org/r/286873 (https://phabricator.wikimedia.org/T95545) (owner: 10Hashar) [15:56:00] :) [15:57:06] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 4 others: Make RB ?redirect=false cache-efficient - https://phabricator.wikimedia.org/T134464#2276670 (10mobrovac) >>! In T134464#2270692, @GWicke wrote: > 301s (returned for title normalization) are always safe to cache (and we do send the correspo... [15:57:47] 06Operations, 10Datasets-General-or-Unknown, 06WMDE-Analytics-Engineering, 10Wikidata: Push dumps.wm.o logs files to stat1002 - https://phabricator.wikimedia.org/T118739#2276685 (10Addshore) https://phabricator.wikimedia.org/T134776 as a followup [15:59:48] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update to 4.4.9 [debs/linux44] - 10https://gerrit.wikimedia.org/r/287628 (owner: 10Muehlenhoff) [16:00:07] (03PS2) 10Muehlenhoff: Add salt grain for install1001 (for debdeploy) [puppet] - 10https://gerrit.wikimedia.org/r/287616 [16:00:34] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add salt grain for install1001 (for debdeploy) [puppet] - 10https://gerrit.wikimedia.org/r/287616 (owner: 10Muehlenhoff) [16:02:06] 06Operations, 10ops-eqiad, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: install two Intel 320 Series SSDSA2CW300G3 2.5" 300GB each in wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T120712#2276716 (10Gehel) [16:02:08] 06Operations, 10ops-eqiad: Rack and Setup new elastic search - https://phabricator.wikimedia.org/T133772#2276715 (10Gehel) [16:02:20] 06Operations, 10ops-eqiad: Rack and Setup new elastic search - 
https://phabricator.wikimedia.org/T133772#2242648 (10Gehel) [16:02:22] 06Operations, 10ops-eqiad, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: install two Intel 320 Series SSDSA2CW300G3 2.5" 300GB each in wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T120712#1859742 (10Gehel) [16:03:01] 06Operations, 10ops-eqiad, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: install two Intel 320 Series SSDSA2CW300G3 2.5" 300GB each in wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T120712#1859742 (10Gehel) [16:03:03] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Reinstall and data reload of WDQS servers - https://phabricator.wikimedia.org/T133566#2276720 (10Gehel) [16:04:35] 06Operations, 07Icinga: ganeti: PROCS CRITICAL: 2 processes ... - https://phabricator.wikimedia.org/T116111#2276724 (10akosiaris) 05Open>03Resolved Resolved in d5419315 [16:05:14] 06Operations: Some labvirt systems use qemu from "cloud archive" - https://phabricator.wikimedia.org/T127113#2276727 (10MoritzMuehlenhoff) My personal preference would be the stock Ubuntu version (since it's updated by standard security support, while the cloud archive often lags behind) [16:06:21] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 4 others: Make RB ?redirect=false cache-efficient - https://phabricator.wikimedia.org/T134464#2276728 (10GWicke) @mobrovac, looking back at your earlier comment I now realize that I misunderstood your concern. The example fix you gave there actually... [16:07:47] 06Operations, 10Phabricator: Set up Yubikey support in Phabricator - https://phabricator.wikimedia.org/T134672#2276737 (10csteipp) This would add Yubi OTP to phabricator as a second factor (from skimming the code, if I'm missing something else, let me know). There isn't much advantage to their OTP method, whi...
[16:12:20] 06Operations: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242#2259160 (10jcrespo) pollux had this issue 2 days ago, sent an email to ops, but it clearly seems related (IO load). [16:12:40] !log restarting elasticsearch server elastic1030.eqiad.wmnet (T110236), includes JDK upgrade [16:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:13:01] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [16:13:39] 06Operations, 10Phabricator: Set up Yubikey support in Phabricator - https://phabricator.wikimedia.org/T134672#2276754 (10mmodell) 05Open>03declined @csteipp: Thanks! I'm not attached to the idea, I just saw it and I thought it might be useful. [16:15:19] (03CR) 10Mobrovac: "Given that https://github.com/wikimedia/change-propagation/pull/26 has been merged and will definitely be deployed before we re-enable cha" [puppet] - 10https://gerrit.wikimedia.org/r/287148 (https://phabricator.wikimedia.org/T134456) (owner: 10GWicke) [16:17:07] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 4 others: Make RB ?redirect=false cache-efficient - https://phabricator.wikimedia.org/T134464#2276765 (10BBlack) Well, we chose to strip it (or not) before sending the request to RB, before we've seen the response. So we can't logically make the ca... [16:17:54] 06Operations: Some labvirt systems use qemu from "cloud archive" - https://phabricator.wikimedia.org/T127113#2276766 (10Andrew) Typically when I upgrade an openstack host, I follow this brainless process: 1) Increment the version in puppet, which updates the version in my apt sources 2) apt-get update && apt-ge... [16:18:10] 06Operations, 10ops-eqiad, 06DC-Ops: eqiad: Rack/Setup 6 new pool servers - https://phabricator.wikimedia.org/T132684#2276767 (10Cmjohnson) 05Open>03Resolved Servers are racked, labeled in switch, racktables updated, google docs updated. 
[16:19:45] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 4 others: Make RB ?redirect=false cache-efficient - https://phabricator.wikimedia.org/T134464#2276773 (10mobrovac) >>! In T134464#2276765, @BBlack wrote: > Well, we chose to strip it (or not) before sending the request to RB, before we've seen the r... [16:21:34] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 4 others: Make RB ?redirect=false cache-efficient - https://phabricator.wikimedia.org/T134464#2276780 (10BBlack) So, just to be sure I understand: the problem here is that when both title **and** wikitext redirects apply to a single request, with ?r... [16:26:54] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 4 others: Make RB ?redirect=false cache-efficient - https://phabricator.wikimedia.org/T134464#2276812 (10BBlack) Sorry, that comment was written before I read @mobrovac's. So the issue is just ensuring ?redirect=false is preserved in the 301's Loca... [16:28:35] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 4 others: Make RB ?redirect=false cache-efficient - https://phabricator.wikimedia.org/T134464#2276816 (10mobrovac) >>! In T134464#2276780, @BBlack wrote: > So, just to be sure I understand: the problem here is that when both title **and** wikitext r... [16:28:52] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 4 others: Make RB ?redirect=false cache-efficient - https://phabricator.wikimedia.org/T134464#2276817 (10mobrovac) >>! In T134464#2276812, @BBlack wrote: > Sorry, that comment was written before I read @mobrovac's. So the issue is just ensuring ?re... 
[16:34:23] !log restarting elasticsearch server elastic1031.eqiad.wmnet (T110236), includes JDK upgrade [16:34:24] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [16:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:37:29] (03PS3) 10Alex Monk: Revert "Revert "phabricator: Send weekly mail every week instead of on certain monthdays"" [puppet] - 10https://gerrit.wikimedia.org/r/274788 [16:37:56] (03CR) 10Alex Monk: "Hashar: your -1 has stuck, how's the new commit message?" [puppet] - 10https://gerrit.wikimedia.org/r/274788 (owner: 10Alex Monk) [16:38:38] (03PS1) 10BBlack: RB: fix redirect=false support for 301 title redirects [puppet] - 10https://gerrit.wikimedia.org/r/287659 (https://phabricator.wikimedia.org/T134464) [16:40:20] gehel: last one, woo! [16:40:40] ebernhardson: but another full cluster restart already planned for JDK upgrade... [16:44:48] (03CR) 10Mobrovac: [C: 031] RB: fix redirect=false support for 301 title redirects [puppet] - 10https://gerrit.wikimedia.org/r/287659 (https://phabricator.wikimedia.org/T134464) (owner: 10BBlack) [16:46:17] (03CR) 10GWicke: [C: 031] RB: fix redirect=false support for 301 title redirects [puppet] - 10https://gerrit.wikimedia.org/r/287659 (https://phabricator.wikimedia.org/T134464) (owner: 10BBlack) [16:46:33] 06Operations, 10Traffic, 10Wikidata, 13Patch-For-Review: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2276929 (10Trung.anh.dinh) Thank you for the fix @BBlack :) [16:46:51] (03PS1) 10Yuvipanda: ldap: Get rid of cleanup-pam-config script [puppet] - 10https://gerrit.wikimedia.org/r/287660 [16:46:53] (03PS1) 10Yuvipanda: ldap: Fix another arrow alignment [puppet] - 10https://gerrit.wikimedia.org/r/287661 [16:46:55] (03PS1) 10Yuvipanda: ldap: Remove some more ensure => absents no longer needed [puppet] - 10https://gerrit.wikimedia.org/r/287662 [16:46:57] 
(03PS1) 10Yuvipanda: [WIP] ldap: Cleanup module [puppet] - 10https://gerrit.wikimedia.org/r/287663 [16:47:32] (03CR) 10Giuseppe Lavagetto: [C: 031] wgCopyUploadProxy: Vary per datacenter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287095 (owner: 10Alexandros Kosiaris) [16:48:09] mutante: ^ I started a cleanup of the ldap module [16:51:23] 06Operations: Some labvirt systems use qemu from "cloud archive" - https://phabricator.wikimedia.org/T127113#2276976 (10MoritzMuehlenhoff) I'm fine either way. The "cloud archive" is Ubuntu-specific anyway, so with a later migration to Debian this issue will vanish anyway. [16:51:56] YuviPanda: https://cncf.io/news/announcement/2016/05/cloud-native-computing-foundation-accepts-prometheus-second-hosted-project [16:52:14] cmjohnson1: quick one! https://gerrit.wikimedia.org/r/#/c/287607/1 - looks good for a merge or am I missing something [16:52:17] ? [16:52:28] godog: nice! [16:53:28] (03CR) 10Elukey: [C: 032] Reserve extra IP addresses for the new AQS hosts. [dns] - 10https://gerrit.wikimedia.org/r/287605 (https://phabricator.wikimedia.org/T133785) (owner: 10Elukey) [16:53:57] (03CR) 10Cmjohnson: [C: 031] "I verified the mac addresses looks good" [puppet] - 10https://gerrit.wikimedia.org/r/287607 (https://phabricator.wikimedia.org/T133785) (owner: 10Elukey) [16:54:08] elukey: do you need me to merge? [16:54:27] 06Operations: Some labvirt systems use qemu from "cloud archive" - https://phabricator.wikimedia.org/T127113#2276996 (10Andrew) ok, I will upgrade the qemu version on all the virt nodes. [16:54:53] cmjohnson1: nono just wanted to know if the procedure that I followed was correct, I'll take care of it :) [16:56:45] YuviPanda: aye, good times! 
I'll see if I can work on the nginx/reverse proxy tomorrow btw [16:57:51] (03CR) 10Yuvipanda: [C: 04-1] "HTTP edge caching is very alluring, but is a source of more problems than fixes if we do not have an easy invalidation solution (which we " [puppet] - 10https://gerrit.wikimedia.org/r/287570 (owner: 10Ladsgroup) [16:58:13] godog: \o/ awesome. then we can easily add it to grafana.wikimedia.org :D [16:59:41] YuviPanda: yup, I'm about to jump into a meeting but I was getting 'access denied' when trying to use the token to talk to apiserver, you can see that in the "Status" tab in the ui [16:59:44] (03PS2) 10Yurik: Removed obsolete graph ext settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287299 [17:00:04] gehel: Respected human, time to deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160509T1700). Please do the needful. [17:00:09] godog: ok I'll take a look [17:00:38] ok thanks! [17:01:38] (03PS2) 10Elukey: Add aqs100[345] DCHP configuration. [puppet] - 10https://gerrit.wikimedia.org/r/287607 (https://phabricator.wikimedia.org/T133785) [17:05:07] Nothing planned for this week WDQS deployment. SMalyshev let me know if I'm wrong with this and I'll push whatever you need... [17:05:49] !log cluster restart completed for eqiad / codfw elasticsearch (T110236)$ [17:05:55] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [17:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:06:12] (03CR) 10Elukey: [C: 032] Add aqs100[345] DCHP configuration. 
[puppet] - 10https://gerrit.wikimedia.org/r/287607 (https://phabricator.wikimedia.org/T133785) (owner: 10Elukey) [17:07:06] !log executed authdns-update on ns0.w.o to introduce new aqs records [17:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:08:41] (03PS2) 10BBlack: RB: fix redirect=false support for 301 title redirects [puppet] - 10https://gerrit.wikimedia.org/r/287659 (https://phabricator.wikimedia.org/T134464) [17:08:48] (03CR) 10BBlack: [C: 032 V: 032] RB: fix redirect=false support for 301 title redirects [puppet] - 10https://gerrit.wikimedia.org/r/287659 (https://phabricator.wikimedia.org/T134464) (owner: 10BBlack) [17:09:58] 06Operations, 10Traffic, 10Wikidata: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2277093 (10TerraCodes) [17:12:15] gehel: can we do that at the same time as the es 2.x upgrade? [17:12:48] ebernhardson: do we have any timeline on the es 2.x upgrade? It is still a bit unclear to me... [17:13:36] gehel: aiming for end of May/beginning of June. The cirrus search / plugin end of things is pretty much done now. We are mostly in the testing phase now [17:14:33] ebernhardson: I did not follow the discussions closely... do we have a way to run on both 1.x and 2.x? Or do we know how to handle the switch between the versions? [17:15:24] gehel: that's what we are figuring out between now and then :) Based on the changes we've had to make though it looks like the update process hasn't changed, so we can keep writing to both clusters but will have to flip/flop the cluster we search from when rolling out the code [17:15:33] ebernhardson: And I'm actually not joking when I'm saying that I'd like to have a constant cluster restart... the idea is almost mature in my head, I "just" need to transform it into working code... [17:15:55] JohanJ: Can you wait with Tech/news delivery for dewiki a bit?
I forget it and will start translation now ;) [17:16:26] ebernhardson: if I can get a constant restart in place, that would kill this discussion of knowing when we do those restart (answer: all the time!) [17:19:05] gehel: you crazy :) [17:19:12] but if it works, sure [17:21:13] 06Operations, 10Ops-Access-Requests: Grant root access for user madhuvishy for servers notebook1001 and 1002 - https://phabricator.wikimedia.org/T134716#2274165 (10Dzahn) It seems these servers are not in site.pp yet (but they are in Icinga, so getting puppet default classes). They should be added to clarify... [17:22:21] !log analytics1001 Yarn+HDFS masters failed over to analytics1002 for Java upgrades [17:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:22:47] (03PS2) 10Gehel: Don't start Jolokia each time, let Updater start it [puppet] - 10https://gerrit.wikimedia.org/r/287131 (https://phabricator.wikimedia.org/T134523) (owner: 10Smalyshev) [17:23:45] gehel: this is Blazegraph collector: https://github.com/wikimedia/operations-puppet/blob/production/modules/wdqs/files/monitor/blazegraph.py [17:24:28] gehel: updater doesn't have the API, so we use Jolokia to create HTTP endpoint for counters. [17:24:40] (03CR) 10Krinkle: [C: 031] contint: drop role::ci::slave::labs::light [puppet] - 10https://gerrit.wikimedia.org/r/287648 (owner: 10Hashar) [17:24:40] SMalyshev: I think I had a look at this at some point. Blazegraph only re-exposes some of the JMX metrics, not all of them. [17:25:00] * gehel needs to dig more, see if there is something interesting hidden behind the scene [17:25:07] (03PS2) 10Dzahn: contint: drop role::ci::slave::labs::light [puppet] - 10https://gerrit.wikimedia.org/r/287648 (owner: 10Hashar) [17:25:11] before, we started/stopped it each time we need counters, and that causes FD leak. 
But we can start the jolokia agent when we start updater and just leave it running [17:25:14] (03CR) 10Gehel: [C: 032] Don't start Jolokia each time, let Updater start it [puppet] - 10https://gerrit.wikimedia.org/r/287131 (https://phabricator.wikimedia.org/T134523) (owner: 10Smalyshev) [17:25:18] (03CR) 10Dzahn: [C: 032] contint: drop role::ci::slave::labs::light [puppet] - 10https://gerrit.wikimedia.org/r/287648 (owner: 10Hashar) [17:25:54] gehel: we need to check the collector still works though... which probably needs you as I don't have access to these files [17:26:03] e.g. diamond logs [17:26:20] SMalyshev: yep, I'll have a look. Deploying right now... [17:26:33] thanks! [17:26:47] (03PS3) 10Dzahn: contint: drop role::ci::slave::labs::light [puppet] - 10https://gerrit.wikimedia.org/r/287648 (owner: 10Hashar) [17:29:19] SMalyshev: diamond restarted on wdqs1001 and stopped throwing errors about jolokia, so looks good. I'll still check graphite to see if we have data again [17:30:43] SMalyshev: How do we publish the metrics inside the updater? Do we have some kind of histogram internally? [17:30:53] !log analytics1002 Yarn+HDFS masters failed over to analytics1001 for Java upgrades (restored original state) [17:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:31:27] (03PS17) 10Dzahn: ircserver: move ircd.conf to public repo [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271) [17:32:19] SMalyshev: I have my answer... we use dropwizard metrics... That actually skew the data... [17:34:51] JohanJ: I'm finished [17:36:13] 06Operations, 10Ops-Access-Requests: Grant root access for user madhuvishy for servers notebook1001 and 1002 - https://phabricator.wikimedia.org/T134716#2277197 (10yuvipanda) No work has been done on them yet, and so they aren't in site.pp. Should be unrelated to the access request itself though. 
[17:37:24] gehel: we use com.codahale.metrics, whatever it is :) [17:37:57] !log camus+puppet re-enabled on analytics1027 after maintenance [17:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:38:05] and JmxReporter from there [17:38:22] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271) (owner: 10Dzahn) [17:42:58] 06Operations, 10Ops-Access-Requests: Grant root access for user madhuvishy for servers notebook1001 and 1002 - https://phabricator.wikimedia.org/T134716#2277235 (10Dzahn) It is related to the access request because we grant access via puppet, by putting admin groups on roles in hiera. Also servers that are not... [17:44:38] (03CR) 10Dzahn: [C: 032] ircserver: move ircd.conf to public repo [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271) (owner: 10Dzahn) [17:44:52] hmm akosiaris any tips for troubleshooting reprepro update? [17:45:01] just went to look for the gerrit rest api endpoint and cant find it! is it disabled? [17:49:52] Luke081515: thanks. :) [17:51:20] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 2 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2277243 (10Riley_Huntley) A database query error has occurred. This may indicate a... [18:03:38] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [18:03:38] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:03:39] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:04:38] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:12:03] anomie: o/ hello! i'm not sure you're the right person to ask but we're seeing some really weird auth issues in the app and i was hoping you could help me or point me to someone who can [18:12:28] niedzielski: What's the problem? [18:12:28] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [18:12:56] anomie: we're able to use curl to authenticate fine in the same manner as we do in app. unfortunately, in app the socket is timing out in an unusual way we haven't seen before [18:13:29] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [18:13:37] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [18:13:59] anomie: the android app hasn't changed in a while and the ios app also seems to be experiencing similar issues [18:15:18] niedzielski: Doesn't sound like anything I'd have any idea about. Sounds like some network-layer issue rather than anything in MediaWiki, unless you're somehow making a request that hangs MediaWiki and makes your sockets time out in a weird way. [18:15:27] anomie: our working theory is that our networking library isn't playing nicely with nginx over an http/2 connection: https://github.com/square/okhttp/issues/2543 [18:17:11] niedzielski: and i just confirmed that excluding http/2 from the OkHttpClient's protocols list fixes our issue -- i guess we'd still need to know from the opsen if/when there will be adverse consequences of avoiding http/2 while nginx/okhttp get fixes in for the underlying problem [18:22:39] 06Operations, 10Ops-Access-Requests: Grant root access for user madhuvishy for servers notebook1001 and 1002 - https://phabricator.wikimedia.org/T134716#2274165 (10RobH) The operations meeting discussion (as I understood it) was that we need to know the exact items that @madhuvishy plans to run and access, as... 
[18:22:47] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 2 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2277367 (10matmarex) @Riley_Huntley Were you shown an exception identifier like 'Vx... [18:23:11] anomie mdholloway coreyfloyd: who would be a good person to ping on forcing HTTP1.1 usage? [18:23:54] niedzielski: i think i've seen bblack's name on the relevant patches [18:24:09] make a ticket and add the tag "traffic" [18:24:38] 06Operations, 06Labs, 10Tool-Labs: toolserver.org certificate to expire 2016-06-30 - https://phabricator.wikimedia.org/T134798#2277377 (10chasemp) [18:25:19] 06Operations, 10ops-eqiad, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: install two Intel 320 Series SSDSA2CW300G3 2.5" 300GB each in wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T120712#2277392 (10RobH) I imagine the smaller capacity disks were installed by accident, as we have on shelf s... [18:25:34] 06Operations, 06Labs, 10Tool-Labs: toolserver.org certificate to expire 2016-06-30 - https://phabricator.wikimedia.org/T134798#2277394 (10valhallasw) We still provide redirects, and I think those are still used. Can we use the new letsencrypt-for-simple-frontends manifests to keep this online? 
[18:25:43] niedzielski: if you need to disable http/2 in your client because the client has http/2 bugs, I don't think there's anything I can/should do about that [18:26:25] niedzielski: there's already as many consequences as there ever will be: if you don't use http/2, you use http/1 and lose some performance and connection coalescing (which is also just a perf hack) [18:26:44] bblack: thanks :) [18:26:49] (03PS4) 10EBernhardson: A/B/C test of control vs textcat vs accept-lang + textcat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268048 (https://phabricator.wikimedia.org/T121542) [18:29:06] (03PS8) 10Aaron Schulz: Switched to pt-heartbeat lag detection on s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) [18:30:58] (03PS5) 10Dzahn: ircserver: don't use TS6 protocol, no other servers [puppet] - 10https://gerrit.wikimedia.org/r/286785 (https://bugzilla.wikimedia.org/134271) [18:32:48] (03CR) 10Dzahn: [C: 032] ircserver: don't use TS6 protocol, no other servers [puppet] - 10https://gerrit.wikimedia.org/r/286785 (https://bugzilla.wikimedia.org/134271) (owner: 10Dzahn) [18:32:57] (03PS2) 10Andrew Bogott: Remove ldap/dns services from labcontrol1001 and labcontrol1002 [puppet] - 10https://gerrit.wikimedia.org/r/287235 (https://phabricator.wikimedia.org/T126758) [18:33:07] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5981090 keys - replication_delay is 38 [18:33:48] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[18:35:04] (03CR) 10Andrew Bogott: [C: 032] Remove ldap/dns services from labcontrol1001 and labcontrol1002 [puppet] - 10https://gerrit.wikimedia.org/r/287235 (https://phabricator.wikimedia.org/T126758) (owner: 10Andrew Bogott) [18:37:58] !log argon - restarting ircd (this is the old server) [18:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:40:16] (03CR) 10Dzahn: "restarted ircd on argon after this and the previous config change. confirmed still working" [puppet] - 10https://gerrit.wikimedia.org/r/286785 (https://bugzilla.wikimedia.org/134271) (owner: 10Dzahn) [18:41:37] PROBLEM - Auth DNS on labs-ns0-former-placeholder.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [18:41:41] !log irc.wm.org - before restarting ircd on old, ~ 199 users on new, after: ~ 293 users on new [18:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:43:07] PROBLEM - Auth DNS on labs-ns1-former-placeholder.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [18:43:07] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 642 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5981090 keys - replication_delay is 642 [18:44:27] 06Operations, 10Wikimedia-IRC-RC-Server, 13Patch-For-Review: Replace ircd-ratbox with something newer/maintained - https://phabricator.wikimedia.org/T134271#2277452 (10Dzahn) also: https://gerrit.wikimedia.org/r/#/c/286785/ merged as of today we had ~ 96 users left on the old server and 191 users on the... [18:45:41] 06Operations: decom argon - https://phabricator.wikimedia.org/T134223#2277457 (10Dzahn) as of today we had ~ 96 users left on the old server and 191 users on the new server after merging the last 2 ircd config changes i restarted ircd on the old server and right after we have now 293 users on new server, so c... 
[18:46:21] 06Operations: decom argon - https://phabricator.wikimedia.org/T134223#2277460 (10Dzahn) I will proceed shutting this down.. [18:47:21] (03PS2) 10Andrew Bogott: Purge labs dns/ldap code [puppet] - 10https://gerrit.wikimedia.org/r/287236 (https://phabricator.wikimedia.org/T126758) [18:49:37] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:49:59] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:50:13] 06Operations, 06Performance-Team, 10Traffic: Understand and improve streaming behaviour from Varnish - https://phabricator.wikimedia.org/T126015#2277474 (10BBlack) 05Open>03Resolved a:03BBlack Closing this for now: we tried enabling streaming in the text cluster for pass-traffic, and all it ended up do... [18:51:35] 06Operations, 10MediaWiki-ResourceLoader, 10Traffic: commons.wikimedia.org home page has 404s loaded from JS (RL?) - https://phabricator.wikimedia.org/T134368#2277480 (10BBlack) Ping. Still seeing these 404s on https://commons.wikimedia.org/ . ... [18:53:46] 06Operations, 10Citoid, 10ContentTranslation-cxserver, 10RESTBase, and 3 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2277493 (10BBlack) I saw the deprecation email about RESTBase itself. Are we clear to pull the DNS for... [18:55:18] 06Operations, 10MobileFrontend, 10Reading-Web-Backlog, 10Traffic: Seeing desktop text cache while browsing mobile sites - https://phabricator.wikimedia.org/T133441#2277496 (10BBlack) Ping - Reading, Mobile, someone should know what's causing this, right? 
[18:56:00] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:56:56] 06Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 3 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2277505 (10Multichill) Did you guys do the Wikidata specific magic? @VIGNERON reported at the Wikidata Village Pump that he is unable to add a... [18:57:37] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:58:02] 06Operations, 10Citoid, 10ContentTranslation-cxserver, 10RESTBase, and 3 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2216638 (10GWicke) @bblack, if there is no specific reason to pull the DNS soon, I would propose to keep... [18:59:08] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 708 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5981188 keys - replication_delay is 708 [18:59:34] 06Operations, 10Citoid, 10ContentTranslation-cxserver, 10RESTBase, and 3 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2277517 (10BBlack) Ok [19:00:40] 06Operations, 10Citoid, 10ContentTranslation-cxserver, 10RESTBase, and 3 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2277537 (10BBlack) And actually, on the issue of the `.eqiad.` hostnames: they've been broken for a whil... 
[19:01:54] (03PS1) 10Ottomata: Fix reprepro updates entry for confluent kafka [puppet] - 10https://gerrit.wikimedia.org/r/287678 (https://phabricator.wikimedia.org/T121562) [19:03:18] 06Operations, 10MobileFrontend, 10Reading-Web-Backlog, 10Traffic: Seeing desktop text cache while browsing mobile sites - https://phabricator.wikimedia.org/T133441#2277546 (10Jdlrobson) 05Open>03Invalid This is due to MobileContext::isBlacklistedPage and this: https://github.com/wikimedia/operations-me... [19:03:35] 06Operations, 10Citoid, 10ContentTranslation-cxserver, 10RESTBase, and 3 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2277548 (10BBlack) >>! In T133001#2277513, @GWicke wrote: > @bblack, if there is no specific reason to p... [19:04:21] (03CR) 10Ottomata: [C: 032] Fix reprepro updates entry for confluent kafka [puppet] - 10https://gerrit.wikimedia.org/r/287678 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [19:09:38] (03PS1) 10Jdlrobson: Enable lazy loaded images on bewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287681 (https://phabricator.wikimedia.org/T134768) [19:10:13] 06Operations, 10Citoid, 10ContentTranslation-cxserver, 10RESTBase, and 3 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2277566 (10GWicke) @bblack, restbase.wikimedia.org was never published or in any way supposed to work. A... 
[19:10:32] (03PS1) 10BBlack: Remove (citoid|cxserver|restbase).eqiad.wm.o hostnames [dns] - 10https://gerrit.wikimedia.org/r/287682 (https://phabricator.wikimedia.org/T133001) [19:12:12] (03PS5) 10EBernhardson: A/B/C test of control vs textcat vs accept-lang + textcat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268048 (https://phabricator.wikimedia.org/T121542) [19:13:13] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 3 others: Support following MediaWiki redirects when retrieving HTML revisions - https://phabricator.wikimedia.org/T118548#2277584 (10GWicke) [19:13:14] (03PS2) 10Jdlrobson: Enable $wgMFStripResponsiveImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286505 (owner: 10Brion VIBBER) [19:13:17] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 4 others: Make RB ?redirect=false cache-efficient - https://phabricator.wikimedia.org/T134464#2277581 (10GWicke) 05Open>03Resolved a:03GWicke Thank you, @bblack! [19:14:58] (03PS3) 10Jdlrobson: Enable $wgMFStripResponsiveImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286505 (https://phabricator.wikimedia.org/T134115) (owner: 10Brion VIBBER) [19:15:25] 06Operations, 13Patch-For-Review: imagemagick::install refers to directory /etc/ImageMagic which does not exist on Jessie - https://phabricator.wikimedia.org/T134773#2277591 (10hashar) That fixed it on integration-slave-jessie1001 :-} OCG would also need a fix for whenever it is switched to Jessie. ``` modul... [19:20:14] 06Operations, 10MediaWiki-ResourceLoader, 10Traffic: commons.wikimedia.org home page has 404s loaded from JS (RL?) - https://phabricator.wikimedia.org/T134368#2277600 (10Krinkle) I can't reproduce this. Wikimedia app servers haven't produced `/static/1.27.0-$version/` urls since February 26 ([config diff](ht... 
[19:20:48] 06Operations, 10Citoid, 10ContentTranslation-cxserver, 10RESTBase, and 4 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2277601 (10BBlack) Ok got it, thanks! [19:22:28] 06Operations, 10MediaWiki-ResourceLoader, 10Traffic: Image urls in CSS remain cached with old $wgResourceBasePath - https://phabricator.wikimedia.org/T134368#2277605 (10Krinkle) [19:24:08] (03PS2) 10BBlack: Remove deprecated/dysfunctional wm.o service hostnames [dns] - 10https://gerrit.wikimedia.org/r/287682 (https://phabricator.wikimedia.org/T133001) [19:24:33] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:25:44] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [19:25:59] 06Operations, 10MediaWiki-ResourceLoader, 10Traffic: Image urls in CSS remain cached with old $wgResourceBasePath - https://phabricator.wikimedia.org/T134368#2277614 (10Krinkle) While we waited the usual 30 days before removing the symlinks for old branches, they were still cached client side (which has no e... [19:26:24] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [19:28:53] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [19:32:52] 06Operations, 10MediaWiki-ResourceLoader, 10Traffic: Image urls in CSS remain cached with old $wgResourceBasePath - https://phabricator.wikimedia.org/T134368#2277677 (10BBlack) Ok. I can see that they're gone in a fresh browser install or incognito view. But I've definitely closed the whole browser app sin... 
[19:33:00] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:33:20] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:37:07] (03CR) 10GWicke: [C: 031] Remove deprecated/dysfunctional wm.o service hostnames [dns] - 10https://gerrit.wikimedia.org/r/287682 (https://phabricator.wikimedia.org/T133001) (owner: 10BBlack) [19:42:53] 06Operations, 10DNS, 10Traffic: Replace test hostnames in datecenter-specific subdomains with dashed names - https://phabricator.wikimedia.org/T134807#2277734 (10BBlack) [19:43:04] 06Operations, 10DNS, 10Traffic: Replace test hostnames in datecenter-specific subdomains with dashed names - https://phabricator.wikimedia.org/T134807#2277747 (10BBlack) p:05Triage>03Low [19:44:04] /22/22 [19:44:52] (03CR) 10BBlack: [C: 032] Remove deprecated/dysfunctional wm.o service hostnames [dns] - 10https://gerrit.wikimedia.org/r/287682 (https://phabricator.wikimedia.org/T133001) (owner: 10BBlack) [19:46:06] (03CR) 10Addshore: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/269467 (owner: 10Addshore) [19:47:09] (03PS2) 10Rush: labs pdns updates [puppet] - 10https://gerrit.wikimedia.org/r/287233 [19:48:11] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: puppet fail [19:48:17] (03PS1) 10Dzahn: remove argon from site.pp and install_server [puppet] - 10https://gerrit.wikimedia.org/r/287685 (https://phabricator.wikimedia.org/T134223) [19:49:52] (03PS2) 10Dzahn: remove argon from site.pp and install_server [puppet] - 10https://gerrit.wikimedia.org/r/287685 (https://phabricator.wikimedia.org/T134223) [19:50:20] (03PS1) 10BBlack: Text VCL: tighten up rest|cxserver|citoid hostname matching [puppet] - 10https://gerrit.wikimedia.org/r/287686 [19:50:43] (03PS2) 10BBlack: Text VCL: tighten up rest|cxserver|citoid hostname matching [puppet] - 10https://gerrit.wikimedia.org/r/287686 [19:51:02] (03CR) 10Rush: [C: 032] labs pdns updates 
[puppet] - 10https://gerrit.wikimedia.org/r/287233 (owner: 10Rush) [19:54:22] (03Abandoned) 10Dzahn: add base::firewall on argon, remove from role::mw_rc_irc [puppet] - 10https://gerrit.wikimedia.org/r/287018 (owner: 10Dzahn) [19:54:38] 07Puppet, 10Continuous-Integration-Infrastructure: Puppet fails on Jessie permanent CI slaves - https://phabricator.wikimedia.org/T134808#2277778 (10hashar) [19:54:59] (03PS3) 10Dzahn: remove argon from site.pp and install_server [puppet] - 10https://gerrit.wikimedia.org/r/287685 (https://phabricator.wikimedia.org/T134223) [19:55:49] (03PS1) 10Ottomata: Use deployment-kafka01 instead of deployment-kafka02 - 01 is jessie [puppet] - 10https://gerrit.wikimedia.org/r/287688 [19:56:05] (03PS2) 10Ottomata: Use deployment-kafka01 instead of deployment-kafka02 - 01 is jessie [puppet] - 10https://gerrit.wikimedia.org/r/287688 [19:56:17] (03CR) 10Ottomata: [C: 032 V: 032] Use deployment-kafka01 instead of deployment-kafka02 - 01 is jessie [puppet] - 10https://gerrit.wikimedia.org/r/287688 (owner: 10Ottomata) [19:56:20] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: puppet fail [19:56:51] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: puppet fail [19:56:53] (03PS1) 10Rush: labs dns recursor template fix var [puppet] - 10https://gerrit.wikimedia.org/r/287689 [19:57:05] puppet fail there is me, fix incoming [19:57:39] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: puppet fail [19:57:52] (03PS2) 10Rush: labs dns recursor template fix var [puppet] - 10https://gerrit.wikimedia.org/r/287689 [19:58:20] !log demon@tin Synchronized php-1.27.0-wmf.22/extensions/NavigationTiming/: backport firstPaintTime fix (duration: 00m 33s) [19:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:59:12] (03CR) 10Rush: [C: 032] labs dns recursor template fix var [puppet] - 10https://gerrit.wikimedia.org/r/287689 (owner: 10Rush) [19:59:45] (03CR) 10Rush: [V: 032] labs dns recursor 
template fix var [puppet] - 10https://gerrit.wikimedia.org/r/287689 (owner: 10Rush) [20:00:04] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160509T2000). [20:00:39] PROBLEM - puppet last run on chromium is CRITICAL: CRITICAL: puppet fail [20:00:50] gilles: NavTiming fix is live. [20:01:02] If things look ok in an hour or two I'll deploy wmf.23 [20:01:18] ostriches: I'll keep an eye on it and keep you posted [20:01:29] (03PS1) 10BBlack: text VCL: host header is already downcased in shared normalization [puppet] - 10https://gerrit.wikimedia.org/r/287694 [20:02:41] (03PS1) 10Hashar: apache: skip ganglia when it is unwanted [puppet] - 10https://gerrit.wikimedia.org/r/287695 (https://phabricator.wikimedia.org/T134808) [20:03:17] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [20:03:26] (03PS1) 10Ottomata: Move deployment-prepzookeeper and kafka configs from ops/puppet to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/287696 [20:03:54] (03CR) 10Ottomata: [C: 032 V: 032] Move deployment-prepzookeeper and kafka configs from ops/puppet to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/287696 (owner: 10Ottomata) [20:04:57] PROBLEM - puppet last run on hydrogen is CRITICAL: CRITICAL: puppet fail [20:04:57] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [20:05:30] (03PS2) 10BBlack: text VCL: host header is already downcased in shared normalization [puppet] - 10https://gerrit.wikimedia.org/r/287694 [20:05:41] (03PS1) 10Gehel: Publishes non aggregated metrics for WDQS updater [puppet] - 10https://gerrit.wikimedia.org/r/287697 [20:06:08] (03CR) 10Dzahn: [C: 032] remove argon from site.pp and install_server [puppet] - 
10https://gerrit.wikimedia.org/r/287685 (https://phabricator.wikimedia.org/T134223) (owner: 10Dzahn) [20:06:17] (03PS4) 10Dzahn: remove argon from site.pp and install_server [puppet] - 10https://gerrit.wikimedia.org/r/287685 (https://phabricator.wikimedia.org/T134223) [20:06:26] (03PS2) 10Gehel: Publishes non aggregated metrics for WDQS updater [puppet] - 10https://gerrit.wikimedia.org/r/287697 [20:07:16] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: puppet fail [20:08:13] (03CR) 10jenkins-bot: [V: 04-1] Publishes non aggregated metrics for WDQS updater [puppet] - 10https://gerrit.wikimedia.org/r/287697 (owner: 10Gehel) [20:09:44] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Decommission es2001-es2010 - https://phabricator.wikimedia.org/T134755#2275897 (10RobH) When these are really for total shutdown, please assign to me so we can figure out how much we're going to reclaim for parts, and how many will be decommissioned and... [20:10:34] (03PS3) 10BBlack: Text VCL: tighten up rest|cxserver|citoid hostname matching [puppet] - 10https://gerrit.wikimedia.org/r/287686 [20:11:04] (03CR) 10BBlack: [C: 032 V: 032] Text VCL: tighten up rest|cxserver|citoid hostname matching [puppet] - 10https://gerrit.wikimedia.org/r/287686 (owner: 10BBlack) [20:13:16] (03PS3) 10BBlack: text VCL: host header is already downcased in shared normalization [puppet] - 10https://gerrit.wikimedia.org/r/287694 [20:13:35] (03CR) 10BBlack: [C: 032 V: 032] text VCL: host header is already downcased in shared normalization [puppet] - 10https://gerrit.wikimedia.org/r/287694 (owner: 10BBlack) [20:13:45] (03PS1) 10Dzahn: Revert "remove argon from site.pp and install_server" [puppet] - 10https://gerrit.wikimedia.org/r/287717 [20:13:57] (03PS2) 10Dzahn: Revert "remove argon from site.pp and install_server" [puppet] - 10https://gerrit.wikimedia.org/r/287717 [20:14:14] (03PS1) 10Rush: change up alerting for services within tools in icinga [puppet] - 
10https://gerrit.wikimedia.org/r/287723 [20:14:19] 07Puppet, 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Puppet fails on labs instances using apache::site puppet class - https://phabricator.wikimedia.org/T134808#2277814 (10hashar) [20:14:35] 07Puppet, 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Puppet fails on labs instances using apache::site puppet class - https://phabricator.wikimedia.org/T134808#2277778 (10hashar) p:05Triage>03Normal a:03hashar [20:15:01] (03CR) 10Dzahn: [C: 032] "nothing bad, just wanted to first use it to test changes to udpmxircecho, then do this" [puppet] - 10https://gerrit.wikimedia.org/r/287717 (owner: 10Dzahn) [20:15:03] (03PS2) 10Rush: change up alerting for services within tools in icinga [puppet] - 10https://gerrit.wikimedia.org/r/287723 [20:15:14] (03PS1) 10Catrope: Enable Flow opt-in beta feature on nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287730 (https://phabricator.wikimedia.org/T132693) [20:15:30] (03CR) 10Dzahn: [V: 032] Revert "remove argon from site.pp and install_server" [puppet] - 10https://gerrit.wikimedia.org/r/287717 (owner: 10Dzahn) [20:15:37] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [20:15:45] (03CR) 10Hashar: [C: 031] "On both CI and beta labs projects I have mass cleaned ganglia:" [puppet] - 10https://gerrit.wikimedia.org/r/287695 (https://phabricator.wikimedia.org/T134808) (owner: 10Hashar) [20:16:05] 06Operations, 10MediaWiki-ResourceLoader, 10Traffic: Image urls in CSS remain cached with old $wgResourceBasePath - https://phabricator.wikimedia.org/T134368#2277829 (10Krinkle) >>! In T134368#2277677, @BBlack wrote: > Ok. I can see that they're gone in a fresh browser install or incognito view. But I've d... 
[20:16:10] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Reinstall and data reload of WDQS servers - https://phabricator.wikimedia.org/T133566#2277833 (10Smalyshev) [20:16:17] 06Operations, 10Citoid, 10ContentTranslation-cxserver, 10RESTBase, and 4 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2216638 (10Krinkle) > restbase 600 IN DYNA geoip!text-addrs Despite it not being documented,... [20:17:20] (03PS1) 10Ottomata: Update Kafka analytics broker list for deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287741 [20:17:39] (03CR) 10jenkins-bot: [V: 04-1] change up alerting for services within tools in icinga [puppet] - 10https://gerrit.wikimedia.org/r/287723 (owner: 10Rush) [20:17:40] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 06Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#2277846 (10thcipriani) We discussed this ticket WRT to scap3 work briefly in the deployment working group meeting today (https://w... 
[20:18:31] (03PS1) 10Dzahn: remove duplicate base::firewall from kraz [puppet] - 10https://gerrit.wikimedia.org/r/287742 [20:18:37] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2277851 (10Ottomata) [20:18:53] (03CR) 10jenkins-bot: [V: 04-1] change up alerting for services within tools in icinga [puppet] - 10https://gerrit.wikimedia.org/r/287723 (owner: 10Rush) [20:19:07] (03PS2) 10Dzahn: remove duplicate base::firewall from kraz [puppet] - 10https://gerrit.wikimedia.org/r/287742 [20:19:21] 06Operations, 10Citoid, 10ContentTranslation-cxserver, 10RESTBase, and 4 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2277852 (10BBlack) If something important breaks, we can put that name back in. But in that case, we sh... [20:19:31] 07Puppet, 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Puppet fails on labs instances using apache::site puppet class - https://phabricator.wikimedia.org/T134808#2277778 (10hashar) [20:19:46] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#1881753 (10Ottomata) Upgraded the analytics Kafka cluster in deployment-prep today. Along the way I had to create an extra... [20:21:08] 06Operations, 10MediaWiki-ResourceLoader, 10Traffic: Image urls in CSS remain cached with old $wgResourceBasePath - https://phabricator.wikimedia.org/T134368#2277855 (10BBlack) So, how do we get rid of those stored broken URIs? It seems like a lot of clients will be stuck with them for a very long time... 
[20:21:12] 06Operations, 06Performance-Team, 07Availability: Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809#2277856 (10aaron) [20:21:28] (03PS3) 10Rush: change up alerting for services within tools in icinga [puppet] - 10https://gerrit.wikimedia.org/r/287723 [20:22:46] (03PS1) 10Hashar: hhvm: skip ganglia when it is unwanted [puppet] - 10https://gerrit.wikimedia.org/r/287743 (https://phabricator.wikimedia.org/T134808) [20:23:48] (03PS4) 10Rush: change up alerting for services within tools in icinga [puppet] - 10https://gerrit.wikimedia.org/r/287723 [20:24:07] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:24:41] (03CR) 10Hashar: [C: 031] "Cherry picked on beta cluster puppet master." [puppet] - 10https://gerrit.wikimedia.org/r/287743 (https://phabricator.wikimedia.org/T134808) (owner: 10Hashar) [20:25:56] RECOVERY - puppet last run on chromium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:26:12] (03CR) 10jenkins-bot: [V: 04-1] change up alerting for services within tools in icinga [puppet] - 10https://gerrit.wikimedia.org/r/287723 (owner: 10Rush) [20:26:18] !log restarting logstash server logstash1001.eqiad.wmnet (T110236)upgrade [20:26:19] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [20:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:27:56] (03CR) 10jenkins-bot: [V: 04-1] change up alerting for services within tools in icinga [puppet] - 10https://gerrit.wikimedia.org/r/287723 (owner: 10Rush) [20:28:28] RECOVERY - puppet last run on hydrogen is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [20:28:43] (03CR) 10Hashar: [C: 031] "Also cherry picked on integration-puppetmaster for CI." 
[puppet] - 10https://gerrit.wikimedia.org/r/266684 (owner: 10Hashar) [20:28:54] (03PS4) 10Hashar: sysfs: puppet always restarted the sysfsutils service [puppet] - 10https://gerrit.wikimedia.org/r/266684 [20:29:26] (03CR) 10Andrew Bogott: [C: 032] "Puppet compiler confirms that this is a no-op." [puppet] - 10https://gerrit.wikimedia.org/r/287236 (https://phabricator.wikimedia.org/T126758) (owner: 10Andrew Bogott) [20:29:31] !log restarting logstash server logstash100[26].eqiad.wmnet (T110236) [20:29:32] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [20:29:32] (03PS3) 10Andrew Bogott: Purge labs dns/ldap code [puppet] - 10https://gerrit.wikimedia.org/r/287236 (https://phabricator.wikimedia.org/T126758) [20:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:31:29] 07Puppet, 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Puppet fails on labs instances due to Ganglia (ex: using apache::site puppet class) - https://phabricator.wikimedia.org/T134808#2277894 (10hashar) [20:33:11] (03CR) 10Faidon Liambotis: [C: 031] graphite: export /var/lib/carbon via rsync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/287608 (owner: 10Filippo Giunchedi) [20:33:17] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:33:45] (03PS5) 10Rush: change up alerting for services within tools in icinga [puppet] - 10https://gerrit.wikimedia.org/r/287723 [20:34:54] (03PS6) 10Rush: change up alerting for services within tools in icinga [puppet] - 10https://gerrit.wikimedia.org/r/287723 [20:36:15] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2277910 (10hashar) [20:36:18] 07Puppet, 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: 
Puppet fails on labs instances due to Ganglia (ex: using apache::site puppet class) - https://phabricator.wikimedia.org/T134808#2277911 (10hashar) [20:36:25] !log starting mobileapps deployment [20:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:36:41] (03PS1) 10Faidon Liambotis: kafka: adjust README.md with better-looking hostnames [puppet] - 10https://gerrit.wikimedia.org/r/287747 [20:36:55] (03PS7) 10Rush: change up alerting for services within tools in icinga [puppet] - 10https://gerrit.wikimedia.org/r/287723 [20:37:07] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2192864 (10hashar) Puppet failures related to Ganglia would be due to T134808 which have fixed cherry picked on beta puppet master but do not cover every cases. [20:37:08] (03CR) 10Faidon Liambotis: [C: 032 V: 032] kafka: adjust README.md with better-looking hostnames [puppet] - 10https://gerrit.wikimedia.org/r/287747 (owner: 10Faidon Liambotis) [20:38:20] (03PS1) 10Andrew Bogott: Use --delete when rsyncinc images from the nova controller to the spare. [puppet] - 10https://gerrit.wikimedia.org/r/287748 [20:38:28] (03CR) 10Faidon Liambotis: "The kafka README thing wasn't pmtpa-related. Fixed with I1bff91099dc5d05403fdce1affc6cefa615c1d82. 
You would be right about the pmtpa host" [puppet] - 10https://gerrit.wikimedia.org/r/287614 (https://phabricator.wikimedia.org/T133047) (owner: 10Hashar) [20:38:45] (03PS2) 10Faidon Liambotis: typos: validate hostname number based on DC [puppet] - 10https://gerrit.wikimedia.org/r/287614 (https://phabricator.wikimedia.org/T133047) (owner: 10Hashar) [20:38:52] ;-} [20:39:05] 07Puppet, 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Puppet fails on labs instances due to Ganglia (ex: using apache::site puppet class) - https://phabricator.wikimedia.org/T134808#2277778 (10Dzahn) @hashar re: ganglia on labs, also see T115330. ganglia-moni... [20:39:29] paravoid: I thought about test coverage for /typos but then that would probably be overkill [20:39:29] hashar: ganglia-monitor should juust be killed on all instances [20:39:35] hashar: https://phabricator.wikimedia.org/T115330#2244694 [20:39:37] mutante: yeah that is what I have done [20:39:41] that is a list of IPs [20:39:45] but it is ensure => absent [20:39:49] (03CR) 10Smalyshev: [C: 031] Publishes non aggregated metrics for WDQS updater (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/287697 (owner: 10Gehel) [20:39:50] that had puppet or other trouble [20:39:59] (03PS2) 10Andrew Bogott: Use --delete when rsyncing images from the nova controller to the spare. [puppet] - 10https://gerrit.wikimedia.org/r/287748 [20:40:01] so a bunch of configuration files are left behind ( rc ganglia-monitor) such as /etc/ganglia [20:40:03] so it did not get auto-removed [20:40:11] ah, yes [20:40:15] so nothing prevented the various puppet manifests to add stuff under /etc/ganglia since it was kept behind [20:40:32] *nod*.. [20:40:33] (03PS3) 10Andrew Bogott: Use --delete when rsyncing images from the nova controller to the spare. [puppet] - 10https://gerrit.wikimedia.org/r/287748 [20:40:39] and on a couple Precise instances gmond was still running ... 
[20:40:56] and reporting to prod ganglia :p [20:41:02] as "misc" cluster [20:41:18] let me check how many are left [20:41:45] (03CR) 10Gehel: Publishes non aggregated metrics for WDQS updater (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/287697 (owner: 10Gehel) [20:41:49] (03CR) 10Rush: [C: 031] Use --delete when rsyncing images from the nova controller to the spare. [puppet] - 10https://gerrit.wikimedia.org/r/287748 (owner: 10Andrew Bogott) [20:41:51] 16 [20:42:13] greatly reduced though [20:42:30] (03PS4) 10Gehel: Publishes non aggregated metrics for WDQS updater [puppet] - 10https://gerrit.wikimedia.org/r/287697 [20:42:59] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#1721704 (10hashar) As part of T134808 on CI and beta cluster I have ran: ``` salt -v '*' cmd.run 'dpkg --purge libganglia1 ganglia-monitor; rm -fR /etc/ganglia' ``` Some... [20:43:06] mutante: I have summarized on your task ;} [20:43:21] I have purged it from both CI and beta [20:43:34] (03CR) 10Andrew Bogott: [C: 032] Use --delete when rsyncing images from the nova controller to the spare. [puppet] - 10https://gerrit.wikimedia.org/r/287748 (owner: 10Andrew Bogott) [20:43:37] and for the list of IPs, dig -x might give you the hostname [20:43:49] hashar: thank you:) the list of IPs is here at the top https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Miscellaneous%2520eqiad&tab=m&vn=&hide-hf=false [20:44:07] i will check later if they disappear [20:44:22] (03CR) 10Smalyshev: [C: 031] Publishes non aggregated metrics for WDQS updater [puppet] - 10https://gerrit.wikimedia.org/r/287697 (owner: 10Gehel) [20:44:25] oh [20:45:10] 06Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 3 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2277957 (10Krenair) Pretty sure I ran the site table update script. @hoo/@aude ?
[20:45:24] (03PS5) 10Gehel: Publishes non aggregated metrics for WDQS updater [puppet] - 10https://gerrit.wikimedia.org/r/287697 [20:45:49] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 3 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2277959 (10Krenair) [20:46:51] (03CR) 10Gehel: [C: 032] Publishes non aggregated metrics for WDQS updater [puppet] - 10https://gerrit.wikimedia.org/r/287697 (owner: 10Gehel) [20:48:29] mutante: integration-raita has a stale puppet. Going to fix it [20:48:49] (03CR) 10Faidon Liambotis: [C: 04-1] "A few inline comments. This isn't bad at all :)" (038 comments) [debs/druid] - 10https://gerrit.wikimedia.org/r/287285 (https://phabricator.wikimedia.org/T134503) (owner: 10Ottomata) [20:49:10] 06Operations, 10MediaWiki-ResourceLoader, 10Traffic: Image urls in CSS remain cached with old $wgResourceBasePath - https://phabricator.wikimedia.org/T134368#2277976 (10Krinkle) The proper solution is to roll out pure module-content based versioning to file modules (which covers jquery-ui). More specifically...
[20:49:11] hashar: :) thanks [20:49:47] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [20:51:11] ^looking [20:51:56] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [20:52:36] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 45 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 41, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 55, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 120, initializing_shards: 4, number_of_data_nodes: 3, [20:53:17] PROBLEM - ElasticSearch health check for shards on logstash1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 45 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 41, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 55, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 120, initializing_shards: 4, number_of_data_nodes: 3, [20:53:21] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service: ChangeProp / RESTBase / Parsoid outage 2016-05-05 - https://phabricator.wikimedia.org/T134537#2277990 (10Pchelolo) [20:53:27] PROBLEM - ElasticSearch health check for shards on logstash1005 is CRITICAL: CRITICAL - elasticsearch inactive shards 45 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 41, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 55, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 120, initializing_shards: 4, number_of_data_nodes: 3, [20:53:29] (03PS9) 10Hashar: ci: Role for running Raita [puppet] - 10https://gerrit.wikimedia.org/r/208024 (owner: 10Dduvall) [20:53:30] hashar: so the test succeeds 
now, but 1m is indeed very long [20:53:34] ^logstash issue is mine... sorry disabled the wrong alerts... [20:53:36] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 45 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 41, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 55, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 120, initializing_shards: 4, number_of_data_nodes: 3, [20:53:48] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 45 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 41, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 55, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 120, initializing_shards: 4, number_of_data_nodes: 3, [20:54:26] (03PS4) 10Dzahn: udpmxircecho: use a config file [puppet] - 10https://gerrit.wikimedia.org/r/287246 (owner: 10Alex Monk) [20:54:43] gehel: ^ issues w/ elastic? ebernhardson? [20:54:51] ^logstash issue is mine... sorry disabled the wrong alerts... [20:55:20] (03CR) 10Hashar: "Rebased and moved the role under /modules/role/" [puppet] - 10https://gerrit.wikimedia.org/r/208024 (owner: 10Dduvall) [20:55:27] restart in progress of the logstash cluster (T110236). Under control... 
[20:55:28] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [20:56:27] (03CR) 10Dzahn: [C: 032] udpmxircecho: use a config file [puppet] - 10https://gerrit.wikimedia.org/r/287246 (owner: 10Alex Monk) [20:56:45] (03PS5) 10Dzahn: udpmxircecho: Move from template to file [puppet] - 10https://gerrit.wikimedia.org/r/287247 (owner: 10Alex Monk) [20:56:54] (03CR) 10Dzahn: [C: 032] udpmxircecho: Move from template to file [puppet] - 10https://gerrit.wikimedia.org/r/287247 (owner: 10Alex Monk) [20:58:22] !log mobileapps deployed f206e94 [20:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:58:40] mutante: and integration-raita should no more emit :-} [20:59:00] hashar: great:) and thanks for moving roles to modules/role [21:00:32] oh I have just rebased a very outdated patch :D [21:01:14] (03CR) 10Dzahn: "tested on argon. ran puppet ... killed process. ran puppet.. joined the IRC server again just fine." [puppet] - 10https://gerrit.wikimedia.org/r/287246 (owner: 10Alex Monk) [21:04:45] !log clean out old snapshots on labstore1001 [21:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:07:30] sleep well! [21:13:04] (03PS1) 10Dzahn: Revert "Revert "remove argon from site.pp and install_server"" [puppet] - 10https://gerrit.wikimedia.org/r/287753 [21:14:07] (03CR) 10Dereckson: [C: 04-1] "bewiki isn't Bengali but Belarusian language." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/287681 (https://phabricator.wikimedia.org/T134768) (owner: 10Jdlrobson) [21:15:17] (03PS2) 10Jdlrobson: Enable lazy loaded images on bewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287681 (https://phabricator.wikimedia.org/T134768) [21:15:27] (03PS3) 10Jdlrobson: Enable lazy loaded images on bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287681 (https://phabricator.wikimedia.org/T134768) [21:15:39] (03CR) 10Jdlrobson: "Dereckson good catch :) Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287681 (https://phabricator.wikimedia.org/T134768) (owner: 10Jdlrobson) [21:15:57] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2278111 (10elukey) Just had a quick chat on #memcached with dormando, a lot of useful info: 1) with 1.4.25 the max slab class count dropped to 63 and since our... [21:16:33] (03PS2) 10Dzahn: Revert "Revert "remove argon from site.pp and install_server"" [puppet] - 10https://gerrit.wikimedia.org/r/287753 [21:16:55] (03CR) 10Dzahn: [C: 032] "tests are done. shutting it down now" [puppet] - 10https://gerrit.wikimedia.org/r/287753 (owner: 10Dzahn) [21:17:20] (03CR) 10Dereckson: "bn for Bengali, bg is for Bulgarian" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287681 (https://phabricator.wikimedia.org/T134768) (owner: 10Jdlrobson) [21:17:30] (03CR) 10Bmansurov: [C: 04-1] Enable lazy loaded images on bgwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287681 (https://phabricator.wikimedia.org/T134768) (owner: 10Jdlrobson) [21:17:43] lol [21:17:47] too many b's [21:17:56] now you know all your b? 
[21:18:23] (03PS4) 10Jdlrobson: Enable lazy loaded images on bengali wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287681 (https://phabricator.wikimedia.org/T134768) [21:18:32] Dereckson: :) [21:20:50] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/287614 (https://phabricator.wikimedia.org/T133047) (owner: 10Hashar) [21:22:34] (03PS3) 10Hashar: typos: validate hostname number based on DC [puppet] - 10https://gerrit.wikimedia.org/r/287614 (https://phabricator.wikimedia.org/T133047) [21:23:13] (03CR) 10Hashar: "I have updated the job ( https://gerrit.wikimedia.org/r/287755 ) to exclude the .git directory." [puppet] - 10https://gerrit.wikimedia.org/r/287614 (https://phabricator.wikimedia.org/T133047) (owner: 10Hashar) [21:25:08] (03CR) 10Hashar: "typos job is now slightly faster than operations-puppet-puppetlint-strict" [puppet] - 10https://gerrit.wikimedia.org/r/287614 (https://phabricator.wikimedia.org/T133047) (owner: 10Hashar) [21:25:56] (03PS3) 10Dzahn: remove duplicate base::firewall from kraz [puppet] - 10https://gerrit.wikimedia.org/r/287742 [21:27:31] (03CR) 10Dzahn: [C: 032] "no-op http://puppet-compiler.wmflabs.org/2708/" [puppet] - 10https://gerrit.wikimedia.org/r/287742 (owner: 10Dzahn) [21:27:47] (03PS4) 10Faidon Liambotis: typos: validate hostname number based on DC [puppet] - 10https://gerrit.wikimedia.org/r/287614 (https://phabricator.wikimedia.org/T133047) (owner: 10Hashar) [21:28:09] (03CR) 10Faidon Liambotis: [C: 032] typos: validate hostname number based on DC [puppet] - 10https://gerrit.wikimedia.org/r/287614 (https://phabricator.wikimedia.org/T133047) (owner: 10Hashar) [21:28:47] ostriches: https://phabricator.wikimedia.org/T134553#2278207 I have enough data, you can go ahead with the wmf23 deployment [21:29:09] Okie dokie, lemme wrap up a few things and then do it [21:36:16] 06Operations, 10Continuous-Integration-Config, 06Release-Engineering-Team: Write a test to check for clearly bogus hostnames - 
https://phabricator.wikimedia.org/T133047#2278260 (10hashar) operations/puppet.git has some harness now. For mediawiki-config either we duplicate the typo file or I can craft a job... [21:37:57] !log argon - revoke puppet cert, stop salt [21:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:39:58] (03PS1) 10Chad: Moving wikipedias back to wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287759 [21:41:12] 06Operations, 10Continuous-Integration-Config, 06Release-Engineering-Team: Write a test to check for clearly bogus hostnames - https://phabricator.wikimedia.org/T133047#2278296 (10demon) >>! In T133047#2278260, @hashar wrote: > operations/puppet.git has some harness now. > > For mediawiki-config either we d... [21:42:18] !log argon - scheduling eternal downtime, shut down | https://en.wiktionary.org/wiki/good_riddance#Etymology [21:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:46:50] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5997141 keys - replication_delay is 0 [21:48:10] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [21:48:32] 06Operations, 10hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2278322 (10Milimetric) I don't remember any specific advice from the Imply folks on this, so sure, 30G for / and the rest on RAID 10 sounds good to me. We'll probably archive old segments on H... [21:49:40] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2278337 (10Dzahn) [21:51:16] "An error has occurred while searching: Search is currently too busy. Please try again later." 
[21:51:20] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [21:52:25] (03PS1) 10Dzahn: remove argon's public IP [dns] - 10https://gerrit.wikimedia.org/r/287764 (https://phabricator.wikimedia.org/T134223) [21:52:54] 14:00 < gehel> restart in progress of the logstash cluster (T110236). Under control... [21:53:01] eh, wrong paste [21:53:23] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [21:53:47] i wanted: 09:39 < gehel> !log restarting elasticsearch server elastic1031.eqiad.wmnet (T110236), includes JDK upgrade [21:53:47] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [21:53:49] (03CR) 10Chad: [C: 032] Moving wikipedias back to wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287759 (owner: 10Chad) [21:53:59] gehel: [21:54:15] yurik, I see ZeroBanner stuff in fatalmonitor [21:54:16] (03Merged) 10jenkins-bot: Moving wikipedias back to wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287759 (owner: 10Chad) [21:55:32] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 3 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2252547 (10StevenJ81) FWIW, I couldn't add one this afternoon (EDT) either. [21:55:39] (03CR) 10Dzahn: [C: 032] remove argon's public IP [dns] - 10https://gerrit.wikimedia.org/r/287764 (https://phabricator.wikimedia.org/T134223) (owner: 10Dzahn) [21:56:09] (03CR) 10Dzahn: "removed from neon / icinga config" [dns] - 10https://gerrit.wikimedia.org/r/287764 (https://phabricator.wikimedia.org/T134223) (owner: 10Dzahn) [21:56:16] mutante: Not sure I follow... 
[21:56:48] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: pedias back to wmf.23 [21:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:56:58] gehel: odder is reporting "Search is currently too busy" error, and i was wondering if that could be due to the upgrade [21:57:43] mutante: I don't see why... but that does not mean it is not the case. Let me have a look at a few dashboards... [21:57:57] gehel: thanks! [21:58:16] gilles: And we're back on wmf.23 everywhere. [22:00:17] mutante: response time from elasticsearch have jumped around 21:39 UTC [22:00:47] quite a bit after my last action on the elasticsearch cluster, but it does not mean it is not related ... [22:01:47] gehel: last 1.23 [22:02:07] gehel: arg, i cant type :) what i meant to say was [22:02:18] I wonder if it is related to the MW deploy [22:02:25] mutante: 21:39 UTC is the time the last 1.23 was deployed? [22:02:27] or rather, the revert [22:03:32] hmm, indeed we are suddenly failing a bunch more queries [22:03:38] strange... ebernhardson you have time to have a look with me? Elasticsearch does not look good... [22:03:45] gehel: yea i'll look around [22:03:48] ....Nothing changed from wmf.23 of last week..... [22:03:51] ebernhardson: thanks! [22:03:57] Other than the NavTiming fix we backported. [22:04:21] MaxSem, thx, will look [22:06:26] 06Operations, 10ops-eqiad: decom argon (datacenter) - https://phabricator.wikimedia.org/T134826#2278438 (10Dzahn) [22:07:29] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2278456 (10Dzahn) [22:07:31] 06Operations, 13Patch-For-Review: decom argon - https://phabricator.wikimedia.org/T134223#2258481 (10Dzahn) 05Open>03Resolved done from puppet and DNS point of view. 
hardware decom in subtask [22:09:05] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:11:50] (03CR) 10Ppchelko: [C: 031] Change Prop: Tell RESTBase not to respond with redirects [puppet] - 10https://gerrit.wikimedia.org/r/287080 (https://phabricator.wikimedia.org/T134483) (owner: 10Mobrovac) [22:12:17] mutante, ebernhardson: elastic1016 has a system load much, much higher than the rest... [22:12:26] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [22:13:25] (03CR) 10Ppchelko: Change prop: Add the rule for MobileApps re-renders (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/286847 (owner: 10Mobrovac) [22:14:11] !log restarting elasticsearch on elastic1026 (high load) [22:14:15] gehel: i see. should we restart the service on that one? heh [22:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:14:43] gehel: thanks, i've been poking graphs and machines but not coming up with any good reasons :S [22:14:43] mutante: I'm on it :P [22:15:03] I took a few threaddumps before restart so I might have something to analyze... [22:16:43] system load is going down on elastic1026, not yet sure that it solves anything... [22:17:29] * odder reports search working again [22:18:12] so now the question is: "what the hell happened"... [22:18:21] seems 1026 was also throwing a bunch of odd messages into logstash about worker shut down [22:19:56] ebernhardson: isn't that correlated with the restart? [22:20:08] 06Operations, 10ops-eqiad: decom argon (datacenter) - https://phabricator.wikimedia.org/T134826#2278467 (10Dzahn) a:05Dzahn>03None [22:20:37] gehel: 14:43? 
[22:20:47] err, 21:43 UTC [22:21:01] gehel: https://logstash.wikimedia.org/#dashboard/temp/AVSXm-iiDxp7yus2w3ax [22:21:22] 06Operations, 10ops-eqiad: decom argon (datacenter) - https://phabricator.wikimedia.org/T134826#2278438 (10Dzahn) eqiad, row B, B4, @1 purchase date 2011-01-27 [22:21:25] had a GC, then it started spewing errors about unable to reduce search. very odd :S [22:21:35] RECOVERY - ElasticSearch health check for shards on logstash1004 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 12, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 56, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 152, initializing_shards: 4, number_of_data_nodes: 3, delayed_unassigned_ [22:22:08] ebernhardson: I see those worker shutdown only around 22:14 UTC... [22:22:16] RECOVERY - ElasticSearch health check for shards on logstash1005 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 12, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 56, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 152, initializing_shards: 4, number_of_data_nodes: 3, delayed_unassigned_ [22:22:26] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 12, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 56, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 152, initializing_shards: 4, number_of_data_nodes: 3, delayed_unassigned_ [22:22:32] ebernhardson: but I see a long GC on elastic1026 at 21:43... 
[22:22:36] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 12, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 56, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 152, initializing_shards: 4, number_of_data_nodes: 3, delayed_unassigned_ [22:22:45] gehel: oh, the unable to reduce search was it hitting the queue limit. seems right after the long gc it started spewing those [22:23:05] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 12, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 56, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 152, initializing_shards: 4, number_of_data_nodes: 3, delayed_unassigned_ [22:23:35] and logstash recovers just after we restart the search cluster? Now I'm really worried... [22:25:56] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:26:13] gehel: most of the percentiles seem to be recovering, query failures look to have stopped at 22:14 to match your restart [22:26:40] ebernhardson: yep, it seems that it is recovering... [22:27:12] but the question is how that server went bad :S [22:27:37] gehel: you are probably safe to call it a night. I'll start a ticket and ping dcausse too [22:27:51] ebernhardson: Thanks! I'll get some sleep... [22:28:53] ebernhardson: looking at the GC graphs, it seems that it is the young generation that took time. That's probably an indication of a very high memory throughput, more than a very high memory utilization. We should have a look at the GC logs (if we keep those) [22:29:35] * gehel is going to get some sleep. For real! 
[22:29:44] thanks gehel [22:29:53] mutante: np! [22:31:30] * ebernhardson also wonders why the icinga trigger on prefix search p50 > 150ms didn't trigger ... [22:32:48] probably because it relies on getting data from graphite and something there fails when reading it [22:33:54] mutante: doh, yes. The trigger hasn't been updated for some stats that moved ... [22:36:21] That's the question I was going to ask: why didn't we get an alert... [22:36:54] * gehel is going to sleep, promise... [22:41:00] twentyafterfour: too bad this isn't implemented https://secure.phabricator.com/T8092 :/ [22:41:15] twentyafterfour: do you perhaps know of any estimate for that one? [22:41:34] twentyafterfour: I was also reading https://secure.phabricator.com/T10691#167705 [22:43:16] paravoid: looking [22:44:03] paravoid: the virtualized git refs stuff is mostly viable, in fact we can get it to work by generating a revision from a branch [22:45:00] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 3 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2278779 (10hoo) >>! In T134017#2277957, @Krenair wrote: > Pretty sure I ran the site table update script. @hoo/@aude ? On all Wikidata clients... [22:45:08] e.g. you would push a new branch named review/your_branch_name and it would automatically generate a new differential revision from that (there is code to do this already, just need to set up an automation job to do it) [22:46:17] are you talking about the post-receive hook? [22:46:22] that sounded pretty ugly [22:46:54] no not a hook, a polling job that watches for new refs and runs async, separate from the git service [22:47:40] paravoid: I'm talking about https://phabricator.wikimedia.org/T132863 [22:48:26] ah! [22:48:30] hadn't seen that one [22:48:50] I wonder how access controls would work, though [22:49:47] well, in what sense? controlling what can merge?
Phabricator actually has access controls for automated merges already, using herald rules and projects for access groups [22:50:09] I don't know if I would have arcyd do the merges, in our setup we'd probably do that a little differently than bloomberg does it [22:56:32] still, the native phab stuff sound way better :) [22:56:35] way more gerrit-like [23:00:04] RoanKattouw ostriches Krenair Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160509T2300). [23:00:04] yurik jdlrobson RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:13] yep [23:00:18] here [23:00:26] although only here till 4.30pm [23:00:36] hoping that can be accommodated! :) [23:00:49] I can do it [23:00:51] Let's go [23:01:20] I'm here [23:01:48] jdlrobson: I'll do your 2 first. [23:02:01] (03PS5) 10Chad: Enable lazy loaded images on bengali wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287681 (https://phabricator.wikimedia.org/T134768) (owner: 10Jdlrobson) [23:02:09] (03CR) 10Chad: [C: 032] Enable lazy loaded images on bengali wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287681 (https://phabricator.wikimedia.org/T134768) (owner: 10Jdlrobson) [23:02:39] (03Merged) 10jenkins-bot: Enable lazy loaded images on bengali wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287681 (https://phabricator.wikimedia.org/T134768) (owner: 10Jdlrobson) [23:04:41] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: lazy load images in MF for bnwiki (duration: 00m 26s) [23:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:05:10] jdlrobson: First one ^^^ [23:05:18] ostriches: on it! 
[23:05:49] 06Operations, 10DNS, 10Traffic, 10Wiki-Loves-Monuments-General, and 2 others: point wikilovesmonument.org ns to wmf - https://phabricator.wikimedia.org/T118468#1801065 (10Platonides) wikilovesmonument**s**.org is a bit special for being the one for the international team, but note that for many domains (eg... [23:05:51] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [23:06:00] (03PS4) 10Chad: Enable $wgMFStripResponsiveImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286505 (https://phabricator.wikimedia.org/T134115) (owner: 10Brion VIBBER) [23:06:20] 06Operations, 10DNS, 10Traffic, 10Wiki-Loves-Monuments-General, and 2 others: point wikilovesmonuments.org ns to wmf - https://phabricator.wikimedia.org/T118468#2278838 (10Platonides) [23:06:45] ostriches: looks good to me! [23:06:45] (03CR) 10Chad: [C: 032] Enable $wgMFStripResponsiveImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286505 (https://phabricator.wikimedia.org/T134115) (owner: 10Brion VIBBER) [23:06:51] Ok awesome, now the 2nd. [23:07:38] (03Merged) 10jenkins-bot: Enable $wgMFStripResponsiveImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286505 (https://phabricator.wikimedia.org/T134115) (owner: 10Brion VIBBER) [23:08:23] !log demon@tin Synchronized wmf-config/mobile.php: Enable $wgMFStripResponsiveImages (duration: 00m 27s) [23:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:08:33] jdlrobson: And that now ^^ [23:09:09] ostriches: looks good! [23:09:10] thanks! 
[23:09:13] (03PS3) 10Chad: Removed obsolete graph ext settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287299 (owner: 10Yurik) [23:09:13] <3 [23:09:17] yw :) [23:09:18] (03CR) 10Platonides: [C: 031] added SPF record to phabricator.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/280644 (https://phabricator.wikimedia.org/T116806) (owner: 10Mschon) [23:09:34] yurik: You're next [23:10:06] (03CR) 10Chad: [C: 032] Removed obsolete graph ext settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287299 (owner: 10Yurik) [23:10:45] (03Merged) 10jenkins-bot: Removed obsolete graph ext settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287299 (owner: 10Yurik) [23:10:56] ostriches, all here :) [23:11:11] mutante, you killed argon without removing the config reference to it? [23:11:30] in ProductionServices.php [23:12:02] (03CR) 10Aaron Schulz: [C: 04-1] Configure 'testwiki' as foreign file repo for 'test2wiki', allow cross-wiki uploads (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285708 (https://phabricator.wikimedia.org/T133305) (owner: 10Bartosz Dziewoński) [23:12:25] 06Operations, 10DNS, 10Mail, 10Phabricator, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#1758902 (10Platonides) @scfc in that case, your server should not reject based on SPF for accounts that you forwarded there. There are many ways to treat SPF, the... [23:12:41] !log demon@tin Synchronized wmf-config/: Obsolete graph settings (duration: 00m 29s) [23:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:12:58] yurik: And done ^ [23:13:07] (03PS2) 10Chad: Enable Flow opt-in beta feature on nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287730 (https://phabricator.wikimedia.org/T132693) (owner: 10Catrope) [23:13:15] RoanKattouw: Last and not least.... 
[23:13:31] (03CR) 10Chad: [C: 032] Enable Flow opt-in beta feature on nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287730 (https://phabricator.wikimedia.org/T132693) (owner: 10Catrope) [23:13:33] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 3 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2278863 (10Krenair) I think I ran the script on aawiki with `--site-group wikipedia --force-protocol https` [23:13:51] ostriches, seems graphs are still there, so should be ok :) [23:14:36] (03Merged) 10jenkins-bot: Enable Flow opt-in beta feature on nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287730 (https://phabricator.wikimedia.org/T132693) (owner: 10Catrope) [23:15:34] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: nowiki flow beta (duration: 00m 26s) [23:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:15:49] RoanKattouw: ^^^^ [23:15:56] Yup, looking [23:16:05] (03PS1) 10Alex Monk: Cleanup IRC switchover from argon to kraz [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287797 [23:16:31] ostriches: Looks great, thanks [23:16:46] :D [23:17:04] Yay, swat's over. Thanks for playing! [23:17:18] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 3 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2278890 (10hoo) >>! In T134017#2278863, @Krenair wrote: > I think I ran the script on aawiki with `--site-group wikipedia --force-protocol https... 
[23:17:31] (03PS2) 10Aaron Schulz: Configure 'testwiki' as foreign file repo for 'test2wiki', allow cross-wiki uploads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285708 (https://phabricator.wikimedia.org/T133305) (owner: 10Bartosz Dziewoński) [23:21:30] (03PS1) 10Brion VIBBER: Remove old config hack that disabled $wgResponsiveImages on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287818 [23:25:41] hmm, did etherpad just crash ? [23:26:50] 06Operations, 03Discovery-Search-Sprint: Followup on elastic1026 blowing up May 9, 21:43-22:14 UTC - https://phabricator.wikimedia.org/T134829#2278919 (10EBernhardson) [23:27:38] 06Operations, 03Discovery-Search-Sprint: Followup on elastic1026 blowing up May 9, 21:43-22:14 UTC - https://phabricator.wikimedia.org/T134829#2278932 (10EBernhardson) [23:29:03] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 3 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2278938 (10Krenair) I don't think addWiki runs it across all wikis. I wouldn't be comfortable with it doing that anyway given the potential for... [23:29:41] thedj, WFM [23:30:06] weird https://etherpad.wikimedia.org/p/CREDIT doesn't open for me [23:30:23] do others though? [23:30:27] opens for me [23:30:29] nope [23:30:35] great... [23:30:41] switch browser, starts working.. 
[23:32:13] (03PS3) 10Aaron Schulz: Configure 'testwiki' as foreign file repo for 'test2wiki', allow cross-wiki uploads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285708 (https://phabricator.wikimedia.org/T133305) (owner: 10Bartosz Dziewoński) [23:32:45] 06Operations, 10ops-ulsfo: ulsfo planned maintenance on 2016-05-11 - https://phabricator.wikimedia.org/T134831#2278956 (10RobH) [23:33:31] thedj: i bet you the browser was chrome [23:34:06] (03PS4) 10Aaron Schulz: Configure 'testwiki' as foreign file repo for 'test2wiki', allow cross-wiki uploads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285708 (https://phabricator.wikimedia.org/T133305) (owner: 10Bartosz Dziewoński) [23:37:12] MatmaRex: ok, testwiki is added first to the array now [23:37:18] (before commons) [23:37:21] AaronSchulz: eurgh, that's a lot of configuration, i didn't expect that this'll be so messy, thank you [23:37:36] (03CR) 10Dzahn: [C: 031] Cleanup IRC switchover from argon to kraz [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287797 (owner: 10Alex Monk) [23:37:42] lots of it just repeated from other stuff...maybe more templating can be used [23:38:59] AaronSchulz: i'll schedule that for a SWAT tomorrow if that's okay with you? or do you want to be around when it's deployed / deploy it yourself? 
[23:39:22] tomorrow is fine (in the evening) [23:39:25] (03CR) 10Dzahn: "rm self, +1 is still valid" [puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) (owner: 10ArielGlenn) [23:40:40] (03CR) 10Dzahn: "rm self, +1 still valid" [puppet] - 10https://gerrit.wikimedia.org/r/219372 (owner: 10ArielGlenn) [23:43:53] (03PS3) 10Dzahn: piwik: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285309 [23:45:23] 06Operations, 10DBA, 07Performance, 07RfC, 05codfw-rollout: [RFC] improve parsercache replication and sharding handling - https://phabricator.wikimedia.org/T133523#2279017 (10aaron) Read/write exceptions are already supposed to be caught in handleReadError()/handleWriteError(). Are there backtraces of ex... [23:47:43] (03CR) 10Dzahn: "not used in labs (https://tools.wmflabs.org/watroles/role/role::piwik)" [puppet] - 10https://gerrit.wikimedia.org/r/285309 (owner: 10Dzahn) [23:50:17] (03PS4) 10Dzahn: piwik: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285309 [23:52:10] (03CR) 10Dzahn: [C: 032] "no-op (except motd) http://puppet-compiler.wmflabs.org/2710/bohrium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/285309 (owner: 10Dzahn) [23:57:07] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 03Scap3: setup automatic deletion of old l10nupdate - https://phabricator.wikimedia.org/T130317#2132321 (10greg) >>! In T130317#2263058, @ori wrote: > @mmodell, blocking this on porting l10nupdate to scap doesn't seem reasonable. Could you sim... [23:58:18] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 03Scap3: setup automatic deletion of old l10nupdate - https://phabricator.wikimedia.org/T130317#2279063 (10mmodell) @greg: sure @ori: sorry I missed this before. 
[23:59:36] 06Operations, 10Ops-Access-Requests: Allow mobrovac to run puppet on SC(A|B) - https://phabricator.wikimedia.org/T134251#2259756 (10Dzahn) There is no single admin group "sca" or "scb". These are collections of services and admin groups. sca means: - cxserver-admin - zotero-admin - apertium-admins scb mean...