[00:00:07] * subbu reads backlog
[00:01:02] * subbu sees that all is well again in the graphs
[00:01:11] (CR) Dzahn: [C: 2] dbtree: make compatible with Apache 2.4 [puppet] - https://gerrit.wikimedia.org/r/230693 (owner: Dzahn)
[00:02:18] !log Resumed convertLqtPageOnLocalWiki.php run on MediaWiki.org's Project:Support_desk.
[00:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:03:06] matt_flaschen: ooooh, are you converting LQT to Flow?
[00:03:43] (CR) Dzahn: "Notice: /Stage[main]/Apache/Service[apache2]: Triggered 'refresh' from 1 events" [puppet] - https://gerrit.wikimedia.org/r/230693 (owner: Dzahn)
[00:04:03] (PS3) Dzahn: kibana: make compatible with Apache 2.4 [puppet] - https://gerrit.wikimedia.org/r/230692
[00:04:37] ori, yep, this is the last page on MediaWiki.org (except one trivial one I'll do next). Support desk is 2/3 done, but it had a memory leak (which was only caught on the support desk since it's huge), so I had to kill it. But I fixed the memory leak, so restarting.
[00:05:07] matt_flaschen: \o/
[00:06:10] !log ori@tin Synchronized php-1.26wmf18/extensions/Echo: Updated mediawiki/core Project: mediawiki/extensions/Echo 3ab0b7e0f4948 (duration: 00m 12s)
[00:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:06:51] !log ori@tin Synchronized php-1.26wmf17/extensions/Echo: Updated mediawiki/core Project: mediawiki/extensions/Echo 32e5bcf90c702 (duration: 00m 13s)
[00:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:08:51] (CR) Dzahn: "already runs on 2.4 but did not break because of: access_compat.load -> ../mods-available/access_compat.load" [puppet] - https://gerrit.wikimedia.org/r/230692 (owner: Dzahn)
[00:09:04] oh, gitblit is down again?
[00:09:48] !log restarted gitblit
[00:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:10:00] subbu: hopefully back soon as usual
[00:10:11] mutante, thanks.
[00:10:18] yw
[00:12:36] (PS4) Dzahn: kibana: make compatible with Apache 2.4 [puppet] - https://gerrit.wikimedia.org/r/230692
[00:14:17] subbu: wfm now
[00:15:05] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61429 bytes in 0.247 second response time
[00:15:25] mutante, ok. rechecking the affected patch. should go through i think.
[00:17:38] (PS5) Dzahn: [WIP] Update apache rules for 2.4 [puppet] - https://gerrit.wikimedia.org/r/225552 (owner: Gergő Tisza)
[00:18:03] (CR) Dzahn: "rebased on top of what is already done" [puppet] - https://gerrit.wikimedia.org/r/225552 (owner: Gergő Tisza)
[00:18:29] (PS6) Dzahn: [WIP] Update apache rules for 2.4 [puppet] - https://gerrit.wikimedia.org/r/225552 (owner: Gergő Tisza)
[00:20:15] (CR) Dzahn: [C: 1] Kill ee-prototype.wikipedia.beta.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/230854 (https://phabricator.wikimedia.org/T107397) (owner: Alex Monk)
[00:20:45] PROBLEM - puppet last run on mw2134 is CRITICAL puppet fail
[00:21:09] mutante, what will it take to get someone to +2 that? :/
[00:22:05] !log updated kartotherian
[00:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:22:15] (CR) Dzahn: "adding Giuseppe to confirm or deny the hiera question" [puppet] - https://gerrit.wikimedia.org/r/230549 (https://phabricator.wikimedia.org/T108610) (owner: Yurik)
[00:23:33] (CR) Yurik: "Please check with akosiaris - he was doing something with passwords" [puppet] - https://gerrit.wikimedia.org/r/230549 (https://phabricator.wikimedia.org/T108610) (owner: Yurik)
[00:25:16] (CR) Dzahn: Add ferm rules for Logstash/Elasticsearch (2 comments) [puppet] - https://gerrit.wikimedia.org/r/227960 (https://phabricator.wikimedia.org/T104964) (owner: Muehlenhoff)
[00:29:43] (PS3) Dzahn: Add ferm rules for Logstash/Elasticsearch [puppet] - https://gerrit.wikimedia.org/r/227960 (https://phabricator.wikimedia.org/T104964) (owner: Muehlenhoff)
[00:30:20] (CR) Dzahn: "PS2: did what Alex commented on" [puppet] - https://gerrit.wikimedia.org/r/227960 (https://phabricator.wikimedia.org/T104964) (owner: Muehlenhoff)
[00:33:58] (CR) Dzahn: "i'll leave it to Chase and Yuvipanda" [puppet] - https://gerrit.wikimedia.org/r/227466 (https://phabricator.wikimedia.org/T101235) (owner: WMDE-leszek)
[00:34:33] (PS2) Dzahn: fermium: override role default IPs [puppet] - https://gerrit.wikimedia.org/r/230240 (https://phabricator.wikimedia.org/T108080) (owner: John F. Lewis)
[00:34:39] (CR) jenkins-bot: [V: -1] fermium: override role default IPs [puppet] - https://gerrit.wikimedia.org/r/230240 (https://phabricator.wikimedia.org/T108080) (owner: John F. Lewis)
[00:34:50] (CR) Dzahn: [C: -2] fermium: override role default IPs [puppet] - https://gerrit.wikimedia.org/r/230240 (https://phabricator.wikimedia.org/T108080) (owner: John F. Lewis)
[00:42:14] PROBLEM - puppet last run on cp3004 is CRITICAL puppet fail
[00:42:41] (PS1) Mattflaschen: Change login cookies (for 'Remember me') to a one year expiry. [mediawiki-config] - https://gerrit.wikimedia.org/r/230954 (https://phabricator.wikimedia.org/T68699)
[00:46:58] (CR) Mattflaschen: [C: -2] "-2 until:" [mediawiki-config] - https://gerrit.wikimedia.org/r/230954 (https://phabricator.wikimedia.org/T68699) (owner: Mattflaschen)
[00:48:03] (PS1) Dzahn: elasticsearch: add cluster hosts to hiera [puppet] - https://gerrit.wikimedia.org/r/230955 (https://phabricator.wikimedia.org/T104962)
[00:49:14] (CR) Dzahn: "need to add them first like so afaict:" [puppet] - https://gerrit.wikimedia.org/r/224095 (https://phabricator.wikimedia.org/T104962) (owner: Muehlenhoff)
[00:49:19] Blocked-on-Operations, MediaWiki-extensions-CentralAuth, Wikimedia-General-or-Unknown, MW-1.26-release, and 2 others: Increase "remember me" login cookie expiry from 30 days to 1 year on Wikimedia wikis - https://phabricator.wikimedia.org/T68699#1530184 (Mattflaschen) Patch is up. Should be merged...
[00:50:43] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[00:51:53] (PS3) Dzahn: ferm rules for elasticsearch [puppet] - https://gerrit.wikimedia.org/r/224095 (https://phabricator.wikimedia.org/T104962) (owner: Muehlenhoff)
[00:54:45] (PS1) Faidon Liambotis: Add msw1-codfw and mr1-codfw to monitoring tools [puppet] - https://gerrit.wikimedia.org/r/230956
[00:56:00] (CR) Faidon Liambotis: [C: 2] Add msw1-codfw and mr1-codfw to monitoring tools [puppet] - https://gerrit.wikimedia.org/r/230956 (owner: Faidon Liambotis)
[00:59:04] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[01:09:14] RECOVERY - puppet last run on cp3004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:12:09] (PS1) Dzahn: varnish-misc: remove backend zirconium [puppet] - https://gerrit.wikimedia.org/r/230959 (https://phabricator.wikimedia.org/T105510)
[01:14:38] (CR) Dzahn: [C: 2] "not used anymore" [puppet] - https://gerrit.wikimedia.org/r/230959 (https://phabricator.wikimedia.org/T105510) (owner: Dzahn)
[01:16:13] (CR) Dzahn: "scheduled downtime, removed varnish backend entirely" [dns] - https://gerrit.wikimedia.org/r/230830 (https://phabricator.wikimedia.org/T105510) (owner: Dzahn)
[01:18:34] operations, Patch-For-Review: reclaim zirconium - https://phabricator.wikimedia.org/T105510#1530258 (Dzahn) also need to update https://wikitech.wikimedia.org/wiki/Zirconium and new page for the hosts that now have include those roles
[01:18:44] RECOVERY - puppet last run on mw2134 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:22:51] !log zirconium - shut down, i'm sure, mollyguard
[01:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:25:41] (CR) Dzahn: "there is still an entire wiki here, doesn't that need more shutdown steps first if it is to be killed?" [puppet] - https://gerrit.wikimedia.org/r/230854 (https://phabricator.wikimedia.org/T107397) (owner: Alex Monk)
[01:27:06] (CR) Alex Monk: "Surely the apache config needs to go first, then the mediawiki config, then the CA entries? Am I missing any steps?" [puppet] - https://gerrit.wikimedia.org/r/230854 (https://phabricator.wikimedia.org/T107397) (owner: Alex Monk)
[01:28:53] (CR) Dzahn: "probably last the DNS entries somehow within labs" [puppet] - https://gerrit.wikimedia.org/r/230854 (https://phabricator.wikimedia.org/T107397) (owner: Alex Monk)
[01:29:05] (PS2) Dzahn: Kill ee-prototype.wikipedia.beta.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/230854 (https://phabricator.wikimedia.org/T107397) (owner: Alex Monk)
[01:30:08] (CR) Dzahn: [C: 2] "per hashar/catrope/greg on ticket" [puppet] - https://gerrit.wikimedia.org/r/230854 (https://phabricator.wikimedia.org/T107397) (owner: Alex Monk)
[01:30:13] operations, Analytics-Backlog, Analytics-EventLogging, Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1530270 (Tgr) Would it be difficult to change that? (Also, is it documented somewhere what those t...
[01:32:59] PROBLEM - Router interfaces on mr1-codfw is CRITICAL host 208.80.153.196, interfaces up: 36, down: 6, dormant: 0, excluded: 0, unused: 0BRge-0/0/3: down - BRge-0/0/6: down - BRge-0/0/5: down - BRge-0/0/4: down - BRge-0/0/7: down - BRge-0/0/1: down - BR
[01:55:58] PROBLEM - puppet last run on netmon1001 is CRITICAL puppet fail
[01:57:48] Blocked-on-Operations, Beta-Cluster, Collaboration-Team-Backlog, Patch-For-Review: Decide what to do with ee_prototypewiki in beta - https://phabricator.wikimedia.org/T107397#1530342 (Dzahn) Daniel Zahn: there is still an entire wiki here, doesn't that need more shutdown steps first if it is to be k...
[01:58:00] RECOVERY - puppet last run on netmon1001 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures
[02:00:34] (Abandoned) Dzahn: enable firewalling on tin [puppet] - https://gerrit.wikimedia.org/r/229151 (owner: Dzahn)
[02:00:47] (PS3) Dzahn: librenms - enable LDAP auth (WIP) [puppet] - https://gerrit.wikimedia.org/r/229299 (https://phabricator.wikimedia.org/T107702)
[02:01:45] (CR) jenkins-bot: [V: -1] librenms - enable LDAP auth (WIP) [puppet] - https://gerrit.wikimedia.org/r/229299 (https://phabricator.wikimedia.org/T107702) (owner: Dzahn)
[02:01:53] operations, Patch-For-Review: reclaim zirconium - https://phabricator.wikimedia.org/T105510#1530350 (Dzahn) 18:24 < mutante> !log zirconium - shut down, i'm sure, mollyguard updated wikitech, removed from icinga
[02:04:25] (PS4) Dzahn: librenms - enable LDAP auth (WIP) [puppet] - https://gerrit.wikimedia.org/r/229299 (https://phabricator.wikimedia.org/T107702)
[02:05:48] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-tools/snapshot is not accessible: Permission denied
[02:06:03] (CR) Dzahn: "untested" [puppet] - https://gerrit.wikimedia.org/r/229299 (https://phabricator.wikimedia.org/T107702) (owner: Dzahn)
[02:12:22] (CR) Dzahn: "Andrew, Filippo: done, added to eqiad/codfw instead. but are there really other openstack hosts there?" [puppet] - https://gerrit.wikimedia.org/r/201880 (owner: Dzahn)
[02:18:10] mutante, in the spirit of https://gerrit.wikimedia.org/r/#/c/223458/5 - is there a way to keep hieradata\/hosts\/(tin|mira)\.yaml in the same place?
[02:18:25] at least the admin::groups part?
[02:18:31] Krenair: we can use regex.yaml
[02:18:35] heh
[02:19:46] mutante, okay, but seriously?
[02:20:01] yes
[02:20:08] oh wait, you weren't joking
[02:20:11] that's actually a file
[02:20:15] yes:)
[02:20:18] :D
[02:20:32] we should do something like:
[02:20:35] deployment-servers:
[02:20:39] __regex:
[02:20:44] admin-groups:..
[02:20:58] and probably same for bastions
[02:21:11] not hieradata/role/common/deployment-server.yaml or something?
[02:21:49] that should work too, but only if they use the special "role" keyword to include the role
[02:21:56] bastions are too inconsistent at the moment
[02:22:04] i know, i have pending patches for that
[02:22:28] bast1001 allows everyone, hooft allows ops+restricted+deployers, bast[24]001 allows ops only AFAICT
[02:22:42] please review:
[02:22:44] https://gerrit.wikimedia.org/r/#/c/222519/
[02:22:49] https://gerrit.wikimedia.org/r/#/c/222522/
[02:23:33] and iron doesn't count
[02:23:57] what about bast4001 in ulsfo? still inconsistent...
[02:24:22] will need another patch
[02:24:31] maybe we dont even need iron anymore
[02:24:38] since we stopped allowing agent forwarding
[02:24:48] same with bastion-restricted
[02:25:02] Pretty sure iron is trusted in a bunch of interesting places
[02:25:19] mysql grants for one
[02:25:28] likely firewall rules as well
[02:25:52] firewall rules come from role/bastion as it should be
[02:25:59] and maybe mgmt? I have no idea about that stuff
[02:26:01] well, there is just one, open 22
[02:26:30] eh, and the defaults that come from standard
[02:26:40] !log l10nupdate@tin Synchronized php-1.26wmf17/cache/l10n: l10nupdate for 1.26wmf17 (duration: 06m 52s)
[02:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:26:59] even mysql grants should be in puppet repo nowadays
[02:27:02] i saw some
[02:27:06] haha
[02:27:24] keyword is very probably 'should'
[02:27:30] templates/mariadb/production-grants.sql.erb:
[02:30:08] (PS2) Dzahn: openstack firewall: get designate host from hiera [puppet] - https://gerrit.wikimedia.org/r/201880
[02:30:14] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf17) at 2015-08-12 02:30:14+00:00
[02:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:30:48] (CR) Dzahn: "better now?" [puppet] - https://gerrit.wikimedia.org/r/201880 (owner: Dzahn)
[02:30:50] (CR) jenkins-bot: [V: -1] openstack firewall: get designate host from hiera [puppet] - https://gerrit.wikimedia.org/r/201880 (owner: Dzahn)
[02:32:17] (PS3) Dzahn: openstack firewall: get designate host from hiera [puppet] - https://gerrit.wikimedia.org/r/201880
[02:37:01] (PS1) Dzahn: deployment servers: use role keyword for role [puppet] - https://gerrit.wikimedia.org/r/230965
[02:37:58] Anyone know if it's possible to get xhprof memory usage data for a batch job?
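The regex.yaml exchange above (02:18 to 02:21) describes a named block with a `__regex` key whose settings apply to every host the pattern matches, instead of one hieradata/hosts file per machine. A minimal sketch of that lookup idea follows. It is an illustration only, not the actual Hiera backend: the `lookup` function, the `example-deployers` value, and the exact pattern are assumptions made up for this example.

```python
import re

# Hypothetical data mirroring the shape sketched in the chat: a named block,
# a "__regex" key, then the settings shared by every matching host.
# The group value is a placeholder, not the real admin group list.
REGEX_DATA = {
    "deployment-servers": {
        "__regex": r"^(tin|mira)\.",
        "admin::groups": ["example-deployers"],
    },
}


def lookup(hostname, key, data=REGEX_DATA):
    """Return the value of `key` from the first block whose __regex matches."""
    for block in data.values():
        if re.search(block["__regex"], hostname) and key in block:
            return block[key]
    return None  # no regex block applies to this host


if __name__ == "__main__":
    for host in ("tin.eqiad.wmnet", "mira.codfw.wmnet", "bast1001.wikimedia.org"):
        print(host, lookup(host, "admin::groups"))
```

One block covers both deployment hosts, which is the point Krenair was after; a host that matches no pattern simply falls through to whatever other data sources exist.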
[02:41:08] (PS1) Dzahn: deployment: include admin roles in role, not nodes [puppet] - https://gerrit.wikimedia.org/r/230966
[02:41:28] (PS2) Dzahn: deployment: include admin groups in role, not nodes [puppet] - https://gerrit.wikimedia.org/r/230966
[02:43:23] (PS2) Dzahn: deployment servers: use role keyword for role [puppet] - https://gerrit.wikimedia.org/r/230965
[02:43:48] (PS3) Dzahn: deployment servers: use role keyword for role [puppet] - https://gerrit.wikimedia.org/r/230965
[02:44:35] operations, ops-eqiad, Patch-For-Review: reclaim zirconium - https://phabricator.wikimedia.org/T105510#1530417 (Dzahn)
[02:46:02] (CR) Dzahn: [C: 2] decom zirconium [dns] - https://gerrit.wikimedia.org/r/230830 (https://phabricator.wikimedia.org/T105510) (owner: Dzahn)
[02:47:46] operations, ops-eqiad, Patch-For-Review: reclaim zirconium - https://phabricator.wikimedia.org/T105510#1530423 (Dzahn) server is removed from DNS and shutdown please wipe disks and add to the "spares" wiki page / decide about reclaim or final decom
[02:48:14] operations, ops-eqiad: reclaim zirconium - https://phabricator.wikimedia.org/T105510#1530424 (Dzahn)
[02:48:57] operations, ops-eqiad: reclaim zirconium - https://phabricator.wikimedia.org/T105510#1530425 (Dzahn) a:Dzahn>None
[02:58:05] operations, ops-eqiad: reclaim zirconium - https://phabricator.wikimedia.org/T105510#1530441 (Dzahn) added to spares page. it's a X5647 processor https://wikitech.wikimedia.org/w/index.php?title=Server_Spares&type=revision&diff=173763&oldid=173615
[02:58:16] !log l10nupdate@tin Synchronized php-1.26wmf18/cache/l10n: l10nupdate for 1.26wmf18 (duration: 11m 13s)
[02:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:58:33] operations, ops-eqiad: reclaim zirconium (wipe disk) - https://phabricator.wikimedia.org/T105510#1530442 (Dzahn)
[03:02:00] (CR) Alex Monk: [C: 1] "Seems like a good idea for the bastions to be consistent" [puppet] - https://gerrit.wikimedia.org/r/222522 (owner: Dzahn)
[03:02:07] Ops-Access-Requests, operations: Access to fluorine for Trey Jones (tjones) - https://phabricator.wikimedia.org/T108696#1530449 (Dzahn) a:ArielGlenn
[03:02:59] Ops-Access-Requests, operations, ContentTranslation-Deployments, LE-CX6-Sprint 2: Access to /var/log/apertium for Kartik - https://phabricator.wikimedia.org/T108678#1530452 (Dzahn)
[03:03:16] Ops-Access-Requests, operations, ContentTranslation-Deployments, LE-CX6-Sprint 2: Access to /var/log/apertium for Kartik - https://phabricator.wikimedia.org/T108678#1530453 (Dzahn) p:Triage>Normal
[03:03:38] mutante, the only thing 'pending setup' now about mira is scap development, right?
[03:04:47] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf18) at 2015-08-12 03:04:47+00:00
[03:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:05:29] PROBLEM - Last backup of the others filesystem on labstore1002 is CRITICAL - Unit is in state activating
[03:06:21] !log Installing xdebug on terbium so matt_flaschen can debug memory leak
[03:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:06:57] matt_flaschen: you should have xdebug now
[03:14:58] (CR) Alex Monk: [C: 1] deployment servers: use role keyword for role [puppet] - https://gerrit.wikimedia.org/r/230965 (owner: Dzahn)
[03:15:09] (PS3) Alex Monk: deployment: include admin groups in role, not nodes [puppet] - https://gerrit.wikimedia.org/r/230966 (owner: Dzahn)
[03:15:26] (CR) Alex Monk: [C: 1] deployment: include admin groups in role, not nodes [puppet] - https://gerrit.wikimedia.org/r/230966 (owner: Dzahn)
[03:17:39] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[03:20:16] operations, Discovery, MediaWiki-Search, Monitoring: Search service monitoring should fail if search results only return exact matches and suggestions don't work - https://phabricator.wikimedia.org/T101914#1530458 (Deskana) >>! In T101914#1526896, @ArielGlenn wrote: > it looks like they want to chec...
[03:27:59] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[03:32:49] (PS3) Alex Monk: beta: delete ee_prototypewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/229608 (https://phabricator.wikimedia.org/T107397)
[03:33:30] (CR) Alex Monk: [C: 2] beta: delete ee_prototypewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/229608 (https://phabricator.wikimedia.org/T107397) (owner: Alex Monk)
[03:33:36] (Merged) jenkins-bot: beta: delete ee_prototypewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/229608 (https://phabricator.wikimedia.org/T107397) (owner: Alex Monk)
[03:34:41] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/229608 (duration: 00m 12s)
[03:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:34:54] (all other files were -labs)
[03:37:46] Blocked-on-Operations, Beta-Cluster, Collaboration-Team-Backlog, Patch-For-Review: Decide what to do with ee_prototypewiki in beta - https://phabricator.wikimedia.org/T107397#1530466 (Krenair) Open>Resolved ```mysql> DELETE FROM localnames WHERE ln_wiki='ee_prototypewiki'; Query OK, 2998 rows...
[03:38:32] Blocked-on-Operations, Beta-Cluster, Collaboration-Team-Backlog, Patch-For-Review: Decide what to do with ee_prototypewiki in beta - https://phabricator.wikimedia.org/T107397#1530468 (Krenair) Like the production wikis in deleted.dblist, the database for this site remains intact but of course is ina...
[03:45:38] Ops-Access-Requests, operations: Access to fluorine for Trey Jones (tjones) - https://phabricator.wikimedia.org/T108696#1530470 (Deskana) @tfinc can approve this, if it requires manager approval.
[03:46:39] RECOVERY - Last backup of the others filesystem on labstore1002 is OK - Last run successful
[04:18:45] (PS1) Ottomata: 0.8.2.1-3 release - fix for snappy 1.1.1.6 bug [debs/kafka] (debian) - https://gerrit.wikimedia.org/r/230970
[04:21:20] PROBLEM - puppet last run on analytics1027 is CRITICAL Puppet last ran 10 hours ago
[04:36:00] RECOVERY - puppet last run on analytics1027 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:45:50] !log starting slow "apt-get -y upgrade" on cp* (mostly, nginx -> +wmf2), will execute over ~18-24h
[04:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:54:12] Blocked-on-Operations, MediaWiki-extensions-CentralAuth, Wikimedia-General-or-Unknown, MW-1.26-release, and 2 others: Increase "remember me" login cookie expiry from 30 days to 1 year on Wikimedia wikis - https://phabricator.wikimedia.org/T68699#1530538 (Mattflaschen) I got an email from @Mpaulson,...
[05:25:33] (PS1) Dzahn: admins: add tjones to restricted [puppet] - https://gerrit.wikimedia.org/r/230974 (https://phabricator.wikimedia.org/T108696)
[05:30:02] (Restored) Dzahn: enable firewalling on tin [puppet] - https://gerrit.wikimedia.org/r/229151 (owner: Dzahn)
[05:37:12] (PS1) Dzahn: admins: delete mailman-users group [puppet] - https://gerrit.wikimedia.org/r/230977
[05:39:52] (PS3) Dzahn: pay-lvs: remove from hieradata/common.yaml [puppet] - https://gerrit.wikimedia.org/r/219077
[05:48:42] (CR) Matanya: [C: 1] admins: delete mailman-users group [puppet] - https://gerrit.wikimedia.org/r/230977 (owner: Dzahn)
[05:54:18] !log Killed support desk conversion. Will resume with profiling tomorrow.
[05:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:12:32] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Aug 12 06:12:31 UTC 2015 (duration 12m 30s)
[06:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:13:05] Blocked-on-Operations, MediaWiki-extensions-CentralAuth, Wikimedia-General-or-Unknown, MW-1.26-release, and 2 others: Increase "remember me" login cookie expiry from 30 days to 1 year on Wikimedia wikis - https://phabricator.wikimedia.org/T68699#1530679 (Legoktm) Are we doing this just on SUL wikis...
[06:23:15] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1530704 (Tau) I saved [[ https://raw.githubusercontent.com/wikimedia/mediawiki/f3d821de4e1f79b4c4b73e44b38d1c19830f5a9c/inc...
[06:24:31] (CR) Matanya: "So maybe drop the compatibility layer and all the if's and just make it 2.4 native ?" [puppet] - https://gerrit.wikimedia.org/r/230692 (owner: Dzahn)
[06:31:19] PROBLEM - puppet last run on cp1068 is CRITICAL Puppet has 1 failures
[06:31:19] PROBLEM - puppet last run on cp2013 is CRITICAL puppet fail
[06:31:38] PROBLEM - puppet last run on wtp2008 is CRITICAL Puppet has 1 failures
[06:31:39] PROBLEM - puppet last run on mw2016 is CRITICAL Puppet has 1 failures
[06:31:58] PROBLEM - puppet last run on mw1110 is CRITICAL Puppet has 1 failures
[06:32:08] PROBLEM - puppet last run on mw2021 is CRITICAL Puppet has 1 failures
[06:32:39] PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 1 failures
[06:32:48] PROBLEM - puppet last run on sca1001 is CRITICAL Puppet has 2 failures
[06:32:49] PROBLEM - puppet last run on db1046 is CRITICAL Puppet has 1 failures
[06:33:28] PROBLEM - puppet last run on mw2158 is CRITICAL Puppet has 1 failures
[06:33:39] PROBLEM - puppet last run on eventlog2001 is CRITICAL Puppet has 1 failures
[06:57:08] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:57:09] RECOVERY - puppet last run on db1046 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:57:40] RECOVERY - puppet last run on cp1068 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:58] RECOVERY - puppet last run on cp2013 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:57:58] RECOVERY - puppet last run on mw2158 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:58:09] RECOVERY - puppet last run on wtp2008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:09] RECOVERY - puppet last run on eventlog2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:10] RECOVERY - puppet last run on mw2016 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:58:20] RECOVERY - puppet last run on mw1110 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:58:39] RECOVERY - puppet last run on mw2021 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:09] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:07:38] operations, ContentTranslation-cxserver, Language-Engineering, MediaWiki-extensions-ContentTranslation, and 2 others: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1530743 (Arrbee)
[07:07:57] operations, ContentTranslation-cxserver, Language-Engineering, MediaWiki-extensions-ContentTranslation, and 2 others: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1530745 (Arrbee) p:Unbreak!>High
[07:15:49] godog: or akosiaris can you restart apertium-apy in sca1001/sca1002? Thanks!
[07:17:37] operations, ContentTranslation-cxserver, Language-Engineering, MediaWiki-extensions-ContentTranslation, and 2 others: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1530784 (KartikMistry)
[07:22:33] Ops-Access-Requests, operations, ContentTranslation-Deployments, LE-CX6-Sprint 2: Access to /var/log/apertium for Kartik - https://phabricator.wikimedia.org/T108678#1530842 (Arrbee)
[07:23:10] Ops-Access-Requests, operations, ContentTranslation-Deployments, LE-CX6-Sprint 2: Access to /var/log/apertium for Kartik - https://phabricator.wikimedia.org/T108678#1530843 (KartikMistry) Also, if I able to *restart* the process, that will be awesome! :)
[07:29:17] <_joe_> kart_: why should we restart it?
[07:31:29] _joe_: MT is not working at all.
[07:31:58] operations, ContentTranslation-cxserver, Language-Engineering, MediaWiki-extensions-ContentTranslation, and 2 others: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1530887 (Arrbee) p:High>Unbreak!
[07:34:04] <_joe_> what is mt?
[07:34:43] operations, ContentTranslation-cxserver, Language-Engineering, MediaWiki-extensions-ContentTranslation, and 2 others: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1530893 (Joe) @Arrbee can I understand why you raised the priority o...
[07:34:54] _joe_: Machine Translation (ie Apertium)
[07:35:11] <_joe_> kart_: I am not willing to restart it without understanding first what's wrong
[07:35:40] <_joe_> going to take a look
[07:35:56] https://cxserver.wikimedia.org/translation/
[07:36:12] Click on Start (health check)
[07:36:23] also, it fails in translation of article.
[07:36:32] <_joe_> ok, it is failing
[07:36:42] <_joe_> why is it failing is what I want to understand
[07:37:25] <_joe_> quite interestingly, no log files in /var/log/apertium
[07:39:07] <_joe_> accept4(5, 0x7fff5599e4b0, [16], SOCK_CLOEXEC) = -1 EMFILE (Too many open files)
[07:39:26] _joe_: we added -j1 and -m300
[07:39:37] I'm not sure service was restarted after it.
[07:40:01] So, it is apertium-apy issue somewhere.
[07:41:28] <_joe_> /usr/bin/python3 /usr/share/apertium-apy/servlet.py -j1 -m300 /usr/share/apertium/modes
[07:41:31] <_joe_> so yes
[07:41:33] <_joe_> but
[07:43:39] so, was it restarted?
[07:43:57] <_joe_> no
[07:44:07] <_joe_> I am still trying to build a decent bug report
[07:44:21] There is one :)
[07:44:31] https://phabricator.wikimedia.org/T107270
[07:45:24] <_joe_> no, I am actually adding info there. You know, now I found out what the problem is, so that we don't need to restart the damn thing again tomorrow
[07:46:03] <_joe_> this is btw why I am reluctant in giving around restart rights, people must understand that problems must be fixed in the right way, or they will come back
[07:46:10] operations, ContentTranslation-cxserver, Language-Engineering, MediaWiki-extensions-ContentTranslation, and 2 others: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1530904 (Joe) btw, with the new settings we incur in a too many open...
[07:46:59] <_joe_> !log restarted apertium-apy on sca1001 and sca1002, too many open files, probably leaking
[07:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:47:40] _joe_: thanks!
[07:47:44] MT is back now.
[07:52:32] (PS1) Giuseppe Lavagetto: apertium: raise the open files limit [puppet] - https://gerrit.wikimedia.org/r/230983
[07:52:39] <_joe_> kart_: it will be even better once this ^^ is merged
[07:53:41] (PS2) Giuseppe Lavagetto: apertium: raise the open files limit [puppet] - https://gerrit.wikimedia.org/r/230983 (https://phabricator.wikimedia.org/T107270)
[07:59:40] Ops-Access-Requests, operations, ContentTranslation-Deployments, LE-CX6-Sprint 2: Access to /var/log/apertium for Kartik - https://phabricator.wikimedia.org/T108678#1530914 (Joe) @KartikMistry I have just checked and /var/log/apertium is empty, so we must be missing something there. Also, for rest...
[08:00:04] <_joe_> kart_: so, why don't we have a healthcheck for apertium directly into our monitoring infrastructure?
[08:00:07] <_joe_> meh
[08:02:12] _joe_: that's plan. there should be some notification when apertium-apy fails.
[08:02:31] _joe_: I will work with akosiaris on this.
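The `EMFILE (Too many open files)` result on `accept4` above means the process had reached its per-process open-files limit (RLIMIT_NOFILE), which is why a restart only helps until the descriptors pile up again. A minimal sketch, assuming Linux and Python 3, of how a check could compare a descriptor count against the soft limit. The function name `fd_headroom` and the 80% warning ratio are illustrative assumptions; this is not part of apertium-apy or of the monitoring setup discussed in the channel.

```python
import os
import resource


def fd_headroom(open_fds, soft_limit, warn_ratio=0.8):
    """Classify descriptor usage against a soft RLIMIT_NOFILE value."""
    if soft_limit <= 0:
        raise ValueError("soft_limit must be positive")
    if open_fds >= soft_limit:
        return "CRITICAL"  # further accept()/open() calls would fail with EMFILE
    if open_fds >= soft_limit * warn_ratio:
        return "WARNING"
    return "OK"


def current_process_usage():
    """Return (open_fds, soft_limit) for this process; relies on Linux /proc."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    open_fds = len(os.listdir("/proc/self/fd"))
    return open_fds, soft


if __name__ == "__main__" and os.path.isdir("/proc/self/fd"):
    fds, soft = current_process_usage()
    if soft == resource.RLIM_INFINITY:
        print(f"{fds} descriptors open, no soft limit set")
    else:
        print(f"{fds}/{soft} descriptors open: {fd_headroom(fds, soft)}")
```

Raising the limit, as the puppet patch above does, buys headroom; a check along these lines would only show whether the count keeps climbing, which is what a leak looks like.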
[08:04:01] Ops-Access-Requests, operations, ContentTranslation-Deployments, LE-CX6-Sprint 2: Access to /var/log/apertium for Kartik - https://phabricator.wikimedia.org/T108678#1530929 (KartikMistry) @Joe only after debugging for sure!
[08:07:20] _joe_: I've added about health check into Apertium service, https://phabricator.wikimedia.org/T108798
[08:09:48] (CR) KartikMistry: [C: 1] apertium: raise the open files limit [puppet] - https://gerrit.wikimedia.org/r/230983 (https://phabricator.wikimedia.org/T107270) (owner: Giuseppe Lavagetto)
[08:25:20] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[08:26:25] (CR) Muehlenhoff: [C: -1] "The rsync server running on port 873 doesn't seem to be covered?" [puppet] - https://gerrit.wikimedia.org/r/229151 (owner: Dzahn)
[08:54:30] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[09:00:58] operations, ContentTranslation-cxserver, Language-Engineering, MediaWiki-extensions-ContentTranslation, and 3 others: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1530982 (Unhammer) "the new settings" being just -j1 -m300? Also, wh...
[09:03:14] hey, can someone check if nutcracker is working correctly on mw1123? Getting strange API errors for that host and nutcracker was responsible for that the last time
[09:03:17] operations, ContentTranslation-cxserver, Language-Engineering, MediaWiki-extensions-ContentTranslation, and 3 others: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1530993 (KartikMistry) @Unhammer We haven't updated Apertium-APY yet...
[09:43:07] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering, 10MediaWiki-extensions-ContentTranslation, and 3 others: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1531049 (10Unhammer) You might want to try running ``` tools/sanity... [09:44:13] (03PS2) 10Tobias Gritschacher: phragile: Add role class [puppet] - 10https://gerrit.wikimedia.org/r/227466 (https://phabricator.wikimedia.org/T108803) (owner: 10WMDE-leszek) [09:52:58] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 11 data above and 5 below the confidence bounds [10:04:15] 10Ops-Access-Requests, 6operations: Add ppchelko to the staff groups - https://phabricator.wikimedia.org/T108805#1531096 (10mobrovac) 3NEW a:3ArielGlenn [10:05:13] (03CR) 10Alexandros Kosiaris: [C: 032] apertium: raise the open files limit [puppet] - 10https://gerrit.wikimedia.org/r/230983 (https://phabricator.wikimedia.org/T107270) (owner: 10Giuseppe Lavagetto) [10:06:16] <_joe_> thanks AKPWD [10:06:20] <_joe_> err akosiaris [10:06:37] _joe_: thanks! [10:14:04] (03CR) 10Alexandros Kosiaris: [C: 031] 0.8.2.1-3 release - fix for snappy 1.1.1.6 bug [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/230970 (owner: 10Ottomata) [10:16:22] (03CR) 10John F. Lewis: [C: 031] "Since sudo was approved, this is indeed irrelevant. This only existed to get shell access before my holiday which Faidon suggested. Kill i" [puppet] - 10https://gerrit.wikimedia.org/r/230977 (owner: 10Dzahn) [10:19:56] <_joe_> akosiaris: oh you're packaging java apps today? what about a nice visit to the dentist later, to cheer you up a bit? 
[10:20:14] (03CR) 10Alexandros Kosiaris: [C: 031] Add ferm rules for Logstash/Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/227960 (https://phabricator.wikimedia.org/T104964) (owner: 10Muehlenhoff) [10:24:16] (03PS1) 10Jcrespo: Depool db2040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230989 [10:25:22] (03CR) 10Jcrespo: [C: 032] Depool db2040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230989 (owner: 10Jcrespo) [10:27:19] !log jynus@tin Synchronized wmf-config/db-codfw.php: depool db2040 (duration: 00m 11s) [10:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:27:50] 10Ops-Access-Requests, 6operations: Add ppchelko to the staff groups - https://phabricator.wikimedia.org/T108805#1531138 (10ArielGlenn) 5Open>3Resolved added. root@terbium:~# ldaplist -l group wmf | grep pch member: uid=ppchelko,ou=people,dc=wikimedia,dc=org [10:28:04] 6operations, 3Discovery-Maps-Sprint: Postgres replication is not working - https://phabricator.wikimedia.org/T108545#1531141 (10akosiaris) So this was due to a config change back in Jul 19th. https://phabricator.wikimedia.org/rOPUP5a7adbb4a5be5f49633efd778879f284fe6f8af9 I did not propagate it to the slaves a... 
[10:29:36] 6operations: monitor postgres replication - https://phabricator.wikimedia.org/T108806#1531147 (10akosiaris) 3NEW [10:32:48] _joe_: I am actually planning one [10:32:50] :P [10:35:33] (03PS1) 10Aude: Enable arbitrary access on dewiki, frwiki, jawiki and s3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230990 (https://phabricator.wikimedia.org/T100787) [10:49:54] (03CR) 10Alexandros Kosiaris: [C: 04-1] Introduce new labs role for vagrant+lxc (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/230928 (owner: 10EBernhardson) [11:01:49] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [11:20:17] (03PS14) 10Giuseppe Lavagetto: puppet-compiler: first commit [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/228849 (https://phabricator.wikimedia.org/T96802) [11:24:09] akosiaris, thanks for figuring it out! do you know why 2 out of 4 machines could consistently be having lower cpu usage? https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&c=maps+Cluster+codfw&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=2&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [11:25:24] yurik: what's up with all the osmimporter users ? [11:25:33] I see most postgres usage by that user [11:25:41] akosiaris, i'm manually running the gen job [11:25:48] on 4 machines [11:26:20] once the tilerator service & accounts are in place, i will switch to that [11:26:33] then my first guess is the gen job does not balance well enough ? [11:26:51] I see 2 boxes consistently at 100% usage [11:26:59] which also means the gen job needs to throttled [11:27:24] i can do that easily in the future - by lowering the number of tilerator instances [11:27:27] currently - at 4 [11:27:46] and each is doing work in 2 node-js style "threads" [11:27:56] <_joe_> threads? [11:27:56] 6operations, 7Monitoring: monitor postgres replication - https://phabricator.wikimedia.org/T108806#1531231 (10Aklapper) [11:28:02] <_joe_> what do you mean by that? 
[11:28:21] async operations, started at the same time [11:28:30] non-preemptive multithreading [11:28:40] <_joe_> no, that is _not_ multithreading at all [11:28:50] <_joe_> it's an event loop with async io [11:28:54] that's why i put quotes around it :) [11:29:18] <_joe_> so if one of your "threads" uses one cpu maxing it out, you have serial processing [11:29:29] feel free to come up with a better name for them [11:29:33] <_joe_> I guess this is not the case [11:29:43] postgres is doing it in multithreads [11:29:45] <_joe_> yurik: callbacks ;) [11:29:59] rrright - "i have two callbacks running at the same time" [11:30:02] <_joe_> oh yeah postgres is a real software :) [11:30:21] oh these callbacks are being blocked on waiting for postgres to finish the select it is running [11:30:31] yep [11:30:48] and postgres is doing a lot of heavy lifting with geo-queries [11:30:49] so they yield and everything moves at a pace that appears to be concurrent [11:30:56] exactly :) [11:31:04] hence "threads" [11:31:20] yeah, multithreading is not concurrency but anyway [11:31:27] those are philosophical things [11:31:36] so .. there are many many selects as I see it [11:31:41] way more than 8 [11:31:45] * yurik welcomes a better solution to the biggest problem in programming -- better naming [11:32:13] I am not sure at all that having fewer tilerator instances will actually make it better [11:32:26] yurik: wanna try it ? [11:32:46] it does - i already tried that before. I don't want to interrupt the process half-way -- it takes about 30 hours to finish [11:32:58] and i am still unsure if it continues properly [11:33:06] (will need to test it more) [11:33:12] I thought it was job based ? [11:33:30] that's what I got from what you showed me under that /kue url [11:33:30] it is - but each job is about 20 hours [11:33:39] 20 hour job ? [11:33:42] that a big quantum [11:33:46] is* [11:33:56] could it be split up more ?
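The "threads" in quotes above can be sketched as follows. The discussion is about Node.js; this uses Python's asyncio as a stand-in, since it has the same single-threaded event loop model. The worker names and the `asyncio.sleep(0)` call (standing in for waiting on a postgres query) are assumptions for illustration only.

```python
import asyncio
import threading


async def worker(name, log, thread_ids, steps=3):
    for i in range(steps):
        log.append(f"{name} step {i}")
        thread_ids.add(threading.get_ident())
        # Stand-in for awaiting a database query: the coroutine yields here
        # and the event loop runs whichever other coroutine is ready.
        await asyncio.sleep(0)


async def main():
    log, thread_ids = [], set()
    await asyncio.gather(
        worker("A", log, thread_ids),
        worker("B", log, thread_ids),
    )
    return log, thread_ids


log, thread_ids = asyncio.run(main())
print(log)
print("threads used:", len(thread_ids))
```

Both workers make progress in turns, yet only one OS thread is ever involved. If a worker burned CPU between awaits instead of yielding, the other would simply wait, which is the serial-processing case _joe_ describes.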
[11:34:00] otherwise we would have billions of unmanagable jobs - i don't have very good tools for large job numbers [11:34:11] but worry not - the job has a "progress data" [11:34:26] which means that when it does progress update , it stores how much of it is done [11:34:38] which means i should be able to continue an interrupted job [11:34:52] that would be useful [11:34:53] i have the code for it, but i think the Kue system might be broken cross-machines [11:35:08] that does not sound reassuring [11:35:15] what's the architecture of that thing ? [11:35:26] so will need to double check if progress data makes it to the job runner on restart [11:35:38] they just added the progress data, so might still have some quirks [11:35:41] redis [11:35:55] everything is stored in redis, and nodejs is just a wrapper around it [11:36:03] nodejs Kue lib [11:36:16] https://github.com/Automattic/kue ? [11:36:18] yep [11:37:15] the restart has worked on the local instance, i just suspect that it was locally cached instead of being pulled out of the redis on the restart. So will want to test after this set of jobs is done [11:38:01] i don't worry about throttling at this point because of low load, but later - yes, will need to throttle it to ~2/3 instances, running 1 "thread" each [11:41:54] _joe_, actually i think from my win3.1 days, it is called non-preemptive multitasking [11:49:28] (03PS1) 10KartikMistry: Log for Apertium [puppet] - 10https://gerrit.wikimedia.org/r/230992 (https://phabricator.wikimedia.org/T108797) [11:54:58] akosiaris, lol, we organized them wrong -- 2002 & 2004 are 12 core machines with 96gb, whereas 2001 & 2003 are 8 with 64 [11:55:32] should have made 2002 the master [11:55:53] not a biggie, just need to adjust the number of threads accordingly [12:00:04] aude: Dear anthropoid, the time has come. Please deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150812T1200). 
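The "progress data" idea above (store how far a long job has got, then continue from there after an interruption) can be sketched like this. This is not Kue's API: the plain dict stands in for Redis, and `run_job`, `checkpoint_every` and `fail_at` are hypothetical names used only for the illustration.

```python
def run_job(items, store, job_id, handle, checkpoint_every=2, fail_at=None):
    """Process items starting from the last checkpoint saved under job_id."""
    start = store.get(job_id, 0)  # 0 when the job has never checkpointed
    for idx in range(start, len(items)):
        if fail_at is not None and idx == fail_at:
            raise RuntimeError("simulated crash")
        handle(items[idx])
        done = idx + 1
        # Persist progress periodically and once more at the very end.
        if done % checkpoint_every == 0 or done == len(items):
            store[job_id] = done
    return store.get(job_id, 0)


store = {}        # stand-in for the shared Redis instance
processed = []
items = list(range(8))

try:
    run_job(items, store, "gen", processed.append, fail_at=5)
except RuntimeError:
    pass          # the first runner died part-way through

run_job(items, store, "gen", processed.append)  # a restarted runner resumes
print(processed, store)
```

Work done after the last checkpoint is repeated on restart, so the per-item handler has to tolerate being run twice. The checkpoint also has to live in the shared store rather than in a local cache, which is the cross-machine concern raised above.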
[12:02:16] akosiaris, ok, fixed - added two more threads, now 90% on all [12:11:09] deploy! [12:17:02] (03CR) 10Joal: "Hey Marcel," [puppet] - 10https://gerrit.wikimedia.org/r/230825 (https://phabricator.wikimedia.org/T108339) (owner: 10Mforns) [12:17:43] (03Abandoned) 10Aude: Bump cache epoche for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227815 (owner: 10Aude) [12:19:15] (03PS3) 10Aude: Add config for Wikisource badges on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229062 (https://phabricator.wikimedia.org/T97014) [12:19:30] (03CR) 10Aude: [C: 032] Add config for Wikisource badges on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229062 (https://phabricator.wikimedia.org/T97014) (owner: 10Aude) [12:19:36] (03Merged) 10jenkins-bot: Add config for Wikisource badges on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229062 (https://phabricator.wikimedia.org/T97014) (owner: 10Aude) [12:20:51] !log aude@tin Synchronized wmf-config/Wikibase-production.php: Add Wikisource badge config (duration: 00m 11s) [12:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:21:18] 6operations, 7Icinga: Investigate why Icinga's check_disk panics on snatshot mounts - https://phabricator.wikimedia.org/T108694#1531370 (10Aklapper) [12:21:39] !log aude@tin Synchronized wmf-config/Wikibase-labs.php: Add Wikisource badge config (duration: 00m 13s) [12:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:23:19] PROBLEM - HHVM rendering on mw1140 is CRITICAL - Socket timeout after 10 seconds [12:24:38] PROBLEM - puppet last run on mw2107 is CRITICAL Puppet has 1 failures [12:24:50] PROBLEM - Apache HTTP on mw1140 is CRITICAL - Socket timeout after 10 seconds [12:29:39] PROBLEM - HHVM queue size on mw1140 is CRITICAL 80.00% of data above the critical threshold [80.0] [12:30:00] (03PS2) 10Aude: Enable arbitrary access on dewiki, frwiki, jawiki 
and s3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230990 (https://phabricator.wikimedia.org/T100787) [12:30:33] (03CR) 10Aude: Enable arbitrary access on dewiki, frwiki, jawiki and s3 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230990 (https://phabricator.wikimedia.org/T100787) (owner: 10Aude) [12:31:52] (03CR) 10Aude: [C: 032] "double checked (with sdiff) this list against usagetracking.dblist and wikidataclient.dblist and looks correct." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230990 (https://phabricator.wikimedia.org/T100787) (owner: 10Aude) [12:31:58] (03Merged) 10jenkins-bot: Enable arbitrary access on dewiki, frwiki, jawiki and s3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230990 (https://phabricator.wikimedia.org/T100787) (owner: 10Aude) [12:32:09] PROBLEM - HHVM busy threads on mw1140 is CRITICAL 100.00% of data above the critical threshold [86.4] [12:32:22] does hhvm need to be restarted on mw1140? [12:36:54] !log aude@tin Synchronized arbitraryaccess.dblist: Enable arbitrary access on dewiki, frwiki, jawiki and s3 wikis (duration: 00m 12s) [12:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:37:37] done [12:44:58] (03CR) 10Jgreen: [C: 04-1] "I'm probably missing something, but I don't see any other spot in puppet config where the frack aggregators are configured. 
We talked abou" [puppet] - 10https://gerrit.wikimedia.org/r/219077 (owner: 10Dzahn) [12:51:21] RECOVERY - puppet last run on mw2107 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [12:54:49] PROBLEM - puppet last run on db2004 is CRITICAL puppet fail [12:58:32] ACKNOWLEDGEMENT - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-tools/snapshot is not accessible: Permission denied Coren Test faulty (see https://phabricator.wikimedia.org/T104975) [12:58:52] (03PS2) 10Alexandros Kosiaris: Get scb up to par with sca [puppet] - 10https://gerrit.wikimedia.org/r/230800 [12:59:00] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Get scb up to par with sca [puppet] - 10https://gerrit.wikimedia.org/r/230800 (owner: 10Alexandros Kosiaris) [13:00:10] !log disabled puppet on maps-test200X [13:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:02:50] (03Restored) 10TTO: Allow import from any WMF project to any other [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://bugzilla.wikimedia.org/15583) (owner: 10TTO) [13:03:07] (03PS3) 10TTO: Allow import from any WMF project to any other [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) [13:12:11] akosiaris, are you stopping/restarting any services? [13:12:55] if you need to, go ahead with kartotherian, but pls don't restart/kill anything running under yurik [13:12:57] yurik: yes, probably kartotherian and tilerator [13:13:11] password changes and all [13:13:55] tilerator is not a service yet, and the one running under my account should stay runnig [13:14:13] that's my point, it is going to be [13:14:27] i understand - but we can run more than one instance in parallel [13:14:34] it shouldn't be affected by password changes unless you are removing cassandra's default [13:14:43] can you keep cassandra's default for now? [13:15:00] e.g. 
add the new ones, but keep the old one until tonight [13:15:23] later we can change the default password on the 'cassandra' user [13:15:38] ok [13:15:42] thx [13:17:29] akosiaris, another note - tilerator service would not start until redis is in place [13:17:50] currently it is "kinda" in place - it runs in usermode under my acct [13:18:00] but we will need it puppetized [13:18:49] (03PS9) 10Alexandros Kosiaris: Added tilerator service, granted kartotherian OSM DB read access [puppet] - 10https://gerrit.wikimedia.org/r/229727 (https://phabricator.wikimedia.org/T105074) (owner: 10Yurik) [13:18:56] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Added tilerator service, granted kartotherian OSM DB read access [puppet] - 10https://gerrit.wikimedia.org/r/229727 (https://phabricator.wikimedia.org/T105074) (owner: 10Yurik) [13:19:48] RECOVERY - puppet last run on db2004 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures [13:19:57] (03CR) 10Alex Monk: [C: 04-2] "Unmerged dependencies :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO) [13:22:41] Retrieving watchlist for commons:commons via API. [13:22:42] Result: 503 Service Unavailable [13:22:44] ^^ Same result all morning. Is there a known operational problem with the Commons API? [13:23:23] akosiaris, i have to go offline for a bit, email or hangout msg [13:23:44] Santoni: not that we know of. wanna file a bug ? [13:24:10] Santoni: works for me. Do you have a very large watchlist? [13:24:53] Yes. 831,000 items. But surely the error should be a 504 if timing out rather than 503? [13:24:55] I remember some issues with at least clearing large watchlists, and retrieving them might have comparable issues [13:25:28] The routine I'm running is actually helping to remove items from the watchlist, so being unable to use the API to help is a drag. [13:26:11] I'll add it to phab if it is still returning errors in a couple of hours. 
[13:27:12] Santoni: you're right -- large watchlists give 500s, not 503s [13:27:42] Ah, so some sort of real bug then. [13:28:24] although https://phabricator.wikimedia.org/T68212 does mention a 503. In any case, even if the api is another example of the same bug, it makes sense to track it [13:30:09] PROBLEM - puppet last run on scb1001 is CRITICAL Puppet has 2 failures [13:31:13] RECOVERY - puppet last run on scb1001 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures [13:41:33] PROBLEM - puppet last run on scb1002 is CRITICAL Puppet has 2 failures [13:46:12] PROBLEM - puppet last run on eventlog1001 is CRITICAL puppet fail [13:47:55] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-Requests: Rename 'roa-rup' wikis to 'rup' - https://phabricator.wikimedia.org/T17988#1531566 (10TTO) [14:01:17] 6operations, 7Icinga: Investigate why Icinga's check_disk panics on snatshot mounts - https://phabricator.wikimedia.org/T108694#1531589 (10Dzahn) the -x option of check_disk can exclude devices/filesystems from being checked. example: root@neon:/usr/lib/nagios/plugins# ./check_disk -w 1 -c 1 DISK OK - free s... [14:02:16] PROBLEM - puppet last run on scb1001 is CRITICAL Puppet has 2 failures [14:02:31] (03CR) 10Filippo Giunchedi: puppet-compiler: first commit (0328 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/228849 (https://phabricator.wikimedia.org/T96802) (owner: 10Giuseppe Lavagetto) [14:04:05] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1531596 (10Ottomata) I did a quick 5 minute googling, and it doesn't look like the varnishlog API ha... 
[14:05:07] (03CR) 10Alex Monk: Allow import from any WMF project to any other (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO) [14:05:45] (03CR) 10Alex Monk: " maybe with the config though - move wgImportSources to wmgProdImportSources and only set it to the real variable if wmfRealm ===" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO) [14:08:19] 6operations, 10Traffic: Investigate TCP Fast Open for tlsproxy - https://phabricator.wikimedia.org/T108827#1531608 (10BBlack) 3NEW [14:11:09] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1531627 (10BBlack) I don't think we'd want to even if we could, TBH. The privacy implications are b... [14:13:48] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 5Patch-For-Review: Build 0.8.2.1 Kafka package and upgrade Kafka brokers - https://phabricator.wikimedia.org/T106581#1531633 (10Ottomata) [14:15:14] (03PS1) 10BBlack: tlsproxy: add deferred/backlog options [puppet] - 10https://gerrit.wikimedia.org/r/231006 [14:15:26] (03PS4) 10Dzahn: ferm rules for nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970) [14:16:13] (03CR) 10Dzahn: ferm rules for nutcracker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970) (owner: 10Dzahn) [14:18:49] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 5Patch-For-Review: Build 0.8.2.1 Kafka package and upgrade Kafka brokers - https://phabricator.wikimedia.org/T106581#1531644 (10Ottomata) Phew, after much difficulty, the 4 original Precise brokers are now running 0.8.2.1. There was a bug in the version... 
[14:21:16] 6operations, 7Graphite: grafana access control - https://phabricator.wikimedia.org/T108546#1531648 (10Dzahn) > graphite.wikimedia.org requires a WMF account, while grafana.wikimedia.org is completely open and exposes the exact same set of metrics. As far as i know this is not the case and the set of metrics i... [14:22:27] !log upgrade cassandra on restbase1003 [14:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:22:50] (03CR) 10Tim Landscheidt: Introduce new labs role for vagrant+lxc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/230928 (owner: 10EBernhardson) [14:25:34] 6operations: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1531667 (10Dzahn) We should just install an OS on tungsten. The "bare metal in labs" discussion seems to just have distracted. [14:26:38] 6operations: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1531676 (10Dzahn) >>! In T106563#1471867, @RobH wrote: > There is an ongoing discussion on what vlan this can live in, installation cannot progress until resolved. One that lets it t... [14:30:45] !log upgrade cassandra on restbase1004 [14:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:34:44] !log upgrade cassandra on restbase1008 [14:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:36:53] 6operations, 10RESTBase-Cassandra: upgrade RESTBase cluster to Cassandra 2.1.8 - https://phabricator.wikimedia.org/T107949#1531712 (10fgiunchedi) >>! In T107949#1527473, @Eevans wrote: >>>! In T107949#1526823, @fgiunchedi wrote: >> upgrade plan, starting today: >> * upgrade row A machines, (restbase100[127]) w... 
[14:38:54] (03CR) 10BBlack: [C: 032] tlsproxy: add deferred/backlog options [puppet] - 10https://gerrit.wikimedia.org/r/231006 (owner: 10BBlack) [14:41:07] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [14:50:03] (03PS1) 10Jcrespo: Repool db2040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/231012 [14:51:13] 6operations, 10OTRS: Upgrade OTRS to latest stable release (4.0 or later) - https://phabricator.wikimedia.org/T74109#1531740 (10Mdann52) p:5Low>3High Any progress on this yet? Considering we are using an unsupported version now, I'm upgrading priority. [14:54:44] (03PS2) 10BBlack: varnish: director->backends is now always an array [puppet] - 10https://gerrit.wikimedia.org/r/230806 [14:55:51] (03CR) 10BBlack: [C: 032] varnish: director->backends is now always an array [puppet] - 10https://gerrit.wikimedia.org/r/230806 (owner: 10BBlack) [14:56:34] (03CR) 10Jcrespo: [C: 032] Repool db2040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/231012 (owner: 10Jcrespo) [14:57:30] 6operations, 10Traffic, 7HTTPS: Samsung GT-S3650 can't connect to Wikipedia - https://phabricator.wikimedia.org/T108298#1531755 (10Mdann52) I've met similar cases such as this on OTRS, most centring on older models of phones that do not support HTTPS. Unfortunately, unless we are going to enable support for... [14:58:01] (03PS1) 10Dzahn: Revert "remove production tungsten dns" [dns] - 10https://gerrit.wikimedia.org/r/231015 [14:58:03] (03CR) 10jenkins-bot: [V: 04-1] Revert "remove production tungsten dns" [dns] - 10https://gerrit.wikimedia.org/r/231015 (owner: 10Dzahn) [14:59:19] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2040 (duration: 00m 11s) [15:00:05] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150812T1500). 
[15:00:18] (03Abandoned) 10Dzahn: Revert "remove production tungsten dns" [dns] - 10https://gerrit.wikimedia.org/r/231015 (owner: 10Dzahn) [15:00:40] I have a couple of things [15:00:55] (03PS2) 10Alex Monk: Import sources for mr.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228475 (https://phabricator.wikimedia.org/T105116) (owner: 10Dereckson) [15:01:01] (03CR) 10Alex Monk: [C: 032] Import sources for mr.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228475 (https://phabricator.wikimedia.org/T105116) (owner: 10Dereckson) [15:01:08] (03Merged) 10jenkins-bot: Import sources for mr.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228475 (https://phabricator.wikimedia.org/T105116) (owner: 10Dereckson) [15:01:41] (03CR) 10Alex Monk: [C: 04-1] "-1 until Nikerabbit's comment is addressed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224772 (https://phabricator.wikimedia.org/T105124) (owner: 10Dereckson) [15:02:39] 6operations, 10Traffic: Support ALPN + HTTP/2 - https://phabricator.wikimedia.org/T96848#1531771 (10BBlack) Status update: [[ https://www.nginx.com/blog/early-alpha-patch-http2/ | nginx has announced ]] an [[ http://nginx.org/patches/http2/ | an early-alpha quality patch ]], which I've [[ https://gerrit.wikime... 
[15:03:10] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/228475/ (duration: 00m 12s) [15:04:20] !log upgrading kernel and rebooting labvirt1001 as per https://phabricator.wikimedia.org/T99738 [15:05:06] (03PS4) 10BBlack: HTTP/2 alpha patch [software/nginx] (wmf-1.9.3-1-h2) - 10https://gerrit.wikimedia.org/r/230040 (https://phabricator.wikimedia.org/T96848) [15:09:03] (03PS5) 10BBlack: varnish: get rid of some pre-systemd cruft [puppet] - 10https://gerrit.wikimedia.org/r/228591 [15:10:08] PROBLEM - Host labvirt1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:17] RECOVERY - Host labvirt1001 is UPING OK - Packet loss = 0%, RTA = 0.64 ms [15:17:26] (03PS1) 10Dzahn: wmnet: fix indentations for readability [dns] - 10https://gerrit.wikimedia.org/r/231020 [15:21:39] (03PS1) 10Ottomata: Update JMX metrics names for Kafka 0.8.2+ [puppet/kafka] - 10https://gerrit.wikimedia.org/r/231021 [15:21:44] (03CR) 10jenkins-bot: [V: 04-1] Update JMX metrics names for Kafka 0.8.2+ [puppet/kafka] - 10https://gerrit.wikimedia.org/r/231021 (owner: 10Ottomata) [15:21:58] (03PS2) 10Ottomata: Update JMX metrics names for Kafka 0.8.2+ [puppet/kafka] - 10https://gerrit.wikimedia.org/r/231021 (https://phabricator.wikimedia.org/T106581) [15:22:00] (03CR) 10jenkins-bot: [V: 04-1] Update JMX metrics names for Kafka 0.8.2+ [puppet/kafka] - 10https://gerrit.wikimedia.org/r/231021 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [15:22:35] (03PS3) 10Ottomata: Update JMX metrics names for Kafka 0.8.2+ [puppet/kafka] - 10https://gerrit.wikimedia.org/r/231021 (https://phabricator.wikimedia.org/T106581) [15:22:37] (03CR) 10jenkins-bot: [V: 04-1] Update JMX metrics names for Kafka 0.8.2+ [puppet/kafka] - 10https://gerrit.wikimedia.org/r/231021 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [15:23:14] (03PS4) 10Ottomata: Update JMX metrics names for Kafka 0.8.2+ [puppet/kafka] - 
10https://gerrit.wikimedia.org/r/231021 (https://phabricator.wikimedia.org/T106581) [15:47:55] (03PS1) 10coren: nrpe: systemd unit status check tweak [puppet] - 10https://gerrit.wikimedia.org/r/231024 [15:49:55] apergos: there for a quick review? ^^ [15:50:06] The change is very small, but affects a lot of boxen. [15:51:49] (03PS1) 10Dzahn: re-add IP for node tungsten [dns] - 10https://gerrit.wikimedia.org/r/231025 (https://phabricator.wikimedia.org/T106563) [15:51:51] Well, to be more precise, the check it affects run on a lot of 'em [15:54:28] (03PS8) 10Chad: Elastic: move auto_create_index into hiera instead of role [puppet] - 10https://gerrit.wikimedia.org/r/207140 [15:54:53] Anyone got a sec? Pretty trivial ^ [15:55:30] (03CR) 10Ori.livneh: [C: 032] Elastic: move auto_create_index into hiera instead of role [puppet] - 10https://gerrit.wikimedia.org/r/207140 (owner: 10Chad) [15:55:38] I'll look at yours if you look at mine. :-) [15:55:47] Oh, bah. Ninja'ed [15:56:10] Coren: Heh, need something looked at? [15:56:18] ori: Thx! [15:56:24] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to fluorine for Trey Jones (tjones) - https://phabricator.wikimedia.org/T108696#1531897 (10Tfinc) Approved [15:56:39] https://gerrit.wikimedia.org/r/231024 [15:56:53] Coren: do you have a link to the docs to confirm that? seems funky [15:57:14] (03CR) 10Chad: [C: 031] "Seems legit. +1 because I can't +2" [puppet] - 10https://gerrit.wikimedia.org/r/231024 (owner: 10coren) [15:57:32] ostriches: https://gerrit.wikimedia.org/r/#/c/231026/ at you :P [15:57:49] ostriches: basically reorders tags so the stylesheets are first [15:57:53] ori: No - that thing is completely crappily undocumented. I found this out by experimenting. [15:58:19] ori: +2 [15:58:24] thanks [15:58:33] yw [15:59:12] ori: Basically, the observed behaviour is that the state gets 'activating' when the ExecStart is ran, and switches to 'active' once it forks. 
[15:59:40] (03CR) 10Alexandros Kosiaris: Introduce new labs role for vagrant+lxc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/230928 (owner: 10EBernhardson) [15:59:41] ori: Which never happens when it's a oneshot [16:00:04] jdlrobson kaldari codezee: Dear anthropoid, the time has come. Please deploy Special deploy window from Greg (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150812T1600). [16:00:32] mystery deploy:) [16:00:42] Coren: I think it might be a bug in the unit file, then [16:00:47] maybe bblack would know? [16:00:50] <_joe_> Coren: uhm, scripts launched by systemd should preferably not fork or daemonize [16:01:06] _joe_: I know - and mine doesn't. [16:01:08] I found this (admittedly not very relevant) https://bbs.archlinux.org/viewtopic.php?id=156959 in which a service being stuck in activating is reported as an issue [16:01:12] <_joe_> ori: it is well possible, but I agree the docs are horribly scarce [16:01:14] jdlrobson: what is special deploy ?:) [16:01:28] mutante: Wikivoyage page banners [16:01:31] just scanning for kaldari [16:01:34] <_joe_> so it's mostly matter of looking at the code, I guess [16:01:49] present :) [16:01:55] jdlrobson: ah:) [16:02:05] http://pagebanner.wmflabs.org/wiki/Unesco [16:02:19] kaldari: sweet. let's get this show on the road [16:02:22] Well, this matches current behavior if nothing else. I can readily observe that this is what is happening on a timer that starts a oneshot. [16:02:32] jdlrobson: lemme go pee first... [16:02:45] (03PS2) 10Chad: Elastic: Unify default plugins_dir in /srv/deployment [puppet] - 10https://gerrit.wikimedia.org/r/221766 [16:02:49] The ExecStart commandline is running correctly, and the service stays 'activating' for the duration. [16:02:58] ori: Also trivial, same type of cleanup for elastic ^ [16:03:30] I can hold off on the patch if we think it's worth further investigation, but it makes the nrpe check useless for timer jobs in the meantime. 
:-( [16:04:12] (03PS2) 10Mforns: Change percentage in EventLogging validation alert [puppet] - 10https://gerrit.wikimedia.org/r/230825 (https://phabricator.wikimedia.org/T108339) [16:04:20] (03CR) 10Ori.livneh: [C: 032] "makes sense" [puppet] - 10https://gerrit.wikimedia.org/r/221766 (owner: 10Chad) [16:04:21] Coren: I tend to agree most likely the unit is defective, and also I'd hate to see the monitor fail to notice good unit files that are truly failing when stuck in 'activating' [16:04:22] (03CR) 10Ottomata: [C: 032] Update JMX metrics names for Kafka 0.8.2+ [puppet/kafka] - 10https://gerrit.wikimedia.org/r/231021 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [16:04:48] bblack: Want to take a peek at the .service and .timer then and see if you find the bug? [16:04:50] perhaps the bad units need to switch from Type=forking to Type=simple ? [16:05:05] They're Type=oneshot [16:05:17] (03PS1) 10Ottomata: Update alerts and jmx for Kafka 0.8.2+ [puppet] - 10https://gerrit.wikimedia.org/r/231028 (https://phabricator.wikimedia.org/T106581) [16:05:24] (03CR) 10jenkins-bot: [V: 04-1] Update alerts and jmx for Kafka 0.8.2+ [puppet] - 10https://gerrit.wikimedia.org/r/231028 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [16:05:26] do-the-thing-then-go-away [16:05:29] (03PS2) 10Ottomata: Update alerts and jmx for Kafka 0.8.2+ [puppet] - 10https://gerrit.wikimedia.org/r/231028 (https://phabricator.wikimedia.org/T106581) [16:05:41] jdlrobson: do you have a config change already? [16:05:47] Coren: we could also just use a script and a cron entry for that stuff and skip systemd altogether, since it's such a strange pattern / use-case? [16:05:58] anyways, where is it at? [16:06:10] RemainAfterExit= ? [16:06:21] bblack: modules/labstore/templates/initscripts/replicate* [16:06:41] bblack: That's what I wanted to do, but everyone yelled at me and said 'use systemd' :-) [16:06:54] bblack: That said, it works perfectly fine. 
[16:07:04] and yeah ori's probably right [16:07:07] bblack: Just that the state while it runs is 'activating' rather than 'active' [16:07:12] RemainAfterExit sounds like it was designed for this problem [16:07:31] Behavior of oneshot is similar to simple; however, it is expected that the process has to exit before systemd starts follow-up units. RemainAfterExit= is particularly useful for this type of service. [16:07:34] (03Abandoned) 10Chad: MW releases: create shared build directory in /srv [puppet] - 10https://gerrit.wikimedia.org/r/230601 (owner: 10Chad) [16:07:36] RemainAfterExit= [16:07:37] Takes a boolean value that specifies whether the service shall be considered active even when all its processes exited. Defaults to no. [16:07:59] bblack: That does the /opposite/ of what we want for a timer! :-) [16:08:10] which is what? [16:08:28] you're asking for it to look "active" if it has recently run successfully and now exited, which sounds like what RemainAfterExit does [16:08:35] right? [16:09:10] No... I'm testing two things: whether it's currently running or whether it's not running now but has last started within $interval [16:09:39] (03CR) 10Ottomata: [C: 032] Update alerts and jmx for Kafka 0.8.2+ [puppet] - 10https://gerrit.wikimedia.org/r/231028 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [16:09:39] why does the check care whether it's currently running? [16:09:47] The only thing that's an issue is that if it's currently running, it shows as 'activating' rather than 'active'. Once it's done, it goes back to 'inactive' and the last start dates are all kosher. [16:10:24] do you still get a (new-ish) start date during activating? [16:10:47] Hm. You know, I didn't consider it might update early and not at the end. Lemme check. 
[16:10:59] !log ori@tin Synchronized php-1.26wmf18/includes/OutputPage.php: 2b247d3240: Output stylesheet links before other link elements in (duration: 00m 11s) [16:12:06] Nicolas: codezee just to check - we are not enabling default banners right now correct? [16:12:12] Yes [16:12:14] just deploying it to experiment with it [16:12:18] config changes later [16:12:34] codezee: should i be setting $wgWPBBannerProperty at this point? [16:12:53] bblack: Ah, good call. I can actually rely on just the ExecMainStartTimestamp... [16:13:01] jdlrobson: not needed, if wikidata banners are not required at this stage [16:13:03] jdlrobson: doing submodule update locally. this will take a little while. [16:13:15] bblack: Except that if I check the Result now while it's running I get the /previous/ run if the test happens during a run. [16:13:35] you need a time window on that anyways or it's racy vs execution times right? [16:13:36] (03PS1) 10Ottomata: Use correct class graphite::monitoring_threshold for Kafka graphite based alert [puppet] - 10https://gerrit.wikimedia.org/r/231031 [16:13:41] okay that will simplify things - this will allow you to just test with editor provided templates (make sure you make phabricator tasks for switching those two things on :)) [16:13:49] just make sure the time window can account for normal period + exec time and it should still be in-window [16:13:58] (03CR) 10Ottomata: [C: 032 V: 032] Use correct class graphite::monitoring_threshold for Kafka graphite based alert [puppet] - 10https://gerrit.wikimedia.org/r/231031 (owner: 10Ottomata) [16:14:00] !log ori@tin Synchronized php-1.26wmf17/includes/OutputPage.php: I2089b21fc: Output stylesheet links before other link elements in (duration: 00m 12s) [16:14:21] jdlrobson: Sure, I'll create that right now [16:14:35] Alright, I need to add a test in case there is no previous Result though - stopping the timer will lose that. 
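A minimal sketch of the freshness test discussed above, assuming the check only needs the unit's last start time (as in "rely on just the ExecMainStartTimestamp") and a window of normal period plus an allowance for execution time. The function names, the `exec_time` parameter, and the use of epoch seconds are assumptions for illustration; this is not the actual nrpe check from the patch. Converting the timestamp that systemd reports into epoch seconds is left out.

```python
def parse_show_output(text):
    """Parse KEY=VALUE lines, as printed by `systemctl show`, into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without '='
            props[key] = value
    return props


def timer_job_fresh(last_start, now, period, exec_time=0.0):
    """Return True if the job last started within period + exec_time seconds.

    last_start and now are epoch seconds (hypothetical inputs); last_start
    is None when the unit has no recorded start at all.
    """
    if last_start is None:
        return False  # no previous start recorded
    return (now - last_start) <= period + exec_time
```

Because the test looks only at the start time, it gives the same answer whether the unit currently reports 'activating' or 'inactive', which is the case that made the state-based check unreliable for timer jobs.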
[16:14:41] I mean, we don't really need to alert on that kind of thing unless it's truly broken, I'd even just go for period*2 [16:14:53] (03PS1) 10Jdlrobson: Deploy WikidataPageBanner extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/231032 (https://phabricator.wikimedia.org/T98029) [16:14:56] * Coren rejiggers the patch. [16:15:09] ^ kaldari [16:15:21] codezee: fyi P948 is the wikidata property id :) [16:15:39] jdlrobson: yes, I'm aware :) [16:15:58] bblack: Ah, not even. That survives a restart. Good. [16:16:31] PROBLEM - HHVM rendering on mw1144 is CRITICAL - Socket timeout after 10 seconds [16:16:41] PROBLEM - Apache HTTP on mw1144 is CRITICAL - Socket timeout after 10 seconds [16:16:51] i'll look into it [16:17:13] bblack: All of that said, while the test will no longer rely on it the unit's state will remain 'activating' while it's running. If you felt that was an issue it also needs to be looked into. :-) [16:17:28] !log mw1144 locked up due to StatCache, restarting [16:17:52] _joe_: blech, it still happens periodically, though it's quite rare nowadays [16:18:02] (03PS2) 10coren: nrpe: simplify systemd unit status check [puppet] - 10https://gerrit.wikimedia.org/r/231024 [16:18:10] bblack: ^^ [16:18:16] backtrace in mw1144:/tmp/hhvm.28643.bt, though nothing interesting there [16:18:31] RECOVERY - HHVM rendering on mw1144 is OK: HTTP OK: HTTP/1.1 200 OK - 65359 bytes in 0.212 second response time [16:18:35] <_joe_> ori: oh, well, when did that happen last? I think ~ 1 month ago? 
[16:18:50] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.037 second response time [16:19:17] yeah [16:19:24] (03CR) 10BBlack: [C: 031] nrpe: simplify systemd unit status check [puppet] - 10https://gerrit.wikimedia.org/r/231024 (owner: 10coren) [16:20:10] (03PS3) 10coren: nrpe: simplify systemd unit status check [puppet] - 10https://gerrit.wikimedia.org/r/231024 [16:20:20] we should probably do a generic nagios check on 'systemctl is-system-running' for all jessies too, seems like a nice cover for random issues [16:20:40] PROBLEM - puppet last run on mw2030 is CRITICAL puppet fail [16:20:50] between that and the systemd unit checks, we can probably get rid of direct process count checkers [16:21:38] bblack: That sounds like a good plan. [16:21:53] (03CR) 10coren: [C: 032] nrpe: simplify systemd unit status check [puppet] - 10https://gerrit.wikimedia.org/r/231024 (owner: 10coren) [16:22:00] that'd be nice, I was also working to export stats for running services from upstart/systemd if anyone is interested in reviewing, https://gerrit.wikimedia.org/r/#/c/224093/ [16:22:34] (03PS5) 1020after4: sql command: use slave server unless '--write' provided as an option before DB [puppet] - 10https://gerrit.wikimedia.org/r/223365 (https://phabricator.wikimedia.org/T105046) (owner: 10Alex Monk) [16:22:48] (03CR) 1020after4: [C: 031] sql command: use slave server unless '--write' provided as an option before DB [puppet] - 10https://gerrit.wikimedia.org/r/223365 (https://phabricator.wikimedia.org/T105046) (owner: 10Alex Monk) [16:22:56] godog: Oooo. Usage metrics. [16:23:05] per-unit [16:23:40] do they reset on restart of a service, or just on reboot of the machine? [16:24:08] Coren: yep! 
lacking some puppet integration yet with service_unit [16:24:16] bblack: on restart [16:24:31] 6operations, 10Deployment-Systems, 10RESTBase, 6Release-Engineering, 6Services: Get ops feedback regarding the use of SSH for deployment system control channel. - https://phabricator.wikimedia.org/T102687#1531968 (10mmodell) 5Open>3Resolved a:3mmodell [16:24:39] PROBLEM - HHVM busy threads on mw1144 is CRITICAL 50.00% of data above the critical threshold [86.4] [16:24:59] PROBLEM - HHVM queue size on mw1144 is CRITICAL 33.33% of data above the critical threshold [80.0] [16:25:31] _joe_ or ottomata (or anyone)...could you take a look at https://phabricator.wikimedia.org/T103000 and let us know next steps? [16:25:34] (03CR) 10Ori.livneh: [C: 031] diamond: add upstart/systemd service stats (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/224093 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [16:25:45] (it's a labs puppet patch) [16:25:47] 6operations, 10Deployment-Systems: Unhashable type: dict error when running salt --batch-size - https://phabricator.wikimedia.org/T99776#1531974 (10thcipriani) 5Open>3Resolved Haven't heard any reports of this error—if that's incorrect, please re-open. [16:27:26] RECOVERY - HHVM busy threads on mw1144 is OK Less than 30.00% above the threshold [57.6] [16:27:45] RECOVERY - HHVM queue size on mw1144 is OK Less than 30.00% above the threshold [10.0] [16:30:33] (03PS1) 10coren: labstore: ignore the replication snapshots with check_disk [puppet] - 10https://gerrit.wikimedia.org/r/231043 [16:34:00] codezee: are you on standby to test ? :) [16:34:15] jdlrobson: yes [16:39:17] roots: nutcracker on mw1123 needs to be restarted. 
3,177,689 errors in the last hour [16:39:36] in related news, we use the heck out of memcached [16:40:20] apergos: ^^ [16:41:21] !log restarted nutcracker on mw1123 [16:41:26] bd808: done [16:41:30] thanks mutante [16:41:56] jdlrobson: OK, I'm going to deploy the submodule addition right now. after that's done, we'll do the config change. [16:42:16] PROBLEM - Kafka Broker Messages In Per Second on graphite1001 is CRITICAL Anomaly detected: 0 data above and 45 below the confidence bounds [16:43:26] !! [16:44:30] (03PS2) 10Dzahn: re-add IP for node tungsten [dns] - 10https://gerrit.wikimedia.org/r/231025 (https://phabricator.wikimedia.org/T106563) [16:45:08] (03CR) 10Dzahn: [C: 032] "same IP it used before, no ping reply and not in DNS" [dns] - 10https://gerrit.wikimedia.org/r/231025 (https://phabricator.wikimedia.org/T106563) (owner: 10Dzahn) [16:46:58] 6operations, 5Patch-For-Review: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1532022 (10Dzahn) [16:47:27] 6operations, 5Patch-For-Review: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1471811 (10Dzahn) [16:48:01] (03CR) 10BryanDavis: "We need to make a hiera addition for deployment-logstash2 to follow this up or Logstash will stop working."
[puppet] - 10https://gerrit.wikimedia.org/r/207140 (owner: 10Chad) [16:48:43] (03PS6) 10BryanDavis: logstash: Count MediaWiki log events with statsd [puppet] - 10https://gerrit.wikimedia.org/r/230233 (https://phabricator.wikimedia.org/T100735) [16:48:45] (03PS4) 10BryanDavis: logstash: normalize "level" fields across log types [puppet] - 10https://gerrit.wikimedia.org/r/230922 [16:48:53] bblack: https://gerrit.wikimedia.org/r/#/c/231043/ if you have another minute - that's the last wonky test on the labstores [16:48:56] (03CR) 10Dzahn: "it may seem trivial but it rarely is" [puppet] - 10https://gerrit.wikimedia.org/r/207140 (owner: 10Chad) [16:49:11] (03CR) 10BryanDavis: "PS6 was a manual rebase." [puppet] - 10https://gerrit.wikimedia.org/r/230233 (https://phabricator.wikimedia.org/T100735) (owner: 10BryanDavis) [16:49:47] godog: Do you think it would be safe to try https://gerrit.wikimedia.org/r/#/c/230233/ today or tomorrow? (logstash -> statsd) [16:49:56] RECOVERY - puppet last run on mw2030 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:50:35] 6operations, 10RESTBase, 10RESTBase-Cassandra: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1532040 (10fgiunchedi) I'm trying the systemd instances first, (no puppet code review yet), basically the idea is to set `CASSANDRA_CONF` to point to a separate config... [16:53:01] bd808: yeah let's give it a try [16:53:44] bd808: today is fine too, I'm going to merge it shortly if you are around [16:54:22] godog: works for me. I'm here all day [16:54:58] (03PS1) 10Chad: Followup Ibd58670e: Also set auto_create_index for beta logstash [puppet] - 10https://gerrit.wikimedia.org/r/231049 [16:55:37] (03CR) 10Chad: "Would it not get handled by deployment-prep's common settings? If not... 
I5f74f8d6" [puppet] - 10https://gerrit.wikimedia.org/r/207140 (owner: 10Chad) [16:55:45] jdlrobson: looks like we're not going to finish in the window :( There's a security patch applied to 1.26wmf17 on tin which is going to complicate things (since doing a submodule update will overwrite it). We'll need to schedule another window to complete. Sorry! [16:56:40] hmpfff do we have to roll back everything you've done? [16:57:00] jdlrobson: also, something is wrong with hhvm on https://gerrit.wikimedia.org/r/#/c/231044/ [16:57:07] (03PS1) 10Dzahn: tungsten: change partman recipe to use RAID 10 [puppet] - 10https://gerrit.wikimedia.org/r/231050 (https://phabricator.wikimedia.org/T106563) [16:58:02] (03CR) 10BryanDavis: [C: 04-1] Followup Ibd58670e: Also set auto_create_index for beta logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/231049 (owner: 10Chad) [16:59:17] bd808: Ahhh, ok [16:59:36] ostriches: I should have been more clear in the first whine [16:59:38] In which case...how was it even working before? [16:59:39] :p [16:59:50] good question... [17:00:11] * ostriches bets [[Hiera:Deployment-prep]] [17:00:29] nerp [17:00:30] oh, no. I don't use role::elastcisearch [17:00:31] hmmm [17:00:37] I use the class directly [17:00:53] so the default "*" worked for me [17:01:12] that won't work anymore once we want firewalling [17:01:18] this is the labs role hiera problem [17:01:37] Ah bleh. Yeah. [17:01:40] mutante: what won't work? Using the es class not the role? [17:01:44] yes [17:02:04] well the role is very specific to cirrussearch [17:02:09] Not really. [17:02:12] I've cleaned it up. [17:02:14] Use hiera. 
[17:03:09] https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/manifests/role/elasticsearch.pp [17:03:10] :) [17:03:23] And I have a WIP to remove merge_threads from there too [17:04:11] ostriches: still bigger than mine -- https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/logstash.pp#L97-L112 [17:04:21] not by much though [17:04:46] Only material difference is rack/row detection and LVS that I see [17:05:00] And hot_threads, which you don't include [17:10:16] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 21.43% of data above the critical threshold [500.0] [17:10:39] (03CR) 10EBernhardson: Introduce new labs role for vagrant+lxc (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/230928 (owner: 10EBernhardson) [17:10:58] (03PS2) 10EBernhardson: Introduce new labs role for vagrant+lxc [puppet] - 10https://gerrit.wikimedia.org/r/230928 [17:12:41] (03CR) 10BryanDavis: "Cherry-picked to beta cluster again after updating default mapping to treat level as a not_analysed string for new indices. 
Works as expec" [puppet] - 10https://gerrit.wikimedia.org/r/230922 (owner: 10BryanDavis) [17:23:57] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] logstash: Count MediaWiki log events with statsd [puppet] - 10https://gerrit.wikimedia.org/r/230233 (https://phabricator.wikimedia.org/T100735) (owner: 10BryanDavis) [17:24:25] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [17:24:39] bd808: ^ merged, I'm looking at http://grafana.wikimedia.org/#/dashboard/db/graphite-eqiad [17:25:01] I'll force on one host to start it up [17:26:39] applied on logstash1001 (cron did it before I got there) [17:28:38] bd808: cool, the received under statsite should be jumping up a little, that's the statsd aggregators [17:29:23] seeing stats trickle into graphite [17:31:05] RECOVERY - Kafka Broker Messages In Per Second on graphite1001 is OK No anomaly detected [17:33:19] bd808: yeah I'm not seeing any significant difference so far [17:33:33] (03CR) 10Matanya: [C: 031] tungsten: change partman recipe to use RAID 10 [puppet] - 10https://gerrit.wikimedia.org/r/231050 (https://phabricator.wikimedia.org/T106563) (owner: 10Dzahn) [17:34:47] godog: forcing puppet on logstash1002 [17:35:10] bd808: I'm always amazed at how inefficient statsd is at volume, 40bytes packets over packets [17:37:06] godog: it's on on all 3 logstash hosts now so it should be nearing normal volume [17:38:13] bd808: yep, looks good to me so far [17:38:33] COOL! anomaly working?!
[17:41:55] 7Blocked-on-Operations, 6operations, 10Parsoid, 6Scrum-of-Scrums: Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments) - https://phabricator.wikimedia.org/T102039#1532114 (10akosiaris) [17:50:57] (03PS2) 10ArielGlenn: Add ssh key for new notebook [puppet] - 10https://gerrit.wikimedia.org/r/230920 (owner: 10Hoo man) [17:53:01] (03CR) 10ArielGlenn: [C: 032] Add ssh key for new notebook [puppet] - 10https://gerrit.wikimedia.org/r/230920 (owner: 10Hoo man) [17:56:51] (03PS1) 10Ottomata: List ReplicaManager JMX Kafka metrics individually [puppet/kafka] - 10https://gerrit.wikimedia.org/r/231064 [17:56:53] (03CR) 10jenkins-bot: [V: 04-1] List ReplicaManager JMX Kafka metrics individually [puppet/kafka] - 10https://gerrit.wikimedia.org/r/231064 (owner: 10Ottomata) [17:57:22] (03PS2) 10Ottomata: List ReplicaManager JMX Kafka metrics individually [puppet/kafka] - 10https://gerrit.wikimedia.org/r/231064 [17:57:48] FYI, I'm rolling back the WikidataPageBanner submodule addition to wmf17 since we weren't able to complete the deployment in time. Should be finished in a few minutes. [17:59:04] (03CR) 10Ottomata: [C: 032] List ReplicaManager JMX Kafka metrics individually [puppet/kafka] - 10https://gerrit.wikimedia.org/r/231064 (owner: 10Ottomata) [17:59:37] (03PS1) 10Ottomata: Update kafka module wiht ReplicaManager jmxtrans fix [puppet] - 10https://gerrit.wikimedia.org/r/231067 [18:00:05] twentyafterfour greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150812T1800). 
[18:00:41] (03CR) 10Ottomata: [C: 032] Update kafka module wiht ReplicaManager jmxtrans fix [puppet] - 10https://gerrit.wikimedia.org/r/231067 (owner: 10Ottomata) [18:02:19] just waiting for Zuul to merge https://gerrit.wikimedia.org/r/#/c/231063/ [18:04:08] done [18:06:16] (03PS3) 10ArielGlenn: Add Chris Steipp to analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/230142 (https://phabricator.wikimedia.org/T108227) (owner: 10Andrew Bogott) [18:10:35] (03PS4) 10ArielGlenn: Add Chris Steipp to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/230142 (https://phabricator.wikimedia.org/T108227) (owner: 10Andrew Bogott) [18:12:13] (03CR) 10ArielGlenn: [C: 032] Add Chris Steipp to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/230142 (https://phabricator.wikimedia.org/T108227) (owner: 10Andrew Bogott) [18:12:15] cmjohnson: hi, what's the status on those 8 other hadoop nodes? [18:13:33] kaldari: so everything is clear for the train deployment now? [18:18:28] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1532207 (10Ottomata) We are eventually going to need a way to get more than 1014 bytes of data from... 
[18:19:29] twentyafterfour: he might be gone now [18:20:53] ok I guess I'm just to assume it's all good then [18:24:13] (03PS1) 1020after4: group1 wikis to 1.26wmf18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/231071 [18:24:38] (03CR) 1020after4: [C: 032] group1 wikis to 1.26wmf18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/231071 (owner: 1020after4) [18:24:44] (03Merged) 10jenkins-bot: group1 wikis to 1.26wmf18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/231071 (owner: 1020after4) [18:25:05] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: group1 wikis to 1.26wmf18 [18:25:45] (03PS1) 10Ottomata: Use monitoring::graphite_threshold for Kafka alerts [puppet] - 10https://gerrit.wikimedia.org/r/231073 [18:26:48] (03CR) 10Ottomata: [C: 032] Use monitoring::graphite_threshold for Kafka alerts [puppet] - 10https://gerrit.wikimedia.org/r/231073 (owner: 10Ottomata) [18:27:59] uhm [18:28:01] 8981 error: Stack overflow [18:28:14] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to stat1002 for csteipp - https://phabricator.wikimedia.org/T108227#1532236 (10ArielGlenn) merged but there's an issue with puppet runs on stat1002. Last run was a day ago, the puppet process was still sitting around doing nothing. I shot it, tr... [18:28:39] that number doesn't seem to be increasing but... [18:29:45] ottomata: they look good sitting on the floor in the data center right now. I have a plan for them now that the db's I am removing have finished wiping. 
[18:30:03] (03CR) 10Ottomata: [C: 032] 0.8.2.1-3 release - fix for snappy 1.1.1.6 bug [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/230970 (owner: 10Ottomata) [18:34:22] (03PS1) 10Dereckson: Set site name for mr.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/231074 (https://phabricator.wikimedia.org/T104132) [18:39:53] RECOVERY - Disk space on labstore1002 is OK: DISK OK [18:41:20] (03PS2) 10Dzahn: tungsten: change partman recipe to use RAID 10 [puppet] - 10https://gerrit.wikimedia.org/r/231050 (https://phabricator.wikimedia.org/T106563) [18:41:27] (03CR) 10Dzahn: [C: 032] tungsten: change partman recipe to use RAID 10 [puppet] - 10https://gerrit.wikimedia.org/r/231050 (https://phabricator.wikimedia.org/T106563) (owner: 10Dzahn) [18:44:16] 6operations: revoke Joel Sahleen's access - https://phabricator.wikimedia.org/T108854#1532268 (10RobH) 3NEW a:3mark [18:44:18] 6operations: revoke Joel Sahleen's access - https://phabricator.wikimedia.org/T108854#1532268 (10RobH) a:5mark>3RobH [18:44:47] 6operations: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1532281 (10RobH) Just another example of our current offboarding being broken: Joel Sahleen's offboard was missed, now on T108854 [18:45:25] twentyafterfour: yes, sorry I missed your ping, was in a meeting [18:45:54] kaldari: it's ok, I went ahead and pushed wmf18 to group1 [18:46:00] cool [18:48:16] (03PS1) 10Dzahn: tungsten: set DHCP options to install jessie [puppet] - 10https://gerrit.wikimedia.org/r/231076 (https://phabricator.wikimedia.org/T106563) [18:49:06] (03PS2) 10Dzahn: tungsten: set DHCP options to install jessie [puppet] - 10https://gerrit.wikimedia.org/r/231076 (https://phabricator.wikimedia.org/T106563) [18:49:50] (03CR) 10Dzahn: [C: 032] tungsten: set DHCP options to install jessie [puppet] - 10https://gerrit.wikimedia.org/r/231076 (https://phabricator.wikimedia.org/T106563) (owner: 10Dzahn) [18:51:24] 6operations: revoke Joel 
Sahleen's access - https://phabricator.wikimedia.org/T108854#1532298 (10RobH) They are no longer on the staff page: https://wikimediafoundation.org/w/index.php?title=Template:Staff_and_contractors&diff=101224&oldid=101221 Also they aren't in our contact lists, but are members of escalat... [18:54:26] jdlrobson: you had Jimmy critique your map? nice! https://pbs.twimg.com/media/CK8TnwQWEAEMCuQ.png [18:55:29] !log hello bot, are you there? [18:55:35] thought so. [18:56:29] andrewbogott: somehow we accidentally killed production-logbot [18:56:41] valhallasw`cloud: ok, I will restart [18:56:45] (KeyboardInterrupt at 14:49:11) [18:56:56] 6operations: revoke Joel Sahleen's access - https://phabricator.wikimedia.org/T108854#1532320 (10RobH) I confirmed he left with Amir, so proceeding. [18:56:56] (03CR) 10Legoktm: [C: 04-1] Deploy WikidataPageBanner extension (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/231032 (https://phabricator.wikimedia.org/T98029) (owner: 10Jdlrobson) [18:56:58] I'll do it -- I'm logged in atm [18:57:04] working on fab-ifying the bot [18:57:11] (03PS2) 10BryanDavis: Followup Ibd58670e: Also set auto_create_index for beta logstash [puppet] - 10https://gerrit.wikimedia.org/r/231049 (owner: 10Chad) [18:57:38] (03PS1) 10RobH: revoke Joel Sahleen's access [puppet] - 10https://gerrit.wikimedia.org/r/231078 [18:58:20] (03CR) 10BryanDavis: "Cherry-picked to beta cluster" [puppet] - 10https://gerrit.wikimedia.org/r/231049 (owner: 10Chad) [18:58:24] (03CR) 10jenkins-bot: [V: 04-1] revoke Joel Sahleen's access [puppet] - 10https://gerrit.wikimedia.org/r/231078 (owner: 10RobH) [18:58:27] (03PS2) 10RobH: revoke Joel Sahleen's access [puppet] - 10https://gerrit.wikimedia.org/r/231078 [18:58:43] andrewbogott: I'm a bit confused by the set of config files though. Which bots are supposed to be running, and which ones aren't?
[18:59:08] (03CR) 10jenkins-bot: [V: 04-1] revoke Joel Sahleen's access [puppet] - 10https://gerrit.wikimedia.org/r/231078 (owner: 10RobH) [18:59:14] qa-morebots, labs-logbot, analytics-logbot and production-logbot are running now [18:59:20] blehhhhh double comma [18:59:36] but in confs there's also mostbots [18:59:38] (03PS3) 10RobH: revoke Joel Sahleen's access [puppet] - 10https://gerrit.wikimedia.org/r/231078 [18:59:42] valhallasw`cloud: I think that’s all of them. If there’s a config for e3 that can be deleted. [18:59:50] ah yeah, and e3 [18:59:51] ok [18:59:59] I don’t know what mostbots is, someone else made it. Maybe check and see what room it connects to? [19:00:16] (03CR) 10RobH: [C: 032] revoke Joel Sahleen's access [puppet] - 10https://gerrit.wikimedia.org/r/231078 (owner: 10RobH) [19:00:23] hm, "www.mediawiki.org/wiki/HHVM/Server_Admin_Log" [19:00:29] (03PS4) 10RobH: revoke Joel Sahleen's access [puppet] - 10https://gerrit.wikimedia.org/r/231078 [19:00:39] and #mediawiki-core? [19:01:00] ok, I think that’s still in use then [19:01:01] maybe [19:01:09] I don't think anyone uses that [19:01:21] (03CR) 10coren: [C: 031] "This all looks good to me. Have you attempted a dry run of this?" [puppet] - 10https://gerrit.wikimedia.org/r/229458 (https://phabricator.wikimedia.org/T104857) (owner: 10Andrew Bogott) [19:01:56] in any case, it's not running [19:02:02] so I'll just wait for complaints :-p [19:02:08] it'll be trivial to add it to the fabfile alter [19:02:11] later* [19:02:18] (03CR) 10Alex Monk: "That group has sudo access as www-data/apache, has this passed ops meeting review? Do you really want to grant full restricted access just" [puppet] - 10https://gerrit.wikimedia.org/r/230974 (https://phabricator.wikimedia.org/T108696) (owner: 10Dzahn) [19:02:57] 6operations: revoke Joel Sahleen's access - https://phabricator.wikimedia.org/T108854#1532325 (10Dzahn) His LinkedIn profile says he is at Adobe since March 2015. 
[19:09:24] (03CR) 10Dzahn: "no, it did not go through any reviews but you need a change to have something specific to review. it is merely a "translation" of the requ" [puppet] - 10https://gerrit.wikimedia.org/r/230974 (https://phabricator.wikimedia.org/T108696) (owner: 10Dzahn) [19:10:25] (03Abandoned) 10Dzahn: admins: add tjones to restricted [puppet] - 10https://gerrit.wikimedia.org/r/230974 (https://phabricator.wikimedia.org/T108696) (owner: 10Dzahn) [19:15:13] (03PS1) 10Ori.livneh: xenon: generate reversed flame graphs; prune old svgs [puppet] - 10https://gerrit.wikimedia.org/r/231081 [19:15:26] (03CR) 10Ori.livneh: [C: 032 V: 032] xenon: generate reversed flame graphs; prune old svgs [puppet] - 10https://gerrit.wikimedia.org/r/231081 (owner: 10Ori.livneh) [19:16:19] (03CR) 10Andrew Bogott: "yep, a dry run and also an active run on a single mostly-empty project. Seems fine." [puppet] - 10https://gerrit.wikimedia.org/r/229458 (https://phabricator.wikimedia.org/T104857) (owner: 10Andrew Bogott) [19:19:38] 6operations: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1532356 (10RobH) [19:19:40] 6operations: revoke Joel Sahleen's access - https://phabricator.wikimedia.org/T108854#1532354 (10RobH) 5Open>3Resolved Following items on https://office.wikimedia.org/wiki/VerboseOffboard#Ops : I've revoked his shell access, downgraded his wikitech (removed shell), and confirmed he is not in the wmf or nda... [19:19:46] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 5Patch-For-Review: Build 0.8.2.1 Kafka package and upgrade Kafka brokers - https://phabricator.wikimedia.org/T106581#1532357 (10JAllemandou) >>! In T106581#1531644, @Ottomata wrote: > Phew, after much difficulty, the 4 original Precise brokers are now ru... 
[19:23:24] PROBLEM - puppet last run on mw2133 is CRITICAL puppet fail [19:24:01] 6operations: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1532372 (10RobH) [19:25:13] PROBLEM - check_payments_wiki on payments1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Forbidden - string OK not found on https://127.0.0.1:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 418 bytes in 0.012 second response time [19:26:14] ^^ on the payments1004 situation, it has to do with modsecurity testing [19:27:46] (03PS1) 10Dzahn: Revert "tungsten: set DHCP options to install jessie" [puppet] - 10https://gerrit.wikimedia.org/r/231087 [19:28:04] (03PS1) 10Ori.livneh: Apache: set DeflateCompressionLevel to 9 [puppet] - 10https://gerrit.wikimedia.org/r/231088 [19:29:07] (03PS2) 10Dzahn: Revert "tungsten: set DHCP options to install jessie" [puppet] - 10https://gerrit.wikimedia.org/r/231087 [19:30:13] RECOVERY - check_payments_wiki on payments1004 is OK: HTTP OK: HTTP/1.1 200 OK - 249 bytes in 0.047 second response time [19:30:30] (03CR) 10Dzahn: [C: 032] Revert "tungsten: set DHCP options to install jessie" [puppet] - 10https://gerrit.wikimedia.org/r/231087 (owner: 10Dzahn) [19:31:21] (03PS1) 10Ottomata: Install python-pykafka for eventlogging server [puppet] - 10https://gerrit.wikimedia.org/r/231099 [19:31:34] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to fluorine for Trey Jones (tjones) - https://phabricator.wikimedia.org/T108696#1532392 (10Ironholds) Who /is/ the on-duty person? [19:33:57] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to stat1002 for csteipp - https://phabricator.wikimedia.org/T108227#1532409 (10ArielGlenn) pretty sure there's a problem with the nfs mount from dataset1001 (ferm rules maybe?), the mount works fine on the snapshot hosts but I can't even do an nfs... 
[19:35:15] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to fluorine for Trey Jones (tjones) - https://phabricator.wikimedia.org/T108696#1532416 (10Dzahn) we have a rotation there. this week: On Ops duty: apergos (@ArielGlenn) it can be found in the topic of the #wikimedia-operations channel [19:36:13] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to fluorine for Trey Jones (tjones) - https://phabricator.wikimedia.org/T108696#1532417 (10Dzahn) the discussion was: Alex Monk: "That group has sudo access as www-data/apache, has this passed ops meeting review? Do you really want to grant full... [19:42:55] (03PS1) 10BBlack: 403 for maps - T108765 [puppet] - 10https://gerrit.wikimedia.org/r/231141 [19:43:13] (03CR) 10BBlack: [C: 032 V: 032] 403 for maps - T108765 [puppet] - 10https://gerrit.wikimedia.org/r/231141 (owner: 10BBlack) [19:43:14] 6operations, 6Labs, 6Multimedia, 10wikitech.wikimedia.org, and 2 others: Some wikitech.wikimedia.org thumbnails broken (404) - https://phabricator.wikimedia.org/T93041#1532461 (10Krenair) https://wikitech.wikimedia.org/wiki/File:Wikimedia_labs_logo.svg I was trying to put this on https://wikitech.wikimedia... [19:44:23] hey ariel, want to try to just umount -f /mnt/data? [19:45:25] I was trying to nfs remount it [19:45:26] failed [19:45:30] I shot a bunch of things [19:45:32] aye [19:45:47] if you wanna whack away at it some [19:46:01] I see you were trying some ls es [19:47:41] ah there is a hung perl job [19:47:54] hm gone. [19:48:03] oh its unmounted, ook... [19:48:14] trying to remount [19:48:31] so ferm rules were put on ds1001 recently [19:49:35] fixes the statd, rpc-mountd and portmapper ports [19:49:43] maybe something needs to be restarted on stat1002?
[19:49:56] (03PS1) 10Faidon Liambotis: Remove home_pmtpa and svn client from bast1001 [puppet] - 10https://gerrit.wikimedia.org/r/231142 [19:49:58] (03PS1) 10Faidon Liambotis: Remove manifests/stages.pp [puppet] - 10https://gerrit.wikimedia.org/r/231143 [19:50:00] (03PS1) 10Faidon Liambotis: (WIP) Kill misc::limn & limn [puppet] - 10https://gerrit.wikimedia.org/r/231144 [19:50:10] (speaking of NFS :) [19:50:15] reviews welcome [19:50:35] RECOVERY - puppet last run on mw2133 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures [19:52:09] (03CR) 10Milimetric: "this is being used in labs, I'm not great with puppet but I could try to help if you tell me a bit more." [puppet] - 10https://gerrit.wikimedia.org/r/231144 (owner: 10Faidon Liambotis) [19:52:31] milimetric: ah! [19:52:48] I wasn't sure, that's why it's a WIP :) [19:52:59] limn1.eqiad.wmflabs uses the limn module to create some dashboards [19:53:07] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to stat1002 for csteipp - https://phabricator.wikimedia.org/T108227#1532486 (10Peachey88) >>! In T108227#1532409, @ArielGlenn wrote: > pretty sure there's a problem with the nfs mount from dataset1001 (ferm rules maybe?), the mount works fine on t... 
[19:53:17] we want to move them to a different setup anyway, so the limn module's days are numbered anyway [19:53:44] (03CR) 10Dzahn: [C: 031] "+1, just pending the warning mail to users" [puppet] - 10https://gerrit.wikimedia.org/r/231142 (owner: 10Faidon Liambotis) [19:53:49] but I'm happy to do whatever you see fit with it in the meantime paravoid [19:54:38] milimetric: manifests/* is generally deprecated, we're moving stuff to modules/ and role classes (mostly still under manifests/role/* for now) [19:54:50] milimetric: if you check the tree, limn is really one of the last ones there [19:56:05] milimetric: so anything that (re)moves misc/limn.pp would be progress :) [19:56:15] milimetric: I have no other meaningful input for the rest of the limn stuff yet [19:57:49] so there is also this: [19:57:57] modules/statistics/manifests/limn/ [19:58:22] first question would be "what is the /misc stuff as opposed to the stuff in the module" [19:58:32] and can it be merged into modules/statistics too [19:58:36] hm apergos yea dunno [19:59:03] I don't either, stumped [19:59:23] and it's starting to get to that time of night when my brain shuts down, which doesn't help much (11 pm) [19:59:43] naw, don't limn module is for hosting limn instances [19:59:53] whatever is in statistics limn is likely for generating / copying data that limn uses [20:00:04] gwicke cscott arlolra subbu: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150812T2000). [20:00:09] should there be a single "limn" module that does both? [20:00:27] 6operations, 6Labs, 6Multimedia, 10wikitech.wikimedia.org, and 2 others: Some wikitech.wikimedia.org thumbnails broken (404) - https://phabricator.wikimedia.org/T93041#1532495 (10Krenair) It looks like https://wikitech.wikimedia.org/w/images/thumb/6/60/Wikimedia_labs_logo.svg/600px-Wikimedia_labs_logo.svg.... 
[20:00:37] no [20:00:59] ottomata: what is the preferred way to get limn out of the "misc" structure? [20:01:11] oh, i htink what is in the misc stuff can be ported to a module [20:01:17] but the stat stuff is different [20:02:12] ottomata: alright, but it should _not_ be deleted? then https://gerrit.wikimedia.org/r/#/c/231144/ [20:02:34] (03PS2) 10Ori.livneh: Apache: set DeflateCompressionLevel to 9 [puppet] - 10https://gerrit.wikimedia.org/r/231088 [20:02:45] apergos: are snapshot hosts in .wmnet but stat1002 in wikimedia.org by chance? [20:02:52] looks [20:02:55] o [20:02:56] no [20:03:03] (03CR) 10Ori.livneh: [C: 032 V: 032] Apache: set DeflateCompressionLevel to 9 [puppet] - 10https://gerrit.wikimedia.org/r/231088 (owner: 10Ori.livneh) [20:03:05] they are all eqiad.wmnet [20:03:21] (03PS1) 10Ori.livneh: xenon: invert color scheme, not orientation [puppet] - 10https://gerrit.wikimedia.org/r/231150 [20:03:27] hrmmm, then the firewalling also should not be it [20:03:33] mutante: naw it is used in labs [20:03:40] oh, it is a module! [20:03:40] the rule is to accept from 10.0.0.0/8 [20:03:40] :) [20:03:42] (03PS2) 10Ori.livneh: xenon: invert color scheme, not orientation [puppet] - 10https://gerrit.wikimedia.org/r/231150 [20:03:42] right. [20:03:49] (03CR) 10Ori.livneh: [C: 032 V: 032] xenon: invert color scheme, not orientation [puppet] - 10https://gerrit.wikimedia.org/r/231150 (owner: 10Ori.livneh) [20:03:55] it should not be it [20:03:56] stat1002 is .wmnet [20:04:02] however at one point it worked and now it does not [20:04:18] apergos: what's a snapshot host? [20:04:19] 1001? [20:04:26] snapshot1001 2 3 4 [20:04:30] yup duhi shoudl just try it :) [20:04:33] (03PS1) 10Merlijn van Deen: Add Fabric deploy helper [debs/adminbot] - 10https://gerrit.wikimedia.org/r/231151 [20:04:36] maybe there are ACLs on network gear? 
[20:04:39] because analytics vlan [20:04:41] (03CR) 10jenkins-bot: [V: 04-1] Add Fabric deploy helper [debs/adminbot] - 10https://gerrit.wikimedia.org/r/231151 (owner: 10Merlijn van Deen) [20:04:42] maybe [20:04:43] and snapshot isnt? [20:04:49] it might be something to stat1002 [20:06:18] (03CR) 10Faidon Liambotis: "I'm not sure if you saw these on IRC, copying them here:" [puppet] - 10https://gerrit.wikimedia.org/r/231144 (owner: 10Faidon Liambotis) [20:06:59] ottomata: remember when we had to get ACLs fixed for ganglia aggregator on analytics.. think this is also the case for the NFS mount here? [20:07:04] i'm trying to figure out what nfs port to test telnet/nc to, 2049? [20:07:12] 2049 should be it [20:07:30] mutante: maybe, i'm trying to verify, except this was working before [20:07:30] I see a bunch of happily established connections from the snaps on dataset1001 to that poirt [20:07:42] (03PS2) 10Merlijn van Deen: Add Fabric deploy helper [debs/adminbot] - 10https://gerrit.wikimedia.org/r/231151 [20:07:44] i can't telnet to that though from snap1001 [20:07:44] (03CR) 10jenkins-bot: [V: 04-1] Add Fabric deploy helper [debs/adminbot] - 10https://gerrit.wikimedia.org/r/231151 (owner: 10Merlijn van Deen) [20:07:45] we had to set fixed ports for everything with the ferm rules [20:07:53] apergos: isnt it 32765 and 32767 ? [20:08:00] so maybe one of those is out of the range [20:08:08] have to go look tbh [20:08:37] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1532517 (10Tgr) So how about bumping the limit @BBlack found in T91347#1249751 to 8192, raising the... [20:08:39] (03PS3) 10Merlijn van Deen: Add Fabric deploy helper [debs/adminbot] - 10https://gerrit.wikimedia.org/r/231151 [20:09:47] i can't telnet to either of those either [20:10:06] i want a way to test nfs connectivity! 
:) from snapshot1001, how woudl you verify that you can connect? mutante? [20:10:24] statd is those two ports, nfs is 2049 I believe [20:10:26] mutante: it's a bit disappointing given the amount of work you put in recently for deb packaging, but I think the fabric option will save time in the long run. Still, thanks for all the effort you put into it in the last few weeks. [20:10:52] valhallasw`cloud: i think wrong user? [20:11:13] that's what the check command uses is 2049 [20:11:39] yeah 2049 i think is right [20:12:07] there are no rules for that then [20:12:36] ha, maybe there were just established connections [20:12:41] but how can it work from another host [20:12:42] and thats why snaps are still connected ? :) [20:12:48] that would make sense [20:13:01] cause i can't telnet to that port from snap1001 [20:13:02] mutante: I thought you had built the adminbot debs the last weeks? I may be confused [20:13:14] apergos: i think you need a ferm rule for 2049 on dataset1001 [20:13:39] remember that snapshot mounts work [20:13:54] apergos: i agree, it's probably just because they were established [20:14:15] valhallasw`cloud: i did build the package but i know nothing about something else replacing .debs [20:14:31] lovely [20:14:39] mutante: ah. That's the patchset I just submitted [20:15:05] (03PS1) 10Ori.livneh: Package for Trusty [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/231152 (https://phabricator.wikimedia.org/T107405) [20:15:22] valhallasw`cloud: using different deployment systems is already a problem [20:16:04] apergos: i can help.. hold on [20:17:23] how does the icinga nfs check work then? [20:17:26] because it does [20:17:54] (03CR) 10Dzahn: "i think we should keep using puppet and debs. 
the actual issue is why is it installed on so many hosts that are not even exec nodes" [debs/adminbot] - 10https://gerrit.wikimedia.org/r/231151 (owner: 10Merlijn van Deen) [20:18:15] not saying you're wrong, just trying to reason it out [20:19:44] (03PS1) 10Dzahn: dumps: add ferm rule for port 2049 from INTERNAL [puppet] - 10https://gerrit.wikimedia.org/r/231153 [20:19:56] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1532544 (10BBlack) >>! In T91347#1532207, @Ottomata wrote: > We are eventually going to need a way t... [20:20:26] apergos: runs on localhost via NRPE? [20:20:37] bah humbug. really? nrpe [20:20:38] ? [20:20:42] ok objection withdrawn [20:21:31] (03CR) 10Ori.livneh: [C: 032 V: 032] Package for Trusty [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/231152 (https://phabricator.wikimedia.org/T107405) (owner: 10Ori.livneh) [20:21:33] hmm, maybe not [20:21:42] check_tcp!2049 [20:22:11] 6operations, 10Datasets-General-or-Unknown, 7HHVM: Package mwbzutils for Trusty - https://phabricator.wikimedia.org/T107405#1532556 (10ori) 5Open>3Resolved a:5ArielGlenn>3ori [20:23:04] apergos: it's not NRPE, but it's that neon gets a wide exemption from standard class [20:23:10] that allows all ports [20:23:53] !log deployed the latest EventLogging master to eventlog1001 [20:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:24:02] (03PS2) 10Dzahn: dumps: add ferm rule for port 2049 from INTERNAL [puppet] - 10https://gerrit.wikimedia.org/r/231153 [20:24:08] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1532567 (10BBlack) Trying to answer this for myself, the original truncated URL from the top of this... 
[20:26:43] (03CR) 10Merlijn van Deen: "It always needs to be deployed on approximately 15 hosts individually, which makes deploying (and reverting if something goes wrong!) slow" [debs/adminbot] - 10https://gerrit.wikimedia.org/r/231151 (owner: 10Merlijn van Deen) [20:26:45] (03CR) 10ArielGlenn: [C: 031] "don't know if it's going to make any differenc ebut it won't hurt" [puppet] - 10https://gerrit.wikimedia.org/r/231153 (owner: 10Dzahn) [20:27:46] (03CR) 10Dzahn: [C: 032] dumps: add ferm rule for port 2049 from INTERNAL [puppet] - 10https://gerrit.wikimedia.org/r/231153 (owner: 10Dzahn) [20:28:03] PROBLEM - HHVM rendering on mw1061 is CRITICAL: Connection refused [20:29:02] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1532575 (10Ottomata) > In a URL? What kind of data are considering eventlogging here that's so long?... [20:29:23] PROBLEM - Apache HTTP on mw1061 is CRITICAL: Connection refused [20:29:53] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1532580 (10Ottomata) > A lot of this seems like it could be compressed/reduced, and/or duplicates in... [20:30:18] apergos: "df -h" non stat1002 now fast [20:30:25] apergos: try the mount again? [20:30:48] ottomata: /mnt/data looks normal [20:31:27] remount was fine, other processes seem now responsive [20:31:36] good [20:31:59] this means I need to see how that nfs check works... tomorrow when I'm awake [20:32:47] woot, thank you! [20:33:08] apergos: mutante said it was probably because icinga host has an allow all rule or somethign [20:33:24] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? 
or udp packet size?) - https://phabricator.wikimedia.org/T91347#1532587 (10Tgr) >>! In T91347#1532544, @BBlack wrote: > Also, 2000-ish is a whole lot more palatable... [20:33:46] yes I saw. but that's different than looking at the manifest myself :-) [20:34:41] starting parsoind deploy [20:34:49] (03PS2) 10Ottomata: Install python-pykafka for eventlogging server [puppet] - 10https://gerrit.wikimedia.org/r/231099 [20:34:54] (03CR) 10Ottomata: [C: 032 V: 032] Install python-pykafka for eventlogging server [puppet] - 10https://gerrit.wikimedia.org/r/231099 (owner: 10Ottomata) [20:35:04] RECOVERY - puppet last run on stat1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [20:35:10] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to stat1002 for csteipp - https://phabricator.wikimedia.org/T108227#1532590 (10ArielGlenn) 5Open>3Resolved dzahn caught it, needed a ferm rule for 2049 (nfs) on dataset1001, that got missed when those rules were set up. puppet now runs, cstei... 
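[editor's note] The question raised above ("i want a way to test nfs connectivity!") can be answered with a plain TCP connect test, which is also what the `check_tcp!2049` check quoted in the log does. The following is a minimal sketch, not the tooling actually used in the channel: the function name `tcp_port_open` is hypothetical, and the ports come from the discussion (2049 for nfs, 32765 and 32767 mentioned for statd). It only shows whether a new TCP connection is accepted. As the log points out, connections established before a firewall change can keep working, so mounts that still work elsewhere do not prove the port is open.

```python
import socket
import sys


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a new TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (host name is an example): python check_port.py dataset1001.example 2049
    target_host, target_port = sys.argv[1], int(sys.argv[2])
    state = "open" if tcp_port_open(target_host, target_port) else "blocked or closed"
    print(f"{target_host}:{target_port} is {state}")
```

Running this from the client host before and after a firewall rule change would distinguish "port filtered" from "service down" only partially: a dropped packet shows up as a timeout, a closed port as an immediate refusal, and both return False here.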
[20:35:54] apergos: not a mystery, just ACCEPT all -- neon.wikimedia.org anywhere [20:37:41] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to stat1002 for csteipp - https://phabricator.wikimedia.org/T108227#1532595 (10Dzahn) fix was https://gerrit.wikimedia.org/r/#/c/231153/ [20:40:47] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to stat1002 for csteipp - https://phabricator.wikimedia.org/T108227#1532605 (10Dzahn) [20:41:39] 10Ops-Access-Reviews: Analytics-users membership for csteipp - https://phabricator.wikimedia.org/T108351#1532606 (10Dzahn) 5Open>3Resolved [stat1002:~] $ id csteipp uid=2246(csteipp) gid=500(wikidev) groups=500(wikidev),731(analytics-privatedata-users) [20:41:41] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to stat1002 for csteipp - https://phabricator.wikimedia.org/T108227#1515688 (10Dzahn) [20:41:43] officially clocking out for the day [20:42:27] whyyy [20:42:42] !log deployed parsoid version a271c205 [20:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:43:12] 10Ops-Access-Requests, 6operations, 6Services, 7Icinga, 7Monitoring: give services team permissions to send commands in icinga - https://phabricator.wikimedia.org/T105228#1532622 (10Dzahn) 5stalled>3Open [20:44:22] 10Ops-Access-Requests, 6operations, 6Services, 7Icinga, 7Monitoring: give services team permissions to send commands in icinga - https://phabricator.wikimedia.org/T105228#1439066 (10Dzahn) to achieve this `./modules/icinga/files/cgi.cfg` needs to be edited, LDAP changes should not be needed [20:45:28] (03PS1) 10Alex Monk: Fix SVG conversion on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/231158 (https://phabricator.wikimedia.org/T93041) [20:46:24] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1427 bytes in 0.120 second response time [20:46:43] (03PS2) 10Dzahn: admins: delete 
mailman-users group [puppet] - 10https://gerrit.wikimedia.org/r/230977 [20:47:17] (03CR) 10Dzahn: [C: 032] admins: delete mailman-users group [puppet] - 10https://gerrit.wikimedia.org/r/230977 (owner: 10Dzahn) [20:50:58] 6operations, 6Security: can we get rid of rsvg security patch? - https://phabricator.wikimedia.org/T104147#1532640 (10dpatrick) p:5Triage>3Normal a:3Krenair [20:51:17] 6operations, 6Security-Team: can we get rid of rsvg security patch? - https://phabricator.wikimedia.org/T104147#1408628 (10dpatrick) [20:55:03] PROBLEM - puppet last run on hafnium is CRITICAL Puppet has 1 failures [20:55:23] 6operations, 6Security-Team: can we get rid of rsvg security patch? - https://phabricator.wikimedia.org/T104147#1532652 (10Krenair) I imagine it's still needed on tin, terbium, tmh100[12], and snapshot100[1-4] [20:56:44] RECOVERY - Apache HTTP on mw1061 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.087 second response time [20:57:24] RECOVERY - HHVM rendering on mw1061 is OK: HTTP OK: HTTP/1.1 200 OK - 66493 bytes in 0.294 second response time [20:58:16] (03PS1) 10Ori.livneh: Make snapshot::packages compatible with trusty [puppet] - 10https://gerrit.wikimedia.org/r/231161 [20:58:22] 6operations, 6Security-Team: can we get rid of rsvg security patch? 
- https://phabricator.wikimedia.org/T104147#1532666 (10Krenair) a:5Krenair>3None But really I don't know enough about librsvg to determine this [20:58:30] (03CR) 10Ori.livneh: [C: 032 V: 032] Make snapshot::packages compatible with trusty [puppet] - 10https://gerrit.wikimedia.org/r/231161 (owner: 10Ori.livneh) [21:01:28] (03PS1) 10Ori.livneh: xenon-generate-svgs: make the palette of reversed flame graphs blue [puppet] - 10https://gerrit.wikimedia.org/r/231163 [21:01:42] (03CR) 10Ori.livneh: [C: 032 V: 032] xenon-generate-svgs: make the palette of reversed flame graphs blue [puppet] - 10https://gerrit.wikimedia.org/r/231163 (owner: 10Ori.livneh) [21:04:39] (03PS1) 10Ori.livneh: Package for Trusty [debs/utfnormal] - 10https://gerrit.wikimedia.org/r/231165 [21:06:26] (03PS1) 10Nemo bis: [Planet Wikimedia] Add Content Translation to the English Planet [puppet] - 10https://gerrit.wikimedia.org/r/231166 [21:06:27] (03CR) 10Ori.livneh: [C: 032 V: 032] Package for Trusty [debs/utfnormal] - 10https://gerrit.wikimedia.org/r/231165 (owner: 10Ori.livneh) [21:06:48] (03CR) 10Ori.livneh: [C: 032 V: 032] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/231166 (owner: 10Nemo bis) [21:07:28] lightning fast :) [21:10:27] matanya, hey [21:13:23] 6operations, 6Security-Team: can we get rid of rsvg security patch? - https://phabricator.wikimedia.org/T104147#1532700 (10Krenair) Specifically, we run rsvg-convert version "2.36.1 (Wikimedia)" on precise (according to `rsvg-convert --version`). Are you asking whether we still need to do this? Since someone h... [21:15:22] 6operations, 6Security-Team: can we get rid of rsvg security patch? 
- https://phabricator.wikimedia.org/T104147#1532708 (10Reedy) Presumably when all mw servers are running trusty we can remove the hack above [21:16:47] 6operations, 7HHVM, 7Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#1532710 (10Reedy) [21:16:49] 6operations, 6Security-Team: can we get rid of rsvg security patch? - https://phabricator.wikimedia.org/T104147#1532709 (10Reedy) [21:18:44] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1532729 (10BBlack) >>! In T91347#1532587, @Tgr wrote: > Actually all of that except the webHost come... [21:20:24] PROBLEM - puppet last run on fluorine is CRITICAL puppet fail [21:25:58] springle: is anything still blocking RBR for mariadb? [21:31:37] (03PS1) 10Alex Monk: Rename wmfVersionNumber to wmgVersionNumber [mediawiki-config] - 10https://gerrit.wikimedia.org/r/231169 (https://phabricator.wikimedia.org/T45956) [21:35:39] 6operations, 6Security-Team: can we get rid of rsvg security patch? 
- https://phabricator.wikimedia.org/T104147#1532810 (10Ricordisamoa) [21:39:27] (03PS1) 10Madhuvishy: [WIP] eventlogging: Add statsd_host param to the mysql consumer url [puppet] - 10https://gerrit.wikimedia.org/r/231170 (https://phabricator.wikimedia.org/T105935) [21:40:13] (03CR) 10TTO: Allow import from any WMF project to any other (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO) [21:43:05] (03PS4) 10TTO: Allow import from any Labs/Beta Clusterproject to any other [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) [21:44:14] RECOVERY - puppet last run on fluorine is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [21:47:08] (03CR) 10Hoo man: [C: 031] "I don't want to block this on one WikibaseClient feature." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230247 (https://phabricator.wikimedia.org/T108101) (owner: 10Legoktm) [21:53:21] (03PS1) 10Andrew Bogott: Remove network_host setting. [puppet] - 10https://gerrit.wikimedia.org/r/231177 [22:00:19] (03CR) 10Rush: [C: 031] "this seems like cross version cruft, and in the spirit of trying to narrow down cloning labnet let's remove it so reduce our variables. A" [puppet] - 10https://gerrit.wikimedia.org/r/231177 (owner: 10Andrew Bogott) [22:00:40] (03CR) 10Andrew Bogott: [C: 032] Remove network_host setting. 
[puppet] - 10https://gerrit.wikimedia.org/r/231177 (owner: 10Andrew Bogott) [22:03:53] (03CR) 10CSteipp: [C: 031] Isolate wikidata.org cookies and CORS policies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230247 (https://phabricator.wikimedia.org/T108101) (owner: 10Legoktm) [22:04:24] (03PS1) 10BryanDavis: beta: Disable authentication for Kibana [puppet] - 10https://gerrit.wikimedia.org/r/231179 (https://phabricator.wikimedia.org/T76784) [22:09:54] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 0.183 second response time [22:14:39] (03CR) 10Greg Grossmeier: [C: 031] "Yup." [puppet] - 10https://gerrit.wikimedia.org/r/231179 (https://phabricator.wikimedia.org/T76784) (owner: 10BryanDavis) [22:15:09] 6operations, 10Traffic, 10Wikimedia-DNS, 5Patch-For-Review: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1532978 (10CaitVirtue) Yes, if it's not too much trouble caitlin@ would be great. Thanks! [22:17:27] (03CR) 10BryanDavis: "Cherry-picked and applied on beta cluster." [puppet] - 10https://gerrit.wikimedia.org/r/231179 (https://phabricator.wikimedia.org/T76784) (owner: 10BryanDavis) [22:17:32] !log Resumed the support desk conversion. [22:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:18:49] 6operations, 10Traffic, 10Wikimedia-DNS, 5Patch-For-Review: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1532989 (10BBlack) What should caitlin@ alias to in @wikimedia.org ? [22:18:51] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1532990 (10Tgr) >>! In T91347#1532729, @BBlack wrote: > Yeah, but they're there in the original requ... 
[22:20:14] 6operations, 10Traffic, 10Wikimedia-DNS, 5Patch-For-Review: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1532997 (10BBlack) Also, re the rest of the stuff in https://phabricator.wikimedia.org/T107940#1518032 - can we get some confirmat... [22:21:18] !log Killed support desk conversion again to review XDebug information. [22:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:25:16] 6operations, 10Traffic, 10Wikimedia-DNS, 5Patch-For-Review: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1533011 (10CCogdill_WMF) @BBlack I'm working on that confirmation, and will let you now! I'm guessing caitlin@benefactor.wikimedi... [22:27:14] PROBLEM - nova-compute process on labvirt1007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [22:37:34] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1427 bytes in 0.149 second response time [22:51:03] PROBLEM - nova-compute process on labvirt1004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [22:51:03] RECOVERY - nova-compute process on labvirt1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [22:51:25] PROBLEM - nova-compute process on labvirt1009 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [22:54:53] RECOVERY - nova-compute process on labvirt1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [22:55:35] PROBLEM - nova-network process on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-network [22:56:04] (03PS1) 10Andrew Bogott: Revert "Remove network_host setting." 
[puppet] - 10https://gerrit.wikimedia.org/r/231189 [22:56:54] PROBLEM - nova-compute process on labvirt1007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [22:57:09] (03CR) 10Andrew Bogott: [C: 032 V: 032] "self-verifying because I suspect this change broke jenkins" [puppet] - 10https://gerrit.wikimedia.org/r/231189 (owner: 10Andrew Bogott) [22:59:34] RECOVERY - nova-network process on labnet1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-network [23:00:04] RoanKattouw ostriches rmoen Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150812T2300). [23:00:54] PROBLEM - nova-compute process on labvirt1004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [23:01:01] I have some things to add [23:01:12] but legoktm, want to go first? [23:01:21] sure [23:01:41] (03PS2) 10Legoktm: Isolate wikidata.org cookies and CORS policies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230247 (https://phabricator.wikimedia.org/T108101) [23:01:48] (03CR) 10Legoktm: [C: 032] Isolate wikidata.org cookies and CORS policies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230247 (https://phabricator.wikimedia.org/T108101) (owner: 10Legoktm) [23:01:54] (03Merged) 10jenkins-bot: Isolate wikidata.org cookies and CORS policies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230247 (https://phabricator.wikimedia.org/T108101) (owner: 10Legoktm) [23:02:01] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1533178 (10atgo) p:5Triage>3Normal [23:02:52] !log legoktm@tin Synchronized wmf-config: Isolate wikidata.org cookies and CORS policies (duration: 00m 12s) [23:02:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:03:09] * legoktm test [23:03:10] s [23:04:28] connection refused from labs [23:05:17] Krenair: ok, I'm done [23:05:43] Vito, see -labs [23:06:29] (03PS1) 10Dzahn: fermium: add rsync server to sync from sodium [puppet] - 10https://gerrit.wikimedia.org/r/231190 [23:06:49] oh Krenair, I see [23:07:48] (03PS2) 10Alex Monk: fix incorrect whitespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230729 (owner: 10EBernhardson) [23:07:54] (03CR) 10Alex Monk: [C: 032] fix incorrect whitespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230729 (owner: 10EBernhardson) [23:08:00] (03Merged) 10jenkins-bot: fix incorrect whitespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230729 (owner: 10EBernhardson) [23:08:18] (03CR) 10Alex Monk: [C: 04-1] Set an explicit 'wgLanguageCode' entry for metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228618 (https://phabricator.wikimedia.org/T90612) (owner: 10Legoktm) [23:08:33] PROBLEM - nova-compute process on labvirt1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [23:08:44] PROBLEM - nova-compute process on labvirt1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [23:09:54] PROBLEM - nova-compute process on labvirt1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [23:10:45] PROBLEM - nova-compute process on labvirt1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [23:11:55] PROBLEM - nova-compute process on labvirt1006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [23:12:34] Is toollabs do..oh..I see [23:12:53] (03CR) 10Alex Monk: [C: 032] Fix SVG conversion on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/231158 (https://phabricator.wikimedia.org/T93041) (owner: 
10Alex Monk) [23:13:14] RECOVERY - nova-compute process on labvirt1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [23:13:24] (03Merged) 10jenkins-bot: Fix SVG conversion on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/231158 (https://phabricator.wikimedia.org/T93041) (owner: 10Alex Monk) [23:13:33] PROBLEM - nova-network process on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-network [23:13:53] RECOVERY - nova-compute process on labvirt1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [23:13:54] RECOVERY - nova-compute process on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [23:13:55] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/230729 (duration: 00m 13s) [23:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:15:47] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/231158/ (duration: 00m 13s) [23:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:16:46] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/231158/ (duration: 00m 11s) [23:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:17:35] hoorah: https://wikitech.wikimedia.org/w/thumb.php?f=Wikimedia_labs_logo.svg&width=600 [23:17:52] And now this works as well! 
https://wikitech.wikimedia.org/w/images/thumb/6/60/Wikimedia_labs_logo.svg/600px-Wikimedia_labs_logo.svg.png [23:19:43] PROBLEM - nova-compute process on labvirt1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [23:19:44] PROBLEM - nova-compute process on labvirt1006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [23:19:44] PROBLEM - nova-compute process on labvirt1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [23:20:43] RECOVERY - nova-compute process on labvirt1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [23:21:34] https://wikitech.wikimedia.org/wiki/File:Wikimedia_labs_logo.svg looks so much better [23:21:44] RECOVERY - nova-compute process on labvirt1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [23:21:51] that was all broken images until I pushed that patch [23:22:11] PDFs are still broken of course [23:22:34] PROBLEM - RAID on fluorine is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:23:03] PROBLEM - nova-compute process on labvirt1009 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [23:23:23] RECOVERY - nova-network process on labnet1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-network [23:24:24] RECOVERY - nova-compute process on labvirt1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [23:24:33] RECOVERY - nova-compute process on labvirt1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [23:24:34] RECOVERY - nova-compute process on labvirt1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [23:25:03] RECOVERY - nova-compute process on labvirt1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [23:25:23] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1415 bytes in 0.184 second response time [23:25:43] RECOVERY - nova-compute process on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [23:25:43] RECOVERY - nova-compute process on labvirt1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [23:26:04] RECOVERY - nova-compute process on labvirt1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [23:26:34] RECOVERY - RAID on fluorine is OK Active: 10, Working: 10, Failed: 0, Spare: 0 [23:29:28] 6operations, 6Labs, 6Multimedia, 10wikitech.wikimedia.org, and 2 others: Some wikitech.wikimedia.org thumbnails broken (404) - https://phabricator.wikimedia.org/T93041#1533300 (10Krenair) That looks much better now. So, PDFs: ```krenair@tin:/srv/mediawiki-staging (master)$ dpkg-query -S `which pdfinfo` pop... [23:34:34] PROBLEM - Check size of conntrack table on fluorine is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:34:43] PROBLEM - RAID on fluorine is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:35:03] PROBLEM - SSH on fluorine is CRITICAL - Socket timeout after 10 seconds [23:35:23] PROBLEM - puppet last run on fluorine is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:36:09] (03CR) 10Alex Monk: "Did we get this checked by someone who speaks the language?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228477 (https://phabricator.wikimedia.org/T104132) (owner: 10Dereckson) [23:36:33] 7Blocked-on-Operations, 6Collaboration-Team-Backlog, 10Flow, 3Collaboration-Team-Current, and 2 others: Separate reference tables by wiki - https://phabricator.wikimedia.org/T107204#1533360 (10Catrope) [23:36:38] (03CR) 10Alex Monk: "Did we get this checked by someone who speaks the language?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/231074 (https://phabricator.wikimedia.org/T104132) (owner: 10Dereckson) [23:37:04] RECOVERY - SSH on fluorine is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2wmfprecise2 (protocol 2.0) [23:37:23] RECOVERY - puppet last run on fluorine is OK Puppet is currently enabled, last run 44 minutes ago with 0 failures [23:38:34] RECOVERY - Check size of conntrack table on fluorine is OK nf_conntrack is 0 % full [23:41:34] PROBLEM - configured eth on fluorine is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:41:43] PROBLEM - dhclient process on fluorine is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:41:44] PROBLEM - salt-minion processes on fluorine is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:41:59] (03CR) 10Alex Monk: [C: 032] Rename wmfVersionNumber to wmgVersionNumber [mediawiki-config] - 10https://gerrit.wikimedia.org/r/231169 (https://phabricator.wikimedia.org/T45956) (owner: 10Alex Monk) [23:42:25] (03Merged) 10jenkins-bot: Rename wmfVersionNumber to wmgVersionNumber [mediawiki-config] - 10https://gerrit.wikimedia.org/r/231169 (https://phabricator.wikimedia.org/T45956) (owner: 10Alex Monk) [23:42:33] RECOVERY - RAID on fluorine is OK Active: 10, Working: 10, Failed: 0, Spare: 0 [23:44:37] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/231169/ (duration: 00m 12s) [23:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:45:24] RECOVERY - configured eth on fluorine is OK - interfaces up [23:45:33] RECOVERY - salt-minion processes on fluorine is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:45:33] RECOVERY - dhclient process on fluorine is OK: PROCS OK: 0 processes with command name dhclient [23:45:50] wtf? [23:46:18] ? [23:47:10] !log krenair@tin Synchronized php-1.26wmf18/extensions/WikimediaMaintenance: https://gerrit.wikimedia.org/r/#/c/231193/ (duration: 00m 12s) [23:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:47:26] (03CR) 10Jforrester: "It transliterates to "Vikibuksa", I think." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/231074 (https://phabricator.wikimedia.org/T104132) (owner: 10Dereckson) [23:47:46] ori, the list of commits waiting to be deployed on wmf17 [23:48:04] Some of them are awight's DonationInterface backports [23:48:05] however [23:48:48] There's also kaldari's addition and removal of WikidataPageBanner [23:54:00] I just ignored them [23:54:43] PROBLEM - DPKG on fluorine is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:54:52] !log krenair@tin Synchronized php-1.26wmf17/extensions/WikimediaMaintenance: https://gerrit.wikimedia.org/r/#/c/231194/ (duration: 00m 12s) [23:54:58] !log fluorine is struggling due to I941660b5; I'm fixing. [23:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:56:43] James_F, https://translate.google.com/#auto/en/%E0%A4%B5%E0%A4%BF%E0%A4%95%E0%A4%BF%E0%A4%AC%E0%A5%81%E0%A4%95%E0%A5%8D%E0%A4%B8 [23:57:00] I'm kind of surprised google got that [23:57:13] Now I wonder how much to trust google translate on this :p [23:57:24] PROBLEM - salt-minion processes on fluorine is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:57:24] PROBLEM - dhclient process on fluorine is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:58:13] PROBLEM - Check size of conntrack table on fluorine is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:58:14] PROBLEM - RAID on fluorine is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:58:30] 6operations, 10Traffic, 10Wikimedia-DNS, 5Patch-For-Review: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1533500 (10CaitVirtue) @CCogdill_WMF to cvirtue@wikimedia,org please [23:59:03] PROBLEM - puppet last run on fluorine is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:59:14] PROBLEM - configured eth on fluorine is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:59:26] (03CR) 10Alex Monk: "https://translate.google.com/#auto/en/%E0%A4%B5%E0%A4%BF%E0%A4%95%E0%A4%BF%E0%A4%AC%E0%A5%81%E0%A4%95%E0%A5%8D%E0%A4%B8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/231074 (https://phabricator.wikimedia.org/T104132) (owner: 10Dereckson)