[01:20:33] PROBLEM - puppet last run on mw1143 is CRITICAL: CRITICAL: Puppet has 1 failures [01:47:22] RECOVERY - puppet last run on mw1143 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:22:05] !log l10nupdate@tin Synchronized php-1.27.0-wmf.4/cache/l10n: l10nupdate for 1.27.0-wmf.4 (duration: 06m 44s) [02:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:25:42] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.4) at 2015-11-02 02:25:42+00:00 [02:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:19:13] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 11.54% of data above the critical threshold [100000000.0] [04:28:23] PROBLEM - puppet last run on mw1045 is CRITICAL: CRITICAL: Puppet has 1 failures [04:44:08] (03PS1) 10Ori.livneh: tune gitblit settings to improve performance [puppet] - 10https://gerrit.wikimedia.org/r/250369 [04:44:45] (03CR) 10Ori.livneh: [C: 032 V: 032] "This service has been all but abandoned, so no one should mind if I give it a crack." [puppet] - 10https://gerrit.wikimedia.org/r/250369 (owner: 10Ori.livneh) [04:47:52] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 13.79% of data above the critical threshold [100000000.0] [04:55:12] RECOVERY - puppet last run on mw1045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:07:33] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 10.71% of data above the critical threshold [100000000.0] [05:09:03] RECOVERY - Restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [05:10:12] (03PS1) 1020after4: iridium system-wide gitconfig needs http.proxy [puppet] - 10https://gerrit.wikimedia.org/r/250370 [05:11:11] (03CR) 1020after4: "I thought I already added this once before but somehow it's been undone." 
[puppet] - 10https://gerrit.wikimedia.org/r/250370 (owner: 1020after4) [05:13:03] RECOVERY - Restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [05:15:03] PROBLEM - Disk space on logstash1005 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 113359 MB (3% inode=99%) [05:25:13] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 10.71% of data above the critical threshold [100000000.0] [05:33:07] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Nov 2 05:33:07 UTC 2015 (duration 33m 6s) [05:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:45:17] (03CR) 1020after4: iridium system-wide gitconfig needs http.proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/250370 (owner: 1020after4) [05:59:12] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [06:30:43] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:12] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:13] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:42] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:43] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:45] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 3 failures [06:33:03] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:13] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:13] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 3 failures [06:33:14] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:14] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:42] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:44:02] RECOVERY - Disk space on logstash1005 is OK: DISK OK [06:56:42] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:56:50] (03PS1) 10Gergő Tisza: Use $::mail_smarthost for SMTP_HOST [puppet] - 10https://gerrit.wikimedia.org/r/250373 (https://phabricator.wikimedia.org/T116709) [06:57:33] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:34] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:57:54] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:57:54] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:03] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:58:03] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:58:12] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:58:12] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:58:13] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 44 
seconds ago with 0 failures [06:58:24] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:32] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:55:28] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Requests for addition to the #project-creators group (in comments) - https://phabricator.wikimedia.org/T706#1772716 (10Qgil) [08:06:25] (03CR) 10Muehlenhoff: "I suppose that should rather be included in the analytics::kafka::server class." [puppet] - 10https://gerrit.wikimedia.org/r/250068 (owner: 10Dzahn) [08:15:47] (03CR) 10Muehlenhoff: [C: 031] snapshot: no $hostname checks in node groups, use role [puppet] - 10https://gerrit.wikimedia.org/r/250082 (owner: 10Dzahn) [08:24:27] legoktm: https://github.com/JuanPotato/Legofy [08:25:48] (03CR) 10Muehlenhoff: [C: 04-1] "I wouldn't change that one, since it's just a temporary measure and not part of the standard role." [puppet] - 10https://gerrit.wikimedia.org/r/250076 (owner: 10Dzahn) [08:26:21] (03CR) 10Jcrespo: [C: 031] mariadb: 32 lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/249038 (owner: 10Dzahn) [08:50:51] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: restbase endpoints health checks timing out - https://phabricator.wikimedia.org/T116739#1772761 (10mobrovac) [08:50:53] 6operations, 10RESTBase, 6Services: restbase endpoint reporting incorrect content-encoding: gzip - https://phabricator.wikimedia.org/T116911#1772759 (10mobrovac) 5Open>3Resolved Truncating the `mobile-html` data CF and a restart did the trick. [08:52:13] (03PS2) 10Giuseppe Lavagetto: maintenance: move update special pages off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/249706 (https://phabricator.wikimedia.org/T116728) [08:54:49] 6operations, 7Database: Drop `user_daily_contribs` table from all production wikis - https://phabricator.wikimedia.org/T115711#1772763 (10jcrespo) 5Resolved>3Open @ori, I do not know that you did here, but you did not delete the views: ``` mysql -BN -A -h labsdb1001.eqiad.wmnet enwiki_p -e "SHOW TABLES l... [08:57:04] (03CR) 10Filippo Giunchedi: [C: 04-1] swift: no "if $hostname" in node blocks, use role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/250072 (owner: 10Dzahn) [08:57:07] 6operations, 7Database: Drop `user_daily_contribs` table from all production wikis - https://phabricator.wikimedia.org/T115711#1772769 (10ori) >>! In T115711#1772763, @jcrespo wrote: > @ori, I do not know that you did here Sigh. This has to do with my pathological tendency to confuse labs with beta cluster. I... [08:57:22] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [08:57:41] !log restbase deploying 8036232 [08:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:59:12] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [08:59:53] what's this ^^ ? 
[09:00:11] curl -v localhost:1970/api works on sca100x [09:00:59] <_joe_> uhm seems like citoid refused connections for an interval of time [09:01:02] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [09:01:03] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [09:01:17] <_joe_> mobrovac: the problem was gone by the time you checked [09:02:07] ah kk [09:02:13] thnx _joe_ [09:02:19] (03PS1) 10Ori.livneh: Restore ability to clone from gitblit [puppet] - 10https://gerrit.wikimedia.org/r/250377 (https://phabricator.wikimedia.org/T117390) [09:02:59] (03CR) 10Ori.livneh: [C: 032] Restore ability to clone from gitblit [puppet] - 10https://gerrit.wikimedia.org/r/250377 (https://phabricator.wikimedia.org/T117390) (owner: 10Ori.livneh) [09:04:34] (03PS1) 10Jcrespo: Delete user_daily_contribs from the views in labs [software] - 10https://gerrit.wikimedia.org/r/250378 (https://phabricator.wikimedia.org/T115711) [09:04:38] !sal [09:04:38] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. [09:07:28] !log deleting user_daily_contribs view on labs (it points to a non-existent table) [09:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:08:35] jynus: sorry about the misunderstanding on that task [09:11:47] nothing to say sorry, just leave those kind of maintenance to me, I have several scripts that allow me to do it with a single click [09:12:03] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [09:12:09] (03PS1) 10Filippo Giunchedi: install_server: prepend 'deb' to internal repo [puppet] - 10https://gerrit.wikimedia.org/r/250380 (https://phabricator.wikimedia.org/T94177) [09:15:33] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [09:16:12] <_joe_> mobrovac: I'd say something is definitely going on with citoid [09:16:24] <_joe_> didn't you feel better when we were in the dark? [09:16:52] <_joe_> https://arslank.files.wordpress.com/2012/10/ignorance-is-bliss1.jpg?w=300&h=300 [09:17:11] (03CR) 10Giuseppe Lavagetto: [C: 032] maintenance: move update special pages off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/249706 (https://phabricator.wikimedia.org/T116728) (owner: 10Giuseppe Lavagetto) [09:17:18] (03PS3) 10Giuseppe Lavagetto: maintenance: move update special pages off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/249706 (https://phabricator.wikimedia.org/T116728) [09:18:35] hehe _joe_ [09:18:54] which monkey are you? [09:18:55] logstash for Citoid reports warnings such as "Maximum call stack size exceeded" and "Invalid host supplied" [09:18:55] :P [09:19:01] and "Maximum number of allowed redirects reached" .. [09:19:05] (03PS1) 10Alexandros Kosiaris: maps: long description about tileratorui's statefulness [puppet] - 10https://gerrit.wikimedia.org/r/250382 [09:19:29] though that is only 27 events over the last six hours... 
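The citoid health checks flapping above time out on /api while a plain `curl -v localhost:1970/api` succeeds; a rough way to reproduce what the checker exercises is to actually ask citoid to cite something, with a hard timeout. The query parameters and the 10-second limit below are assumptions for illustration, not the exact check definition.

```
# Hypothetical manual probe of citoid on an sca100x host, approximating the
# flapping "citoid endpoints health" check: request a citation for a URL and
# give up after 10 seconds. Endpoint shape and timeout are assumed, not taken
# from the check's spec.
curl --max-time 10 \
  'http://localhost:1970/api?format=mediawiki&search=http%3A%2F%2Fexample.com%2F'
```

If this hangs for tens of seconds while the bare /api probe returns quickly, that would point at the zotero-backed citation path rather than citoid's HTTP front end, which matches the zotero restart !logged shortly after.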
[09:19:39] (03PS2) 10Giuseppe Lavagetto: maintenance: move tor jobs off of terbium [puppet] - 10https://gerrit.wikimedia.org/r/249707 [09:19:49] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/249707 (owner: 10Giuseppe Lavagetto) [09:20:53] <_joe_> mobrovac: I'm a fusion of the three: I don't see,, don't listen I don't talk, and if I were there, I were asleep [09:20:54] hashar: these are ok (apart from the maximum call stack size) [09:23:07] zotero seems to be acting up [09:23:11] *sigh* [09:24:14] !log citoid restarted zotero on sca100x [09:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:24:58] <_joe_> mobrovac: thanks for looking into it [09:25:08] /66/10 [09:25:08] np [09:30:33] (03CR) 10Paladox: tune gitblit settings to improve performance (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/250369 (owner: 10Ori.livneh) [09:35:12] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 918 [09:36:03] (03PS1) 10Nemo bis: Use full URL in $wgNoticeHideUrls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T116609) [09:39:06] <_joe_> !log moved tor exit node, update_special_pages off of terbium to mw1152 [09:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:39:12] <_joe_> Nemo_bis: [09:39:22] <_joe_> uh sorry, tab fail [09:39:42] <_joe_> I was about to ask you to tell me if anyone complains about torblock not working properly [09:39:51] <_joe_> I tested the transition but well... [09:40:12] RECOVERY - check_mysql on db1008 is OK: Uptime: 8700740 Threads: 1 Questions: 102259430 Slow queries: 58984 Opens: 100889 Flush tables: 2 Open tables: 64 Queries per second avg: 11.752 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [09:41:07] _joe_: I did not hear anything. Usually the best way to check whether someone is unhappy with torblock is to verify whether there was a spike in (manual) blocks by sysops under [[m:NOP]] [09:41:26] <_joe_> Nemo_bis: ok [09:41:43] <_joe_> Nemo_bis: well if anyone is not happy, I assume we will notice in a few hours [09:41:47] <_joe_> not right away [09:41:53] <_joe_> thanks a lot! [09:42:19] <_joe_> again, I have no reason to think this won't work correctly [09:44:22] (03Abandoned) 10Alexandros Kosiaris: Revert "maps: Add tileratorui service" [puppet] - 10https://gerrit.wikimedia.org/r/249697 (https://phabricator.wikimedia.org/T116062) (owner: 10Alexandros Kosiaris) [09:46:26] (03CR) 10Alexandros Kosiaris: [C: 031] "After discussing it in the above change and staying with the current state of affairs, I am +1ing this. 
Needs an OK from ops meeting thoug" [puppet] - 10https://gerrit.wikimedia.org/r/249501 (https://phabricator.wikimedia.org/T112914) (owner: 10Yurik) [09:47:15] (03PS2) 10Alexandros Kosiaris: maps: long description about tileratorui's statefulness [puppet] - 10https://gerrit.wikimedia.org/r/250382 [09:48:06] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] maps: long description about tileratorui's statefulness [puppet] - 10https://gerrit.wikimedia.org/r/250382 (owner: 10Alexandros Kosiaris) [09:49:05] (03CR) 10Muehlenhoff: [C: 031] install_server: prepend 'deb' to internal repo [puppet] - 10https://gerrit.wikimedia.org/r/250380 (https://phabricator.wikimedia.org/T94177) (owner: 10Filippo Giunchedi) [09:49:33] (03CR) 10Gilles: [C: 04-1] "Undesirable for security reasons, as discussed on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/250291 (https://phabricator.wikimedia.org/T35186) (owner: 10Ori.livneh) [09:52:36] 6operations, 10MediaWiki-JobRunner, 7Wikimedia-log-errors: RunJobs.php fails to be executed on labswiki - https://phabricator.wikimedia.org/T117394#1772839 (10jcrespo) 3NEW [10:01:37] (03PS1) 10Alexandros Kosiaris: Remove pollux's mgmt IPs [dns] - 10https://gerrit.wikimedia.org/r/250387 (https://phabricator.wikimedia.org/T117182) [10:10:05] !log powering off pollux. migrating to a VM [10:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:15:23] 6operations, 10Wikidata, 7Database, 7Performance: number of database updates multiplied x3 since 29 November - https://phabricator.wikimedia.org/T117398#1772889 (10jcrespo) 3NEW [10:16:27] 6operations, 10Wikidata, 7Database, 7Performance: number of database updates multiplied x3 since 29 November - https://phabricator.wikimedia.org/T117398#1772896 (10ori) [10:18:55] (03PS2) 10Filippo Giunchedi: install_server: prepend 'deb' to internal repo [puppet] - 10https://gerrit.wikimedia.org/r/250380 (https://phabricator.wikimedia.org/T94177) [10:19:01] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] install_server: prepend 'deb' to internal repo [puppet] - 10https://gerrit.wikimedia.org/r/250380 (https://phabricator.wikimedia.org/T94177) (owner: 10Filippo Giunchedi) [10:20:56] 6operations, 10Wikidata, 7Database, 7Performance: number of database updates multiplied x3 since 29 October - https://phabricator.wikimedia.org/T117398#1772913 (10akosiaris) [10:25:02] 6operations, 10vm-requests: Site: 2 VM request for OIT LDAP mirror - https://phabricator.wikimedia.org/T117183#1772916 (10akosiaris) pollux VM done [10:25:35] (03PS1) 10Alexandros Kosiaris: pollux: migrate into a VM [puppet] - 10https://gerrit.wikimedia.org/r/250388 (https://phabricator.wikimedia.org/T117183) [10:26:19] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 9.09% of data above the critical threshold [500.0] [10:31:19] (03CR) 10Alexandros Kosiaris: [C: 032] pollux: migrate into a VM [puppet] - 10https://gerrit.wikimedia.org/r/250388 (https://phabricator.wikimedia.org/T117183) (owner: 10Alexandros Kosiaris) [10:31:27] (03PS2) 10Alexandros Kosiaris: pollux: migrate into a VM [puppet] - 10https://gerrit.wikimedia.org/r/250388 (https://phabricator.wikimedia.org/T117183) [10:31:59] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:33:29] <_joe_> https://grafana.wikimedia.org/dashboard/db/varnish-http-errors is awesomely better than what is on gdash [10:33:42] <_joe_> I think we should make that page a redirect to this one [10:34:19] (03CR) 10Alexandros Kosiaris: [V: 032] pollux: 
migrate into a VM [puppet] - 10https://gerrit.wikimedia.org/r/250388 (https://phabricator.wikimedia.org/T117183) (owner: 10Alexandros Kosiaris) [10:34:44] (03PS1) 10Jcrespo: Monitor x1 replication on dbstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/250392 (https://phabricator.wikimedia.org/T75047) [10:36:49] PROBLEM - Host ms-be1019 is DOWN: PING CRITICAL - Packet loss = 100% [10:37:34] that's me ^ [10:38:30] RECOVERY - Host ms-be1019 is UP: PING OK - Packet loss = 0%, RTA = 1.21 ms [10:43:17] (03PS1) 10Alexandros Kosiaris: pollux: Reinstall as jessie [puppet] - 10https://gerrit.wikimedia.org/r/250394 [10:43:21] (03CR) 10jenkins-bot: [V: 04-1] pollux: Reinstall as jessie [puppet] - 10https://gerrit.wikimedia.org/r/250394 (owner: 10Alexandros Kosiaris) [10:43:43] (03PS2) 10Alexandros Kosiaris: pollux: Reinstall as jessie [puppet] - 10https://gerrit.wikimedia.org/r/250394 [10:43:50] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] pollux: Reinstall as jessie [puppet] - 10https://gerrit.wikimedia.org/r/250394 (owner: 10Alexandros Kosiaris) [10:44:00] (03PS1) 10Giuseppe Lavagetto: gdash: redirect reqstats page to grafana [puppet] - 10https://gerrit.wikimedia.org/r/250395 [10:47:50] <_joe_> reviews welcome ^^ [10:50:28] (03CR) 10Jcrespo: [C: 031] "I'm ok with it, but I would also understand if anyone may object and would prefer just a link for now." [puppet] - 10https://gerrit.wikimedia.org/r/250395 (owner: 10Giuseppe Lavagetto) [10:51:10] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [10:52:00] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [10:52:21] !log Restarting Jenkins to upgade Gearman plugin from 0.1.2 to 0.1.3 "Send node labels back on build completion" [10:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:52:59] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [10:55:49] <_joe_> ok I'll take a look at zoter [10:55:59] <_joe_> mobrovac: what was up with it earlier? [11:02:40] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [11:04:55] _joe_: I am going to venture a guess. 
Someone is trying to cite http://example.org or http://youtube.org [11:05:41] <_joe_> akosiaris: actually zotero is working fine now [11:05:52] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [11:05:54] <_joe_> well, "fine" by its standards, ofc [11:07:40] <_joe_> akosiaris: the sad thing is that citoid itself doesn't seem to record any failure of any type [11:08:29] zotero(3)(+0000019): Translators: Looking for translators for http://example.com/ [11:08:36] and beats me when that happened [11:08:38] cause zotero [11:09:28] _joe_: that's at least fixable by us [11:09:30] zotero isn't [11:10:32] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [11:11:30] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [11:16:10] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [11:16:23] oh come on [11:17:51] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [11:18:41] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [11:18:50] (03CR) 10Alex Monk: iridium system-wide gitconfig needs http.proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/250370 (owner: 1020after4) [11:19:31] 6operations, 6Labs, 10wikitech.wikimedia.org, 7Wikimedia-log-errors: RunJobs.php fails to be executed on labswiki - https://phabricator.wikimedia.org/T117394#1773011 (10Krenair) [11:22:08] (03PS2) 10Jcrespo: Monitor x1 replication on dbstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/250392 (https://phabricator.wikimedia.org/T75047) [11:23:01] hm _joe_, we might want to increase the timeout for the service checker, one of the examples that uses zotero takes 30s to complete [11:23:30] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [11:24:13] (03PS2) 10Addshore: Retain daily.* graphite metrics for longer [puppet] - 10https://gerrit.wikimedia.org/r/247866 [11:24:20] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [11:25:09] (03PS3) 10Addshore: Retain daily.* graphite metrics for longer [puppet] - 10https://gerrit.wikimedia.org/r/247866 [11:27:53] (03PS4) 10Addshore: Retain daily.* graphite metrics for longer [puppet] - 10https://gerrit.wikimedia.org/r/247866 (https://phabricator.wikimedia.org/T117402) [11:30:16] (03PS3) 10Jcrespo: Monitor x1 replication on dbstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/250392 (https://phabricator.wikimedia.org/T75047) [11:30:31] (03CR) 10Jcrespo: [C: 032] Monitor x1 replication on dbstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/250392 (https://phabricator.wikimedia.org/T75047) (owner: 10Jcrespo) [11:36:17] (03CR) 10Filippo Giunchedi: [C: 031] gdash: redirect reqstats page to grafana [puppet] - 10https://gerrit.wikimedia.org/r/250395 (owner: 10Giuseppe Lavagetto) [11:36:28] <_joe_> mobrovac: argh, we might want to change the example [11:37:19] _joe_: we could try, i'm not sure that wouldn't be needed either way [11:37:29] it's zotero's mysterious ways [11:38:03] 6operations, 
10ops-eqiad: Rack/Setup ms-be1019-ms-1021 - https://phabricator.wikimedia.org/T116542#1773074 (10fgiunchedi) 5Open>3Resolved all three machines provisioned [11:38:39] we are running low on disk space on some logstash hosts [11:47:04] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [11:49:08] 6operations, 7Database, 5Patch-For-Review: Replicate flowdb from X1 to analytics-store - https://phabricator.wikimedia.org/T75047#1773140 (10jcrespo) 5Open>3Resolved [11:52:52] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [12:02:16] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [12:07:25] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [12:07:51] (03PS1) 10Glaisher: Set $wgCategoryCollation to uca-sr at srprojects (-wikipedia, wiktionary) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250401 (https://phabricator.wikimedia.org/T115806) [12:11:35] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [12:16:36] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [12:17:15] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [12:19:08] (03PS1) 10Faidon Liambotis: Drain ulsfo, planned on-site maintainance [dns] - 10https://gerrit.wikimedia.org/r/250405 (https://phabricator.wikimedia.org/T116928) [12:19:21] !log mobileapps deploying 59e8c30 [12:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:20:46] YuviPanda: please take a look at https://gerrit.wikimedia.org/r/#/c/238778/ when you have the time [12:20:55] I have updated the patch according to your comments [12:22:26] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: puppet should safely manage cassandra start/stop - https://phabricator.wikimedia.org/T103134#1773189 (10fgiunchedi) @akosiaris that sounds good too. If I'm understanding correctly the normal state would be as it is now, however to prevent puppet from star... 
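On the T103134 thread above (puppet safely managing cassandra start/stop): the approach mentioned a little further down this log is to mask the unit, so that neither a reboot nor a puppet run can start Cassandra before an operator means it to. A minimal sketch, assuming a systemd host with a `cassandra` unit:

```
# Mask the unit before maintenance: systemd points it at /dev/null, so any
# "start" request (boot, puppet, a stray operator) is refused.
sudo systemctl mask cassandra
# ... kernel upgrade, reboot, repair work ...
# Unmask and start deliberately once the node is allowed to rejoin the cluster.
sudo systemctl unmask cassandra
sudo systemctl start cassandra
```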
[12:25:55] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [12:27:27] PROBLEM - salt-minion processes on scandium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:27:36] PROBLEM - salt-minion processes on analytics1015 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:27:46] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [12:28:06] PROBLEM - salt-minion processes on mw2061 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:28:06] PROBLEM - salt-minion processes on ganeti2006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:28:16] PROBLEM - salt-minion processes on ms-be2015 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:28:16] PROBLEM - salt-minion processes on mw1049 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:28:26] PROBLEM - salt-minion processes on db2023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:28:26] PROBLEM - salt-minion processes on analytics1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:28:36] PROBLEM - salt-minion processes on mc2014 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:28:55] PROBLEM - salt-minion processes on mw1246 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:28:56] PROBLEM - salt-minion processes on mw1188 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:29:25] PROBLEM - salt-minion processes on mw1144 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:29:56] RECOVERY - salt-minion processes on mw2061 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:30:06] RECOVERY - salt-minion processes on ms-be2015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:30:25] RECOVERY - salt-minion processes on analytics1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:30:49] RECOVERY - salt-minion processes on mw1246 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:33:02] RECOVERY - salt-minion processes on ganeti2006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:33:24] 6operations, 10Incident-20150422-LabsOutage, 5Patch-For-Review: packages not upgraded post-install - https://phabricator.wikimedia.org/T94177#1773203 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi bug was due to how `base-installer` in trusty handles `apt-setup/local0/repository` (doesn't add a `deb ` pr... [12:33:45] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: puppet should safely manage cassandra start/stop - https://phabricator.wikimedia.org/T103134#1773210 (10akosiaris) >>! In T103134#1773189, @fgiunchedi wrote: > @akosiaris that sounds good too. If I'm understanding correctly the normal state would be as it... 
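On the burst of salt-minion process alerts above: the check simply counts processes whose command line matches the regex quoted in the alert text. A hand-run equivalent, with plugin arguments that are an assumption rather than the exact puppetised check command:

```
# Count salt-minion processes the way the alert text describes
# ("0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion").
pgrep -fc '^/usr/bin/python /usr/bin/salt-minion' || echo "no salt-minion running"

# Or with the stock nagios plugin: critical if fewer than one match (flags assumed).
/usr/lib/nagios/plugins/check_procs -c 1: \
  --ereg-argument-array='^/usr/bin/python /usr/bin/salt-minion'
```

The alerts clearing a couple of minutes later on each host is consistent with the minions simply being restarted.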
[12:34:06] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [12:35:02] RECOVERY - salt-minion processes on scandium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:35:23] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: Puppet has 1 failures [12:35:40] PROBLEM - salt-minion processes on pollux is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:36:01] RECOVERY - salt-minion processes on mw1144 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:37:02] PROBLEM - Corp OIT LDAP Mirror on pollux is CRITICAL: Could not bind to the LDAP server [12:37:22] ^that's me [12:37:30] damn race [12:38:00] race? :) [12:38:25] just reinstalled it [12:38:30] as a VM on jessie [12:39:33] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [12:39:53] ACKNOWLEDGEMENT - Corp OIT LDAP Mirror on pollux is CRITICAL: Could not bind to the LDAP server alexandros kosiaris non-affecting anything, reinstalled pollux, investigating [12:40:22] and there's you acknowledgement page so you can relax, guys [12:41:25] while you're there [12:41:40] can you rename ldap-mirror.wikimedia.org to something that's more descriptive? [12:41:42] RECOVERY - salt-minion processes on analytics1015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:42:17] 7Puppet, 10Continuous-Integration-Config: Move RuboCop job from experimental pipeline to the usual pipelines for operations/puppet - https://phabricator.wikimedia.org/T110019#1773216 (10zeljkofilipin) a:5zeljkofilipin>3None [12:42:49] paravoid: I 've got this already https://gerrit.wikimedia.org/r/#/c/249828/ [12:43:11] paravoid: and a change in puppet for using that. want to suggest an alternative ? [12:43:12] RECOVERY - salt-minion processes on mw1188 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:43:20] oh that's nice too [12:43:26] but also [12:43:30] it's the *corp* mirror [12:43:36] ldap-corp-mirror ? [12:43:41] yeah or something like that [12:43:46] going once [12:43:49] going twice [12:43:49] or even ldap-corp.eqiad [12:44:07] going once [12:44:09] going twice [12:44:14] but I don't particularly care, as long as it's not confusing compared to the other ldap [12:44:25] done!!! ldap-corp sold to the gentleman in the paravoid costume :P [12:44:32] lol [12:44:56] I don't care either. let's just go for ldap-corp. sounds fine to me [12:45:22] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [12:45:27] great [12:45:28] wfm [12:45:40] what's with the citoid alerts? [12:45:51] I think mobrovac and _joe_ are on top of it [12:46:01] k [12:46:17] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: puppet should safely manage cassandra start/stop - https://phabricator.wikimedia.org/T103134#1773220 (10mobrovac) We used `systemctl mask cassandra` during the kernel upgrade on restbase100[1-6] and that worked pretty nicely. However, the trick here is th... [12:46:40] yeah paravoid, just ignore citoid [12:46:42] <_joe_> paravoid: yup it's zotero timing out on a check, because of its own backend call to an external service [12:47:00] ok! 
:) [12:47:01] <_joe_> so I suggested we disable that specific check in the spec [12:47:02] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [12:47:04] _joe_: should we jsut ack it? [12:47:15] <_joe_> mobrovac: it's flapping, so it would come back [12:47:20] _joe_: disable the check entirely? [12:47:25] for that endpoint [12:47:28] <_joe_> I might schedule a downtime for it [12:48:25] what actually needs to be disabled permanently is zotero, but don't get me started... [12:48:38] haha [12:48:52] zotero and akosiaris - love at first sight [12:49:51] RECOVERY - salt-minion processes on mc2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:50:52] RECOVERY - salt-minion processes on mw1049 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:51:11] RECOVERY - salt-minion processes on db2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:56:21] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [13:10:36] (03PS1) 10Faidon Liambotis: mirrors: switch from ftp2 to ftp3.us.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/250413 [13:10:54] (03CR) 10Faidon Liambotis: [C: 032 V: 032] mirrors: switch from ftp2 to ftp3.us.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/250413 (owner: 10Faidon Liambotis) [13:14:12] PROBLEM - salt-minion processes on scandium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [13:15:01] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [13:16:39] (03PS2) 10Alexandros Kosiaris: ldap-mirror: Add per DC CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/249828 [13:16:41] (03PS2) 10Alexandros Kosiaris: ldap-mirror: Remove the old non-DC specific CNAME [dns] - 10https://gerrit.wikimedia.org/r/249829 [13:16:43] (03PS1) 10Alexandros Kosiaris: ldap-mirror: Remove the old DC specific names [dns] - 10https://gerrit.wikimedia.org/r/250415 [13:18:41] PROBLEM - puppet last run on mw2019 is CRITICAL: CRITICAL: puppet fail [13:18:52] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [13:20:41] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [13:22:23] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [13:28:38] (03PS1) 10TTO: Close wikimania2014wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250416 (https://phabricator.wikimedia.org/T105675) [13:31:43] (03CR) 10TTO: "Please note, I am rarely available during SWAT windows, so if this is to be deployed at SWAT, it will have to be owned by someone else." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250416 (https://phabricator.wikimedia.org/T105675) (owner: 10TTO) [13:33:01] (03CR) 10Alex Monk: "you're going to reuse the hostname pollux on another machine?" 
[dns] - 10https://gerrit.wikimedia.org/r/250387 (https://phabricator.wikimedia.org/T117182) (owner: 10Alexandros Kosiaris) [13:33:52] (03CR) 10Faidon Liambotis: [C: 032] Drain ulsfo, planned on-site maintainance [dns] - 10https://gerrit.wikimedia.org/r/250405 (https://phabricator.wikimedia.org/T116928) (owner: 10Faidon Liambotis) [13:34:12] !log draining ulsfo, T116928 [13:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:34:42] RECOVERY - salt-minion processes on scandium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:34:50] (03CR) 10Alex Monk: "I'm sure that's fine, testing closure of a wiki isn't hard." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250416 (https://phabricator.wikimedia.org/T105675) (owner: 10TTO) [13:35:43] (03CR) 10Faidon Liambotis: [C: 04-1] Labs instance subnet allocation for Codfw (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/249919 (owner: 10Rush) [13:37:12] (03CR) 10Faidon Liambotis: "…and a couple more pedantic ones :)" (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/249919 (owner: 10Rush) [13:38:44] (03CR) 10Faidon Liambotis: "I'm not sure if by "codfw labs row a reservations overlap" you meant that you're fixing the overlap (the commit message isn't very descrip" [dns] - 10https://gerrit.wikimedia.org/r/249906 (owner: 10Rush) [13:42:46] (03PS3) 10Alexandros Kosiaris: ldap-mirror: Add per DC CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/249828 [13:42:55] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] ldap-mirror: Add per DC CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/249828 (owner: 10Alexandros Kosiaris) [13:44:24] (03CR) 10Faidon Liambotis: "I think with Ori's changes last week we can stop using Ganglia altogether for this and keep this in Graphite only (so we'd kill this entir" [puppet] - 10https://gerrit.wikimedia.org/r/249345 (owner: 10Dzahn) [13:44:55] (03PS1) 10Alexandros Kosiaris: role::openldap::corp: move into role module [puppet] - 10https://gerrit.wikimedia.org/r/250417 [13:44:57] (03PS1) 10Alexandros Kosiaris: ldap-corp: Populate the per DC certificates [puppet] - 10https://gerrit.wikimedia.org/r/250418 [13:44:59] (03PS1) 10Alexandros Kosiaris: ldap-corp: Instruct openldap certificate usage based on DC [puppet] - 10https://gerrit.wikimedia.org/r/250419 [13:45:01] (03PS1) 10Alexandros Kosiaris: ldap-mirror: Remove the vary on DC name to complete the migration [puppet] - 10https://gerrit.wikimedia.org/r/250420 [13:45:36] 6operations, 5Continuous-Integration-Scaling: Allow network flow between labs instance and scandium - https://phabricator.wikimedia.org/T116975#1773382 (10hashar) a:5hashar>3chasemp That follows up a discussion with @chasemp . Not sure exactly who can handle the firewall rules. I would be happy to do a vi... [13:46:41] RECOVERY - puppet last run on mw2019 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [13:47:07] (03CR) 10jenkins-bot: [V: 04-1] ldap-corp: Instruct openldap certificate usage based on DC [puppet] - 10https://gerrit.wikimedia.org/r/250419 (owner: 10Alexandros Kosiaris) [13:48:06] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "LGTM. 
http://puppet-compiler.wmflabs.org/1162/ says noop, merging" [puppet] - 10https://gerrit.wikimedia.org/r/250417 (owner: 10Alexandros Kosiaris) [13:50:23] (03PS5) 10JanZerebecki: webperf::navtiming additionaly store wikidata seperately [puppet] - 10https://gerrit.wikimedia.org/r/249573 (https://phabricator.wikimedia.org/T116568) [14:01:51] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: puppet should safely manage cassandra start/stop - https://phabricator.wikimedia.org/T103134#1773472 (10fgiunchedi) ack, thanks @akosiaris! @mobrovac true, however cassandra starting on boot is an issue only if `gc_grace` time has passed, a reboot for ker... [14:04:23] akosiaris: lda-mirror? :) [14:04:23] (03CR) 10JanZerebecki: webperf::navtiming additionaly store wikidata seperately (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/249573 (https://phabricator.wikimedia.org/T116568) (owner: 10JanZerebecki) [14:07:08] (03PS1) 10Jcrespo: Adding changes found on actual script run from springle's home [software] - 10https://gerrit.wikimedia.org/r/250425 [14:08:50] (03CR) 10Ottomata: "?" [puppet] - 10https://gerrit.wikimedia.org/r/250068 (owner: 10Dzahn) [14:09:32] (03PS2) 10Alexandros Kosiaris: ldap-corp: Instruct openldap certificate usage based on DC [puppet] - 10https://gerrit.wikimedia.org/r/250419 [14:09:34] (03PS2) 10Alexandros Kosiaris: ldap-corp: Populate the per DC certificates [puppet] - 10https://gerrit.wikimedia.org/r/250418 [14:09:36] (03PS2) 10Alexandros Kosiaris: ldap-mirror: Remove the vary on DC name to complete the migration [puppet] - 10https://gerrit.wikimedia.org/r/250420 [14:15:17] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "LGTM and so does to puppet compiler http://puppet-compiler.wmflabs.org/1163/" [puppet] - 10https://gerrit.wikimedia.org/r/250418 (owner: 10Alexandros Kosiaris) [14:15:44] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: puppet should safely manage cassandra start/stop - https://phabricator.wikimedia.org/T103134#1773504 (10mobrovac) >>! In T103134#1773472, @fgiunchedi wrote: > @mobrovac true, however cassandra starting on boot is an issue only if `gc_grace` time has passe... [14:16:33] 6operations, 10ops-eqiad: aqs1001 getting multiple and repeated heat MCEs - https://phabricator.wikimedia.org/T116584#1773505 (10Cmjohnson) 5Open>3Resolved aqs1001 has been up for several days and temps are holding...resolving the ticket cmjohnson@aqs1001:~$ sudo cat /sys/class/thermal/thermal_zone*/temp... [14:23:27] (03CR) 10Ottomata: [C: 031] Change all mysql servers to max_allowed_package = 32MB [puppet] - 10https://gerrit.wikimedia.org/r/249135 (owner: 10Jcrespo) [14:24:25] (03CR) 10Ottomata: "Hm, these role classes use inheritance, could that be the problem?" 
[puppet] - 10https://gerrit.wikimedia.org/r/250038 (owner: 10Muehlenhoff) [14:24:32] (03PS3) 10Alexandros Kosiaris: ldap-corp: Instruct openldap certificate usage based on DC [puppet] - 10https://gerrit.wikimedia.org/r/250419 [14:24:34] (03PS3) 10Alexandros Kosiaris: ldap-mirror: Remove the vary on DC name to complete the migration [puppet] - 10https://gerrit.wikimedia.org/r/250420 [14:27:47] (03PS1) 10Jcrespo: Cleaning up config, setting dbs to install mariadb10 by default [puppet] - 10https://gerrit.wikimedia.org/r/250428 [14:28:43] (03CR) 10jenkins-bot: [V: 04-1] Cleaning up config, setting dbs to install mariadb10 by default [puppet] - 10https://gerrit.wikimedia.org/r/250428 (owner: 10Jcrespo) [14:29:55] fails due to missing depency, which is a good thing [14:30:14] (I need to apply both commits at the same time) [14:30:22] (03PS2) 10ArielGlenn: dumpadmin script: add "rerun" which reruns a broken job [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/249754 [14:30:24] (03PS1) 10ArielGlenn: dumps: fix up an import of a class now in a module [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/250431 [14:30:26] (03PS1) 10ArielGlenn: dumpadmin: mark a job for wiki latest run as done or failed [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/250432 [14:30:28] (03PS1) 10ArielGlenn: dumps: move Runner, DumpItemList out of command line script to module [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/250433 [14:30:43] (03CR) 10Jcrespo: [C: 032] Set MariaDB 10 as the default version when using WMF packages [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/244657 (owner: 10Jcrespo) [14:30:47] 6operations, 5Continuous-Integration-Scaling: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1773519 (10fgiunchedi) ok I've uploaded `oslo-sphinx` `python-oslotest` and `openstack-pkg-tools` (build-dep for oslo-sphinx) to `jessie-wikimed... 
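About the "max_allowed_package = 32MB" change merged above: the server variable is actually `max_allowed_packet` (the gerrit subject carries a typo), and in production it lands through the puppetised my.cnf. A hand-done equivalent, with the drop-in path below being purely illustrative:

```
# Persistent form: roughly what the puppetised config amounts to
# (the conf.d path here is a hypothetical example, not the real template).
sudo tee /etc/mysql/conf.d/max-allowed-packet.cnf <<'EOF'
[mysqld]
max_allowed_packet = 32M
EOF

# Runtime form, picked up by new connections without a restart:
mysql -e "SET GLOBAL max_allowed_packet = 32 * 1024 * 1024;"
```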
[14:31:12] (03PS2) 10Jcrespo: Cleaning up config, setting dbs to install mariadb10 by default [puppet] - 10https://gerrit.wikimedia.org/r/250428 [14:31:31] (03CR) 10Alexandros Kosiaris: [C: 032] "http://puppet-compiler.wmflabs.org/1165/ says OK, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/250419 (owner: 10Alexandros Kosiaris) [14:31:55] please do not merge my last change on palladium [14:33:07] (03CR) 10jenkins-bot: [V: 04-1] Cleaning up config, setting dbs to install mariadb10 by default [puppet] - 10https://gerrit.wikimedia.org/r/250428 (owner: 10Jcrespo) [14:34:01] RECOVERY - Corp OIT LDAP Mirror on pollux is OK: LDAP OK - 0.109 seconds response time [14:34:28] and there's you everything is OK page [14:34:32] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:34:36] s/you/your/ [14:34:40] (03PS3) 10Jcrespo: Cleaning up config, setting dbs to install mariadb10 by default [puppet] - 10https://gerrit.wikimedia.org/r/250428 [14:35:20] !log citoid deployed 70abb90 [14:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:36:32] !log uploaded to apt.wikimedia.org trusty-wikimedia: apertium-br-fr_0.5.0+svn~57870-1 [14:36:35] kart_: ^ [14:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:36:59] (03PS4) 10Jcrespo: Cleaning up config, setting dbs to install mariadb10 by default [puppet] - 10https://gerrit.wikimedia.org/r/250428 [14:41:07] I do not trust the change, I am going to revert [14:41:43] (03CR) 10Jcrespo: [C: 04-1] Cleaning up config, setting dbs to install mariadb10 by default [puppet] - 10https://gerrit.wikimedia.org/r/250428 (owner: 10Jcrespo) [14:41:58] (03PS1) 10Jcrespo: Revert "Set MariaDB 10 as the default version when using WMF packages" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/250436 [14:42:21] (03CR) 10Jcrespo: [C: 032] Revert "Set MariaDB 10 as the default version when using WMF packages" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/250436 (owner: 10Jcrespo) [14:43:20] !log shutting down analytics1038 to run fsck on root filesystem. Currently it is read only [14:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:45:08] (03CR) 10Alexandros Kosiaris: [C: 032] "yes. Not only the hostname, the IP as well. Already did in fact. It ws way faster to do that as pollux's IP is in some firewall ACL's in O" [dns] - 10https://gerrit.wikimedia.org/r/250387 (https://phabricator.wikimedia.org/T117182) (owner: 10Alexandros Kosiaris) [14:45:20] (03CR) 1020after4: [C: 04-1] "Alex: Thanks, it must be something I did wrong with the gitconfig then." 
[puppet] - 10https://gerrit.wikimedia.org/r/250370 (owner: 1020after4) [14:46:21] (03PS2) 10Alexandros Kosiaris: Remove pollux's mgmt IPs [dns] - 10https://gerrit.wikimedia.org/r/250387 (https://phabricator.wikimedia.org/T117182) [14:46:29] (03CR) 10Alexandros Kosiaris: [V: 032] Remove pollux's mgmt IPs [dns] - 10https://gerrit.wikimedia.org/r/250387 (https://phabricator.wikimedia.org/T117182) (owner: 10Alexandros Kosiaris) [14:47:50] 6operations, 10ops-codfw: return pollux to spares - https://phabricator.wikimedia.org/T117423#1773610 (10akosiaris) 3NEW [14:48:06] (03PS3) 10Alexandros Kosiaris: Remove pollux's mgmt IPs [dns] - 10https://gerrit.wikimedia.org/r/250387 (https://phabricator.wikimedia.org/T117182) [14:48:12] (03CR) 10Alexandros Kosiaris: [V: 032] Remove pollux's mgmt IPs [dns] - 10https://gerrit.wikimedia.org/r/250387 (https://phabricator.wikimedia.org/T117182) (owner: 10Alexandros Kosiaris) [14:53:14] (03PS3) 10Jcrespo: Change all mysql servers to max_allowed_package = 32MB [puppet] - 10https://gerrit.wikimedia.org/r/249135 [14:54:15] (03CR) 10Jcrespo: [C: 032] Change all mysql servers to max_allowed_package = 32MB [puppet] - 10https://gerrit.wikimedia.org/r/249135 (owner: 10Jcrespo) [14:55:37] 6operations, 10ops-codfw, 5Patch-For-Review: return pollux to spares - https://phabricator.wikimedia.org/T117423#1773652 (10akosiaris) DNS entries and switchport configuration removal done. [14:55:41] 6operations, 5Continuous-Integration-Scaling: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1773653 (10hashar) Nice, seems that fixed it :-} I backported python-os-client-config_1.6.0-4 using: ``` git clone git://anonscm.debian.org/o... [14:55:47] 6operations, 10ops-codfw, 5Patch-For-Review: return pollux to spares - https://phabricator.wikimedia.org/T117423#1773654 (10akosiaris) [14:56:08] godog: I managed to backport python-os-client-config :-} https://phabricator.wikimedia.org/T104967#1773653 [14:57:46] (03CR) 10Jcrespo: "-1 because even if the CI said ok after merging the operations/mariadb repo, the puppet compiler said:" [puppet] - 10https://gerrit.wikimedia.org/r/250428 (owner: 10Jcrespo) [14:59:40] (03CR) 10Jcrespo: [C: 032] Adding changes found on actual script run from springle's home [software] - 10https://gerrit.wikimedia.org/r/250425 (owner: 10Jcrespo) [14:59:47] (03CR) 10Jcrespo: [V: 032] Adding changes found on actual script run from springle's home [software] - 10https://gerrit.wikimedia.org/r/250425 (owner: 10Jcrespo) [15:00:38] 6operations, 10ops-codfw, 5Patch-For-Review: return pollux to spares - https://phabricator.wikimedia.org/T117423#1773672 (10akosiaris) [15:04:10] (03PS2) 10Alexandros Kosiaris: exim: Add and use $::other_site to provide LDAP fallback [puppet] - 10https://gerrit.wikimedia.org/r/249868 (https://phabricator.wikimedia.org/T82662) [15:04:12] (03PS1) 10Alexandros Kosiaris: exim: removal of non-DC aware ldap-mirror CNAME [puppet] - 10https://gerrit.wikimedia.org/r/250438 [15:04:54] 6operations, 5Continuous-Integration-Scaling: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1773704 (10hashar) Once done, we can backport `python-shade` (T107267) [15:14:08] 6operations, 10ops-codfw, 5Patch-For-Review: return pollux to spares - https://phabricator.wikimedia.org/T117423#1773750 (10akosiaris) [15:14:39] (03PS1) 10Filippo Giunchedi: cassandra: multi-instance aware CQL checks [puppet] - 
10https://gerrit.wikimedia.org/r/250439 (https://phabricator.wikimedia.org/T93886) [15:20:17] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/250439 (https://phabricator.wikimedia.org/T93886) (owner: 10Filippo Giunchedi) [15:22:22] RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:22:48] (03CR) 10Filippo Giunchedi: "looks like this is DTRT https://puppet-compiler.wmflabs.org/1168/restbase-test2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/250439 (https://phabricator.wikimedia.org/T93886) (owner: 10Filippo Giunchedi) [15:30:26] (03CR) 10Jforrester: [C: 031] Close wikimania2014wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250416 (https://phabricator.wikimedia.org/T105675) (owner: 10TTO) [15:35:39] Why did the performance of the WMF Tor relay worsen so much since May? https://atlas.torproject.org/#details/DB19E709C9EDB903F75F2E6CA95C84D637B62A02 [15:36:24] Nemo_bis: who knows. [15:36:48] we upgraded it last week to the latest and greatest version, we'll see if it gets any better [15:37:20] I'm planning to go multi-instance once 0.2.7 becomes stable, that may help double our throughput in total [15:37:22] Ah, is that the reason this doppelganger briefly appeared https://atlas.torproject.org/#details/DB093A557E8BDCD8B0283EA5D86C3B4DD5385E47 [15:37:40] hmm [15:37:43] that's weird [15:38:03] dunno, mutante did the original upgrade, maybe something went bad there [15:39:20] 6operations, 6Parsing-Team, 10hardware-requests: Dedicated server for running Parsoid's roundtrip tests to get reliable parse latencies and use as perf. benchmarking tests - https://phabricator.wikimedia.org/T116090#1773847 (10ssastry) They are not puppetized. It would require nodejs, and some upstart config... [15:45:52] PROBLEM - puppet last run on mw2183 is CRITICAL: CRITICAL: puppet fail [15:55:24] (03CR) 10Giuseppe Lavagetto: [C: 032] Add tunable number of keepalive retries to IdleConnectionMonitor [debs/pybal] - 10https://gerrit.wikimedia.org/r/249983 (owner: 10Giuseppe Lavagetto) [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151102T1600). [16:00:05] matt_flaschen Krenair Glaisher: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:27] <-- available [16:00:30] hey [16:01:00] matt_flaschen, around? [16:01:32] Krenair: Are you doing SWAT? [16:01:43] looks like it [16:02:14] ok, heads up that updateCollation needs to be run for a few wikis [16:02:25] yes, I read the commit message :) [16:02:40] :-) [16:02:46] will you be able to check that it's working correctly? [16:05:13] I could check whether the categories sorting changes :) [16:06:04] (yay, collations!) [16:08:10] (03CR) 10Alex Monk: [C: 032] Set $wgCategoryCollation to uca-sr at srprojects (-wikipedia, wiktionary) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250401 (https://phabricator.wikimedia.org/T115806) (owner: 10Glaisher) [16:08:32] (03Merged) 10jenkins-bot: Set $wgCategoryCollation to uca-sr at srprojects (-wikipedia, wiktionary) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250401 (https://phabricator.wikimedia.org/T115806) (owner: 10Glaisher) [16:08:54] (03PS1) 10Paladox: Re add some regex.global. 
code back for gitblit [puppet] - 10https://gerrit.wikimedia.org/r/250444 [16:09:07] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/250401/ (duration: 00m 18s) [16:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:09:13] Glaisher, ^ [16:09:35] ok, script needs to be run now. which wiki will you be doing first? [16:10:06] done srwikibooks [16:10:24] (03PS2) 10Paladox: Re add some regex.global. code back for gitblit [puppet] - 10https://gerrit.wikimedia.org/r/250444 [16:10:40] doing srwikinews [16:11:22] PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:12:37] Krenair, here now, sorry. [16:13:11] RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING [16:13:34] Krenair: if it's not that slow (matmarex just said on the task that it will be safe), can we get it done for wiktionary as well? [16:13:46] (03PS1) 10Paladox: Re enable bzip2 for gitblit downloads [puppet] - 10https://gerrit.wikimedia.org/r/250447 [16:13:52] srwikiquote+srwikisource+rswikimedia done in the mean time, Glaisher [16:14:03] RECOVERY - puppet last run on mw2183 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:14:05] srwikinews has 500k rows to fix, so. . . [16:14:06] ah, nice [16:14:17] uh.. [16:14:17] (03PS2) 10Paladox: Re enable bzip2 for gitblit downloads [puppet] - 10https://gerrit.wikimedia.org/r/250447 [16:14:38] I must've misread the count at srwikinews [16:16:24] or they just add a ton of categories to pages [16:18:52] PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:19:47] (03PS3) 10Rush: Labs instance subnet allocation for codfw [dns] - 10https://gerrit.wikimedia.org/r/249919 (https://phabricator.wikimedia.org/T115492) [16:19:51] Glaisher, anyway, everything else looking okay? [16:20:35] RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING [16:20:47] Krenair: yep, nothing broken so far afaict [16:21:03] but I didn't notice any change in any of the categories I was checking [16:21:30] all of them were small categories and probably there was nothing to change in them :) [16:21:53] done [16:22:32] matt_flaschen, hey [16:22:41] PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:22:43] cool [16:23:07] and doesn't look like anything's broken on any of the wikis [16:23:41] Krenair: I'll get the other wikis done in another window then [16:23:55] Thank you [16:24:14] Krenair, hey. 
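The "script needs to be run now" exchange above refers to MediaWiki's updateCollation.php maintenance script, which has to be run once on every wiki whose $wgCategoryCollation changed so that existing categorylinks rows are re-sorted under the new collation. A sketch of the per-wiki invocation via the mwscript wrapper; the exact flags used in this run are not in the log:

```
# Re-sort categorylinks after switching the sr projects to uca-sr (flags assumed).
mwscript updateCollation.php --wiki=srwikibooks
mwscript updateCollation.php --wiki=srwikinews
mwscript updateCollation.php --wiki=srwikiquote
# ... and likewise for the remaining sr wikis covered by the change.
```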
[16:24:27] (03PS2) 10Alex Monk: Remove Flow cache version override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249935 (https://phabricator.wikimedia.org/T117138) (owner: 10Mattflaschen) [16:24:39] (03CR) 10Alex Monk: [C: 032] Remove Flow cache version override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249935 (https://phabricator.wikimedia.org/T117138) (owner: 10Mattflaschen) [16:25:11] (03Merged) 10jenkins-bot: Remove Flow cache version override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249935 (https://phabricator.wikimedia.org/T117138) (owner: 10Mattflaschen) [16:25:21] (03PS4) 10Rush: Labs instance subnet allocation for codfw [dns] - 10https://gerrit.wikimedia.org/r/249919 (https://phabricator.wikimedia.org/T115492) [16:25:25] (03PS5) 10Faidon Liambotis: Labs instance subnet allocation for codfw [dns] - 10https://gerrit.wikimedia.org/r/249919 (https://phabricator.wikimedia.org/T115492) (owner: 10Rush) [16:25:59] matt_flaschen, so if I understand correctly, the config change is a no-op, but the two files in the Flow extension change need to be synchronised as closely together as possible? [16:26:19] Krenair, yes. [16:26:44] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/249935/ (duration: 00m 18s) [16:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:26:49] matt_flaschen, config ^ [16:27:48] Krenair, did a very quick smoke test. No apparent issues. [16:28:21] RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: OK: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: RUNNING [16:28:35] matt_flaschen, done, please check [16:28:36] !log krenair@tin Synchronized php-1.27.0-wmf.4/extensions/Flow: https://gerrit.wikimedia.org/r/#/c/249940/1 (duration: 00m 20s) [16:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:30:42] !log Deleted logstash-2015.10.03 index to free disk space on logstash1004; see T113571 [16:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:32:15] Krenair, thanks. Looks good. [16:32:25] ok [16:34:27] (03PS3) 10Alex Monk: beta: Use protocol-relative link to connect to restbase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249794 [16:34:33] (03CR) 10Alex Monk: [C: 032] beta: Use protocol-relative link to connect to restbase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249794 (owner: 10Alex Monk) [16:34:40] (03PS2) 10Paladox: Re enable git.enableGitServlet [puppet] - 10https://gerrit.wikimedia.org/r/250450 [16:35:07] (03Merged) 10jenkins-bot: beta: Use protocol-relative link to connect to restbase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249794 (owner: 10Alex Monk) [16:35:18] jynus: thanks for noting the disk space issue on the logstash hosts. I've band-aided the problem to get the cluster back to green. I have an idea for what to do next to give us some more head room on these hosts. [16:35:32] PROBLEM - dhclient process on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:35:55] Krenair: is it too late for me to stick a beta-only change in SWAT? 
(deploying UrlShortener) [16:36:01] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/249794/ - should be a noop for prod (duration: 00m 17s) [16:36:02] legoktm, nope [16:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:36:32] * legoktm rebases https://gerrit.wikimedia.org/r/#/c/248602/ [16:36:50] greg-g, hey, is that okay? ^ [16:37:21] RECOVERY - dhclient process on analytics1032 is OK: PROCS OK: 0 processes with command name dhclient [16:38:02] (03PS4) 10Legoktm: Set up UrlShortener extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248602 (https://phabricator.wikimedia.org/T116444) [16:38:17] oh, do we need greg's approval? [16:38:47] legoktm, yes, because this is a new extension and you want it in swat [16:38:55] that's against the swat guidelines [16:39:31] right right. [16:41:31] ok, no rush then [16:41:48] looking at the checklist - I don't think product/design review are relevant to beta [16:41:58] branching isn't [16:42:33] mediawiki.org docs I can live without for this [16:43:46] (03PS2) 10Alex Monk: Close wikimania2014wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250416 (https://phabricator.wikimedia.org/T105675) (owner: 10TTO) [16:44:07] (03CR) 10Alex Monk: [C: 032] Close wikimania2014wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250416 (https://phabricator.wikimedia.org/T105675) (owner: 10TTO) [16:44:41] (03Merged) 10jenkins-bot: Close wikimania2014wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250416 (https://phabricator.wikimedia.org/T105675) (owner: 10TTO) [16:45:54] !log krenair@tin Synchronized dblists/closed.dblist: https://gerrit.wikimedia.org/r/#/c/250416/ - close wikimania2014.wikimedia.org (duration: 00m 18s) [16:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:47:41] !log krenair@tin Synchronized dblists/visualeditor-default.dblist: https://gerrit.wikimedia.org/r/#/c/250416/ (duration: 00m 17s) [16:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:48:18] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/250416/ (duration: 00m 17s) [16:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:54:53] (03CR) 10Alex Monk: "I don't think this would actually perform the renames requested?" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247093 (https://phabricator.wikimedia.org/T115812) (owner: 10Luke081515) [16:56:39] (03PS1) 10Alexandros Kosiaris: Revert "Revert "Apertium: Add missing apertium-br-fr"" [puppet] - 10https://gerrit.wikimedia.org/r/250452 [16:56:58] (03PS2) 10Alexandros Kosiaris: Revert "Revert "Apertium: Add missing apertium-br-fr"" [puppet] - 10https://gerrit.wikimedia.org/r/250452 [16:57:00] (03CR) 10Luke081515: "Sorry, the tab there was not my intention." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/247093 (https://phabricator.wikimedia.org/T115812) (owner: 10Luke081515) [16:59:08] (03PS6) 10Alex Monk: Enable four new namespaces at thwikitionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247088 (https://phabricator.wikimedia.org/T114458) (owner: 10Luke081515) [16:59:15] (03CR) 10Alex Monk: [C: 032] Enable four new namespaces at thwikitionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247088 (https://phabricator.wikimedia.org/T114458) (owner: 10Luke081515) [16:59:39] (03Merged) 10jenkins-bot: Enable four new namespaces at thwikitionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247088 (https://phabricator.wikimedia.org/T114458) (owner: 10Luke081515) [17:00:04] matt_flaschen: Dear anthropoid, the time has come. Please deploy Flow (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151102T1700). [17:00:19] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/247088/ (duration: 00m 17s) [17:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:00:55] 6operations, 10Analytics, 6Analytics-Kanban, 6Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1774236 (10Ottomata) [17:03:21] (03PS1) 10Paladox: Show more then 5 commits [puppet] - 10https://gerrit.wikimedia.org/r/250453 (https://phabricator.wikimedia.org/T117393) [17:05:43] (03PS2) 10Ottomata: Enable the limn-ee-data report runner [puppet] - 10https://gerrit.wikimedia.org/r/249812 (owner: 10Milimetric) [17:05:58] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "Revert "Apertium: Add missing apertium-br-fr"" [puppet] - 10https://gerrit.wikimedia.org/r/250452 (owner: 10Alexandros Kosiaris) [17:06:28] (03PS2) 10Alex Monk: Add reupload-shared right to autoconfirmed users (ruwikivoyage) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249762 (https://phabricator.wikimedia.org/T116575) (owner: 10Luke081515) [17:06:34] (03CR) 10Alex Monk: [C: 032] Add reupload-shared right to autoconfirmed users (ruwikivoyage) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249762 (https://phabricator.wikimedia.org/T116575) (owner: 10Luke081515) [17:07:03] (03Merged) 10jenkins-bot: Add reupload-shared right to autoconfirmed users (ruwikivoyage) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249762 (https://phabricator.wikimedia.org/T116575) (owner: 10Luke081515) [17:07:46] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/249762/ (duration: 00m 19s) [17:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:08:47] (03PS2) 10Paladox: Show more then 5 commits [puppet] - 10https://gerrit.wikimedia.org/r/250453 (https://phabricator.wikimedia.org/T117393) [17:08:49] (03PS3) 10Ottomata: Enable the limn-ee-data report runner [puppet] - 10https://gerrit.wikimedia.org/r/249812 (owner: 10Milimetric) [17:08:55] (03CR) 10Ottomata: [C: 032 V: 032] Enable the limn-ee-data report runner [puppet] - 10https://gerrit.wikimedia.org/r/249812 (owner: 10Milimetric) [17:09:05] (03PS3) 10Paladox: Show more then 5 commits per repo page [puppet] - 10https://gerrit.wikimedia.org/r/250453 (https://phabricator.wikimedia.org/T117393) [17:09:32] !log Decreased replica count to 1 for logstash-2015.10.04 thru logstash-2015.10.12 to free cluster disk space; see T117438 [17:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:11:30] 
6operations: 2FA for SSH access to the production cluster - https://phabricator.wikimedia.org/T116750#1774292 (10MoritzMuehlenhoff) Summarising some bits already mentioned at the offsite along with further tests done and a plan how to move forward (see below). **Objective** The important security property to ga... [17:15:51] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:54] etherpad.wikimedia.org died a few minutes ago if one can kick it on etherpad1001 .. [17:16:08] (03PS1) 10Luke081515: Add patroller group to sawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250454 (https://phabricator.wikimedia.org/T117314) [17:17:07] hashar: I think Alex is already on it [17:17:33] thx :à [17:17:41] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 0.005 second response time [17:18:48] (03PS2) 10Luke081515: Add patroller group to sawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250454 (https://phabricator.wikimedia.org/T117314) [17:23:35] 7Puppet, 6Labs, 6Phabricator: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1774347 (10Luke081515) 3NEW [17:24:05] 7Puppet, 6Labs, 6Phabricator: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1774356 (10mmodell) a:3mmodell [17:29:12] 6operations, 6Labs, 10wikitech.wikimedia.org, 7Wikimedia-log-errors: RunJobs.php fails to be executed on labswiki - https://phabricator.wikimedia.org/T117394#1774385 (10Krenair) IIRC, labswiki jobs are supposed to be running locally on silver only... [17:29:18] (03CR) 10EBernhardson: "will be deploying this in the SF afternoon SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) (owner: 10EBernhardson) [17:30:52] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:35:32] PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
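The "Decreased replica count to 1 for logstash-2015.10.04 thru logstash-2015.10.12" entry above (and BryanDavis's later puppet patch to do the same automatically after 21 days) is an Elasticsearch index-settings change on the logstash cluster. A hedged sketch of the manual form, assuming the default HTTP port 9200 on the logstash Elasticsearch nodes:

    # Drop an old logstash index to a single replica to free cluster disk space.
    curl -XPUT 'http://logstash1004.eqiad.wmnet:9200/logstash-2015.10.04/_settings' \
         -d '{"index": {"number_of_replicas": 1}}'
    # Confirm the setting took effect:
    curl -s 'http://logstash1004.eqiad.wmnet:9200/logstash-2015.10.04/_settings?pretty'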
[17:36:22] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 0.357 second response time [17:39:11] RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING [17:39:11] (03PS7) 10Rush: Labs instance subnet allocation for codfw [dns] - 10https://gerrit.wikimedia.org/r/249919 (https://phabricator.wikimedia.org/T115492) [17:39:13] (03PS3) 10Rush: Allocate reserved labs-hosts1-b-codfw [dns] - 10https://gerrit.wikimedia.org/r/249914 [17:39:15] (03PS1) 10Rush: Fix codfw row a labs-hosts1 and labs-support1 IP overlap [dns] - 10https://gerrit.wikimedia.org/r/250458 [17:39:37] 10Ops-Access-Requests, 6operations: Requesting access to add perf-roots group to graphite role - https://phabricator.wikimedia.org/T117256#1774429 (10Cmjohnson) This has been pushed to ops meeting on 11/9 [17:39:51] !log restbase cassandra: switched local_group_wikisource_T_parsoid_html and local_group_wikibooks_T_parsoid_html to DTCS [17:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:44:11] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:11] (03PS1) 10Mattflaschen: Add computed dblist for Flow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250460 [17:45:18] (03PS2) 10Cmjohnson: admin: add jzerebecki to deployers [puppet] - 10https://gerrit.wikimedia.org/r/249818 (https://phabricator.wikimedia.org/T116487) (owner: 10Dzahn) [17:45:50] (03CR) 10Mattflaschen: [C: 04-2] "Needs Collaboration team review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250460 (owner: 10Mattflaschen) [17:46:37] (03CR) 10Cmjohnson: [C: 032] admin: add jzerebecki to deployers [puppet] - 10https://gerrit.wikimedia.org/r/249818 (https://phabricator.wikimedia.org/T116487) (owner: 10Dzahn) [17:48:41] is there work going on in ulsfo this am? [17:49:01] seen a few windows of packet drops.. robh: maybe you know? [17:49:10] and it /looks/ like the ulsfo pop is offline? [17:49:16] cajoel: Oh yes, we completely depooled it! [17:49:27] and it seems we fucked up on notification, sorry =[ [17:49:31] how about general network changes going on? [17:49:39] we're replacing the patch panels there [17:49:39] we've had a few connectivity drops.. [17:49:43] so eveyrthing would have flipped [17:49:46] at least once [17:49:52] makes sense [17:49:55] and again, we should have let you know, really sorry =[ [17:50:09] is it stable now, or should I consider moving to Monkeybrains for a bit? [17:50:18] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1774488 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson The mandatory waiting period was over and Greg approved. Merged the changed https://gerrit.wikimed... [17:50:26] Supposedly they are done with the migration and are now only labeling the panel [17:50:29] so it should be stable [17:50:31] cool [17:50:37] but we havent repooled service to it yet until they finish the labeling [17:50:42] in case he trips or sometihng [17:50:45] but yea, you should be ok. [17:50:54] do you guys have a phab ticket type for hands on work or traffic impacting work? [17:51:01] maybe something I could watch for? 
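The "switched ... to DTCS" entry above means the compaction strategy on a Cassandra table was changed to DateTieredCompactionStrategy. For reference, a sketch of what that looks like through cqlsh — the keyspace name is taken from the log entry, while the contact host and the table name inside the keyspace are assumptions:

    # Switch a RESTBase storage table to date-tiered compaction.
    cqlsh restbase1001.eqiad.wmnet -e "ALTER TABLE \"local_group_wikisource_T_parsoid_html\".data
      WITH compaction = {'class': 'DateTieredCompactionStrategy'};"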
[17:51:14] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: wdqs-admin group membership for Marius Hoch (hoo) and Jan Zerebecki - https://phabricator.wikimedia.org/T116702#1774496 (10Cmjohnson) Pushing to next ops meeting 11/9 [17:51:15] uh, lemme see what i tagged on this one [17:51:20] (what I really should be doing is attending ops meetings, but at the moment they don't line up with a good time slot for me) [17:51:23] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 0.004 second response time [17:51:40] well, i should have realized this would affect you guys and sent an email out specifically [17:51:41] oh I responded on the other channel [17:51:43] and it just slipped my brain [17:52:00] legoktm: are you responsible for https://en.wikipedia.org/wiki/Special:GadgetUsage ? good work if so. looks useful. [17:52:10] so we never completely lost connectivity to ulsfo, it was just a series of ups/downs and BGP convergence issues [17:52:24] ori: I am not! Niharika is :) [17:52:34] Niharika: kudos! [17:53:35] paravoid: the benefits of one tech doing one rack and then the other, woo redundancy \o/ [17:58:15] Just FYI, the scheduled FlowFixLinks maintenance script for this window is still running, and will probably go a little outside the window. I will log when it's complete. [17:59:33] !log restarting elastic on nobelium [17:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:01:04] ebernhardson: how's the nobelium copy going? [18:01:37] !log performing some fixes on labsdb1004 database. Slightly higher load on labsdb1005. [18:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:04:48] (03CR) 10Chad: [C: 031] scap: Add co-master configuration [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis) [18:08:37] YuviPanda: crashed over the weekend [18:08:51] YuviPanda: (elasticsearch crashed out, but not completely enough to get restarted) [18:08:59] basically the jvm heap filled up ... not sure best solution yet [18:12:29] ah [18:13:08] 6operations: hafnium should not have a public interface - https://phabricator.wikimedia.org/T117449#1774610 (10ori) 3NEW [18:13:35] 6operations, 5Continuous-Integration-Scaling, 3labs-sprint-119: Allow network flow between labs instance and scandium - https://phabricator.wikimedia.org/T116975#1774620 (10chasemp) [18:13:45] 6operations, 6Labs, 10Labs-Infrastructure, 10netops, and 3 others: Allocate subnet for labs test cluster instances - https://phabricator.wikimedia.org/T115492#1774624 (10chasemp) [18:13:50] YuviPanda: the short of it seems to be one machine with a single 32gb jvm is just not going to cut it. jvm is fun! [18:13:56] 6operations, 6Labs, 10Labs-Infrastructure, 10netops, and 2 others: Allocate labs subnet in dallas - https://phabricator.wikimedia.org/T115491#1774634 (10chasemp) [18:15:32] ebernhardson: augh :( [18:15:39] ebernhardson: don't you limit heap to 30G? [18:15:51] YuviPanda: we do, but the old generation took up 28.6G [18:15:56] YuviPanda: and running a GC didn't take anything out of it :( [18:16:00] ah damn [18:20:45] ebernhardson: so what should we do now? [18:20:55] ebernhardson: drop a bunch of big wikis? 
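The state ebernhardson describes on nobelium — old-generation heap nearly full and a GC pass reclaiming nothing — can be read straight out of Elasticsearch's node stats, which is also where the field data usage discussed a little further down would show up once metrics collection covers it. A minimal sketch, assuming the default HTTP port 9200 on the node:

    # JVM heap pools and garbage-collector counts/times for the local node:
    curl -s 'http://localhost:9200/_nodes/_local/stats/jvm?pretty'
    # Field data size per node (unbounded by default, so a likely culprit here):
    curl -s 'http://localhost:9200/_nodes/_local/stats/indices/fielddata?pretty'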
[18:23:08] YuviPanda: i dunno, i'm hoping dcausse has some ideas :) [18:23:49] YuviPanda: he was thinking it may be fielddata which is unbounded in ES, but can't check because its not being collected in graphite and the jvm is frozen. basically have to update metrics collection and let it crash again [18:24:14] i should stop calling it crashed, what its really doing is permenantly in a GC cycle trying to find memory [18:24:32] ebernhardson: we should wait a bit more and check fieldcache usage [18:24:54] ok [18:25:04] someone restarted nobelium? [18:25:10] sorry i just did that :( [18:25:13] ah ok :) [18:28:30] 6operations, 6Labs, 10Labs-Infrastructure: Unable to connect both redundant labstores to the shelves in parallel - https://phabricator.wikimedia.org/T117453#1774715 (10yuvipanda) [18:32:31] ebernhardson: heh, that kinda sucks :( [18:32:33] but oh well! [18:36:00] greg-g: Mind if I create deployment window for two one-line fixes at 1900 UTC? [18:36:15] On in Wikidata, one in MobileFrontend, both mobile related [18:37:56] (03PS3) 10Dzahn: snapshot: no $hostname checks in node groups, use role [puppet] - 10https://gerrit.wikimedia.org/r/250082 [18:39:01] (03PS4) 10Dzahn: snapshot: no $hostname checks in node groups, use role [puppet] - 10https://gerrit.wikimedia.org/r/250082 [18:39:15] mutante: get in line... [18:39:28] (03PS5) 10Dzahn: snapshot: no $hostname checks in node groups, use role [puppet] - 10https://gerrit.wikimedia.org/r/250082 [18:39:29] I have a bunch of snapshot patches up for review since PR... [18:40:13] paravoid: heh, itsite.pp t [18:40:20] arg, let me try that again [18:40:22] hoo: doit [18:40:26] Thanks :) [18:41:10] paravoid: it's more about site.pp and the role keyword and these "if $::fqdn" within a node section that make things messy [18:41:19] I don't disagree [18:41:31] !log manually starting service replicate-tools on labstore1001 [18:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:41:39] just saying, don't expect a review very soon :P [18:41:51] PROBLEM - Last backup of the tools filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-tools was exit-code [18:41:53] oh, i was about to merge, got 2 x +1 :) [18:42:00] oh heh [18:42:01] lucky you [18:42:38] (03CR) 10Dzahn: [C: 032] snapshot: no $hostname checks in node groups, use role [puppet] - 10https://gerrit.wikimedia.org/r/250082 (owner: 10Dzahn) [18:43:43] RECOVERY - salt-minion processes on labvirt1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:45:42] RECOVERY - Last backup of the tools filesystem on labstore1001 is OK: OK - Last run for unit replicate-tools was successful [18:46:33] RECOVERY - salt-minion processes on labvirt1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:48:14] !log dist-upgrade and rebooting labvirt1005 [18:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:51:58] (03PS1) 10Jforrester: Enable VisualEditor for 10% of new accounts on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250470 (https://phabricator.wikimedia.org/T117410) [18:52:00] (03PS1) 10Jforrester: Enable VisualEditor for 25% of new accounts on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250471 (https://phabricator.wikimedia.org/T117410) [18:52:02] (03PS1) 10Jforrester: Enable VisualEditor for 50% of new accounts on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250472 
(https://phabricator.wikimedia.org/T117410) [18:52:04] (03PS1) 10Jforrester: Enable VisualEditor for all new accounts on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250473 (https://phabricator.wikimedia.org/T117410) [18:52:06] (03PS1) 10Jforrester: Enable VisualEditor for all accounts on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250474 (https://phabricator.wikimedia.org/T117410) [18:53:54] !log deployed patches for T97897 & T115522 [18:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:55:55] greg-g: Heads-up about https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=198879&oldid=198876 BTW. [18:56:27] James_F: weee [18:56:52] greg-g: Inorite? Eswiki, so checking with FR but 'should' be OK. [18:59:07] (03PS1) 10Andrew Bogott: Add labvirt1005 to the nova scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/250475 [19:00:13] !log Completed FlowFixLinks run on all Flow wikis [19:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:02:02] (03CR) 10Jforrester: [C: 04-2] "Not just yet. :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250474 (https://phabricator.wikimedia.org/T117410) (owner: 10Jforrester) [19:02:11] (03CR) 10Jforrester: [C: 04-2] "Not just yet. :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250473 (https://phabricator.wikimedia.org/T117410) (owner: 10Jforrester) [19:02:29] (03CR) 10Jforrester: [C: 04-1] "Roughly scheduled for 17 November." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250472 (https://phabricator.wikimedia.org/T117410) (owner: 10Jforrester) [19:02:40] (03CR) 10Jforrester: [C: 04-1] "Roughly scheduled for 10 November." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250471 (https://phabricator.wikimedia.org/T117410) (owner: 10Jforrester) [19:02:59] (03CR) 10Jforrester: "Provisionally scheduled for tomorrow, 3 November." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250470 (https://phabricator.wikimedia.org/T117410) (owner: 10Jforrester) [19:05:26] (03Abandoned) 10Dzahn: nobelium: use role keyword [puppet] - 10https://gerrit.wikimedia.org/r/250076 (owner: 10Dzahn) [19:09:34] (03CR) 10Yuvipanda: "I think if we explicitly set provider => fastapt in exec_environ / dev_environ only it'll be ok." [puppet] - 10https://gerrit.wikimedia.org/r/249489 (https://phabricator.wikimedia.org/T116813) (owner: 10Merlijn van Deen) [19:09:55] valhallasw`cloud: ^ I'm up for merging :D [19:11:25] YuviPanda: why only exec_environ and dev_environ? [19:11:46] valhallasw`cloud: at least to begin with, I guess? 
those are the biggest package installers inside tools [19:11:54] ah, right [19:14:20] !log restbase cassandra: switching local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ to DTCS [19:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:16:18] (03PS3) 10Dzahn: Add DNS entries for the CODFW labtest servers [dns] - 10https://gerrit.wikimedia.org/r/250164 (https://phabricator.wikimedia.org/T117107) (owner: 10Papaul) [19:17:19] (03CR) 10Dzahn: [C: 032] "yep, WMF inventory numbers match hostnames, consistent with ticket" [dns] - 10https://gerrit.wikimedia.org/r/250164 (https://phabricator.wikimedia.org/T117107) (owner: 10Papaul) [19:23:15] 6operations: diamond doesn't gracefully handled elasticsearch failure - https://phabricator.wikimedia.org/T117461#1774942 (10EBernhardson) 3NEW [19:23:55] (03PS1) 10Filippo Giunchedi: install_server: partman config for ms-be HP machines in codfw [puppet] - 10https://gerrit.wikimedia.org/r/250479 [19:26:48] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] install_server: partman config for ms-be HP machines in codfw [puppet] - 10https://gerrit.wikimedia.org/r/250479 (owner: 10Filippo Giunchedi) [19:27:15] godog: still not EFI I guess? [19:27:26] paravoid: no :( [19:27:43] that's disappointing :(o [19:29:48] indeed, OTOH since it involves changing partitions for example we might not want to start with swift or anything stateful really [19:30:05] hoo: Respected human, time to deploy Wikidata related backports (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151102T1930). Please do the needful. [19:30:12] related is https://phabricator.wikimedia.org/T93208 [19:31:31] 6operations: diamond doesn't gracefully handled elasticsearch failure - https://phabricator.wikimedia.org/T117461#1774972 (10EBernhardson) This is visible in metrics like `servers.nobelium.diskspace._var_lib_elasticsearch.byte_free` which has a gap from 10/31 through 11/2 [19:36:52] 6operations: diamond doesn't gracefully handled elasticsearch failure - https://phabricator.wikimedia.org/T117461#1775026 (10chasemp) Huh -- weird I will try to take a look. Can I stop ES there to test? [19:39:46] 6operations: diamond doesn't gracefully handled elasticsearch failure - https://phabricator.wikimedia.org/T117461#1775050 (10EBernhardson) yea, but i'm not sure if just stopping will be enough. Over the weekend elasticsearch was still running and i think the ports were open, but the jvm wasn't able to respond to... [19:41:03] slow jenkins is slow [19:41:54] adds 10x "recheck" and rebase to wake it up [19:42:25] awight ^ [19:43:43] Krenair: that would be me, sorry. [19:44:16] I thought we'd fixed the integration-config issue with https://phabricator.wikimedia.org/T117062 [19:45:01] The wikimedia-fundraising-crm job shouldn't run on *deployment* branches, we only want basic lint... and there's too much pressure to fix other live stuff to revisit right now :( [19:45:51] on second thought... lemme take the time to fix this now. But I'll need help. 
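On the ulsfo repool just above: draining and repooling a caching site is done by merging a change to the operations/dns repository (here, reverting the earlier "Drain ulsfo" commit) and pushing the result to the authoritative nameservers. A rough sketch of the operator-side steps; the deploy command name is an assumption based on usual WMF practice rather than anything shown in this log:

    # In a checkout of operations/dns, undo the drain commit and send it for review:
    git revert <sha-of-drain-commit>   # sha elided; the log only shows gerrit change 250489
    git review
    # After merge, push the updated zones/config out to the authdns servers:
    authdns-update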
[19:46:30] This is what I tried, and it didn't work: https://gerrit.wikimedia.org/r/#/c/249968/ [19:46:43] legoktm, ^ [19:46:52] !log hoo@tin Synchronized php-1.27.0-wmf.4/extensions/MobileFrontend/includes/MobileFrontend.skin.hooks.php: Fix changing the license message key via the "MobileLicenseLink" hook (duration: 00m 17s) [19:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:47:01] "Run the job on any branch which does not contain the string "deployment" anywhere, and is not exactly "contrib" [19:47:09] is what I want [19:47:11] Licenses on mobile Wikidata are bock :) [19:47:16] awight: it's not working? [19:47:18] hm [19:47:43] * legoktm reads zuul docs [19:47:56] sorry ;) [19:48:10] 6operations, 5Patch-For-Review: migrate pollux/plutonium into VMs - https://phabricator.wikimedia.org/T117182#1775083 (10akosiaris) pollux was migrated into a VM today. plutonium is going to be more difficult. It will need a IP change as it is in a different row than the ganeti cluster [19:50:42] (03CR) 10Dzahn: [C: 031] "yes, matches the existing eqiad setup" [dns] - 10https://gerrit.wikimedia.org/r/249919 (https://phabricator.wikimedia.org/T115492) (owner: 10Rush) [19:55:24] (03PS1) 10Faidon Liambotis: Revert "Drain ulsfo, planned on-site maintainance" [dns] - 10https://gerrit.wikimedia.org/r/250489 [19:55:29] (03PS2) 10Faidon Liambotis: Revert "Drain ulsfo, planned on-site maintainance" [dns] - 10https://gerrit.wikimedia.org/r/250489 [19:56:43] (03CR) 10Faidon Liambotis: [C: 032] Revert "Drain ulsfo, planned on-site maintainance" [dns] - 10https://gerrit.wikimedia.org/r/250489 (owner: 10Faidon Liambotis) [19:57:04] !log repooling ulsfo [19:57:39] 6operations, 10Traffic, 10netops, 5Patch-For-Review: drain ULSFO of all traffic on 2015-11-02 before 0900 PST - https://phabricator.wikimedia.org/T116928#1775110 (10faidon) 5Open>3Resolved Drained and subsequently repooled. [20:00:43] !log ms-be2016 - signing puppet certs, salt-key, initial run [20:01:29] logmsgbot: hey [20:01:47] !log hoo@tin Synchronized php-1.27.0-wmf.4/extensions/Wikidata: Update Wikibase: Use a local URL as action in forms created by SpecialModifyEntity (duration: 00m 26s) [20:01:48] !log are we still logging? [20:02:26] morebots died [20:02:42] ok, so i did that restart once not long ago.. [20:02:43] checking [20:04:11] Verified, ok [20:04:38] on tool labs [20:04:46] last time it was running but needed to be killed [20:04:50] this time it's just not running [20:05:31] Your job 1153912 ("production-logbot") has been submitted [20:05:55] !log started morebots (production-logbot job was gone) [20:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:06:19] Mh... why is there no cron that kicks it every now and then? [20:06:26] SGE should prevent it from running twice [20:06:26] !log ms-be2016 - signing puppet certs, salt-key, initial run [20:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:06:39] papaul: ^ see how it works now? morebots replied to me [20:08:05] mutante: ok [20:08:38] mutante: are you going to do that for all the servers ? [20:10:11] papaul: i can if it's helpful for coordination to work on it with multiple people, yea [20:10:51] or i can summarize it and do "ms-be200[0-9]" or something [20:11:04] ok [20:13:15] technically we say "admin log whenever you enable or disable puppet on a host", and this is the first time it's enabled. 
and since it adds it to icinga you might see icinga-wm talking about it soon after [20:13:25] which makes people wonder.. unless they just saw somebody !log [20:20:36] (03PS16) 10EBernhardson: Refactor monolog handling for kafka logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) [20:26:03] 10Ops-Access-Requests, 6operations: Requesting access to rest base and cassandra nodes - https://phabricator.wikimedia.org/T117473#1775213 (10Nuria) 3NEW [20:32:00] 6operations: reclaim lawrencium to spares - https://phabricator.wikimedia.org/T117477#1775264 (10RobH) 3NEW a:3RobH [20:34:15] 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1775276 (10RobH) 5Open>3declined [20:34:16] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1775277 (10RobH) [20:35:26] (03PS1) 10RobH: reclaim lawrencium to spares [puppet] - 10https://gerrit.wikimedia.org/r/250498 [20:35:52] PROBLEM - swift-object-replicator on ms-be2016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [20:35:53] PROBLEM - swift-object-server on ms-be2016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [20:36:31] papaul: ^ that's what i meant, so we added it to puppet and that adds it to icinga but the service isn't there yet. so i will add a downtime [20:36:42] PROBLEM - puppet last run on ms-be2016 is CRITICAL: CRITICAL: Puppet has 11 failures [20:36:43] (03CR) 10RobH: [C: 032] reclaim lawrencium to spares [puppet] - 10https://gerrit.wikimedia.org/r/250498 (owner: 10RobH) [20:37:13] mutante: got it [20:37:22] PROBLEM - swift-account-auditor on ms-be2016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [20:37:41] PROBLEM - swift-account-reaper on ms-be2016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [20:37:52] PROBLEM - swift-account-replicator on ms-be2016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [20:37:54] that's because they already match a node regex in site.pp [20:38:21] !log lawrencium having puppet stopped and reclaiming into spares per T117477 disregard any alerts [20:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:38:33] so it happens all by itself just when they get added to puppet [20:38:33] !log csteipp@tin Synchronized php-1.27.0-wmf.4/includes/: (no message) (duration: 00m 22s) [20:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:39:15] papaul: it' [20:39:29] mutante: ok [20:39:44] papaul: we can't disable the notifications though before it has been added, so it's kind of a race [20:40:13] mutante: ok [20:41:16] ACKNOWLEDGEMENT - swift-account-auditor on ms-be2016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor daniel_zahn new install [20:41:16] ACKNOWLEDGEMENT - swift-account-reaper on ms-be2016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper daniel_zahn new install [20:41:16] ACKNOWLEDGEMENT - swift-account-replicator on ms-be2016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator daniel_zahn new 
install [20:41:16] ACKNOWLEDGEMENT - swift-object-replicator on ms-be2016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator daniel_zahn new install [20:41:16] ACKNOWLEDGEMENT - swift-object-server on ms-be2016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server daniel_zahn new install [20:44:52] ACKNOWLEDGEMENT - tileratorui on maps-test2003 is CRITICAL: Connection refused daniel_zahn scheduled downtime [20:44:52] ACKNOWLEDGEMENT - tileratorui on maps-test2004 is CRITICAL: Connection refused daniel_zahn scheduled downtime [20:47:03] godog: with new codfw swift servers being added.. is it not true anymore that they are all "swift::storage"? none of them swift::proxy? [20:47:41] since: node /^ms-be20[0-9][0-9]\.codfw\.wmnet$ has the storage role and by just singing the puppet cert and nothing else they already match that [20:49:17] !log ms-be2017 - initial puppet run, fresh install [20:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:49:41] 6operations: reclaim lawrencium to spares - https://phabricator.wikimedia.org/T117477#1775329 (10RobH) [20:50:07] 6operations: reclaim lawrencium to spares - https://phabricator.wikimedia.org/T117477#1775333 (10RobH) a:5RobH>3Cmjohnson Assigning to @cmjohnson for disk wipe and adding in #ops-eqiad. [20:50:22] 6operations, 10ops-eqiad: reclaim lawrencium to spares - https://phabricator.wikimedia.org/T117477#1775336 (10RobH) [20:52:31] RECOVERY - swift-account-auditor on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [20:52:42] RECOVERY - swift-account-reaper on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [20:52:52] RECOVERY - swift-object-replicator on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [20:52:52] RECOVERY - swift-object-server on ms-be2016 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [20:53:01] RECOVERY - swift-account-replicator on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [20:53:26] (03PS1) 10Papaul: Update MAC address for ms-be2020 Bug:T114712 [puppet] - 10https://gerrit.wikimedia.org/r/250500 (https://phabricator.wikimedia.org/T114712) [20:58:30] (03PS1) 10BryanDavis: logstash: Drop replica count to 1 after 21 days [puppet] - 10https://gerrit.wikimedia.org/r/250501 (https://phabricator.wikimedia.org/T117438) [21:00:05] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151102T2100). [21:00:53] (03PS1) 10Rush: phab: allow cc for direct task targeting [puppet] - 10https://gerrit.wikimedia.org/r/250502 [21:02:44] (03PS2) 10Rush: phab: allow cc for direct task targeting [puppet] - 10https://gerrit.wikimedia.org/r/250502 [21:03:09] 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1775382 (10aaron) Ideal would be something like: Core: Very fast dual core (HT is not useful here either) RAM: 96GB RAM (more if cheap). The DIMMs don't have to be super-fast in Mhz... 
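The ms-be2016/ms-be2017 bring-up logged above ("signing puppet certs, salt-key, initial run"), together with mutante's point that the new hosts match the ms-be20xx node regex and so get the swift storage role on their very first run, corresponds roughly to the sequence below; a sketch under the assumption of Puppet 3-era tooling, not a verbatim record of what was typed:

    # On the puppetmaster: sign the new host's certificate request.
    sudo puppet cert sign ms-be2016.codfw.wmnet
    # On the salt master: accept the new minion's key.
    sudo salt-key -a ms-be2016.codfw.wmnet
    # On the new host: first agent run; it matches node /^ms-be20[0-9][0-9]\.codfw\.wmnet$/
    # in site.pp, picks up the swift storage role, and thereby adds the (initially CRITICAL,
    # later ACKed) swift service checks to icinga.
    sudo puppet agent --test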
[21:04:16] (03CR) 10Rush: [C: 032 V: 032] phab: allow cc for direct task targeting [puppet] - 10https://gerrit.wikimedia.org/r/250502 (owner: 10Rush) [21:04:36] 6operations, 7Documentation: Incident response protocol needs a refresh - https://phabricator.wikimedia.org/T89800#1775387 (10faidon) a:3mark @mark was working on it and there is already a new revision out. I'm not sure if this task should be considered done or kept open until it's iterated further. [21:05:32] (03PS2) 10Andrew Bogott: Add labvirt1005 to the nova scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/250475 [21:06:58] (03CR) 10Andrew Bogott: [C: 032] Add labvirt1005 to the nova scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/250475 (owner: 10Andrew Bogott) [21:07:32] 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1775415 (10chasemp) [21:07:34] 6operations, 5Continuous-Integration-Scaling, 3labs-sprint-119: Allow network flow between labs instance and scandium - https://phabricator.wikimedia.org/T116975#1775413 (10chasemp) 5Open>3Resolved this should all work now [21:07:42] PROBLEM - puppet last run on ms-be2017 is CRITICAL: CRITICAL: Puppet has 2 failures [21:11:42] PROBLEM - salt-minion processes on db2042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:11:52] PROBLEM - salt-minion processes on mw1172 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:11:53] PROBLEM - salt-minion processes on cp1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:12:03] PROBLEM - salt-minion processes on mw2095 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:12:12] PROBLEM - salt-minion processes on db2016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:12:42] PROBLEM - salt-minion processes on db1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:12:43] PROBLEM - salt-minion processes on ms-be2016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:12:50] anyone doing salt related things? 
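The burst of salt-minion alerts starting here is the NRPE process check finding zero processes matching its regex; as the recoveries further down show, most minions come back on their own shortly afterwards (the salt/trebuchet connection is discussed a bit later in the log). Reproducing the check by hand on an affected host is straightforward:

    # Same match the icinga/NRPE check uses; prints the number of matching processes.
    pgrep -c -f '^/usr/bin/python /usr/bin/salt-minion'
    # If it prints 0 and the minion does not come back on its own:
    sudo service salt-minion restart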
[21:12:51] PROBLEM - salt-minion processes on db2047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:13:02] PROBLEM - salt-minion processes on mw1236 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:13:02] PROBLEM - salt-minion processes on mw2141 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:13:21] PROBLEM - salt-minion processes on mw2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:13:22] PROBLEM - salt-minion processes on restbase2004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:13:32] PROBLEM - salt-minion processes on mw1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:13:42] PROBLEM - salt-minion processes on es1011 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:13:43] PROBLEM - salt-minion processes on db2053 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:13:43] PROBLEM - salt-minion processes on mw1208 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:14:01] PROBLEM - salt-minion processes on mw2087 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:14:22] PROBLEM - salt-minion processes on cp2014 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:14:31] PROBLEM - Parsoid on wtp1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.006 second response time [21:14:31] PROBLEM - salt-minion processes on dataset1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:14:31] PROBLEM - Parsoid on wtp2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.079 second response time [21:14:31] PROBLEM - Parsoid on wtp2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.072 second response time [21:14:32] PROBLEM - salt-minion processes on oxygen is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:14:32] PROBLEM - Parsoid on wtp1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.007 second response time [21:14:32] PROBLEM - Parsoid on wtp1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.020 second response time [21:14:33] PROBLEM - Parsoid on wtp1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.013 second response time [21:14:42] PROBLEM - Parsoid on wtp2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.077 second response time [21:14:42] PROBLEM - Parsoid on wtp1008 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.007 second response time [21:14:42] PROBLEM - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.014 second response time [21:14:50] PROBLEM - Parsoid on wtp1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.006 second response time [21:14:51] PROBLEM - Parsoid on wtp2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.075 second 
response time [21:15:02] PROBLEM - Parsoid on wtp1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.013 second response time [21:15:02] PROBLEM - Parsoid on wtp1023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.006 second response time [21:15:03] PROBLEM - salt-minion processes on cp4019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:15:11] PROBLEM - Parsoid on wtp2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.084 second response time [21:15:11] PROBLEM - Parsoid on wtp2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.103 second response time [21:15:11] PROBLEM - Parsoid on wtp2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.078 second response time [21:15:12] PROBLEM - Parsoid on wtp1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.005 second response time [21:15:15] (03PS3) 10EBernhardson: Enable A/B test for combined language search. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241226 (https://phabricator.wikimedia.org/T3837) (owner: 10Smalyshev) [21:15:21] PROBLEM - Parsoid on wtp2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.103 second response time [21:15:23] PROBLEM - Parsoid on wtp1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.012 second response time [21:15:23] PROBLEM - Parsoid on wtp2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.097 second response time [21:15:28] gwicke: is this you?^ [21:15:31] PROBLEM - Parsoid on wtp2005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.097 second response time [21:15:31] PROBLEM - Parsoid on wtp2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.097 second response time [21:15:31] PROBLEM - Parsoid on wtp2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.079 second response time [21:15:32] PROBLEM - Parsoid on wtp2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.097 second response time [21:15:33] PROBLEM - Parsoid on wtp1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.011 second response time [21:15:36] what's going on here? 
[21:15:42] PROBLEM - Parsoid on wtp2006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.075 second response time [21:15:43] PROBLEM - Parsoid on wtp1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.003 second response time [21:15:43] PROBLEM - Parsoid on wtp1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.006 second response time [21:15:44] indeed, parsoid just paged and no SAL entreis [21:15:51] PROBLEM - Parsoid on wtp1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.003 second response time [21:15:52] PROBLEM - Parsoid on wtp2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.080 second response time [21:15:53] PROBLEM - Parsoid on wtp1021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.010 second response time [21:15:53] PROBLEM - Parsoid on wtp1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.014 second response time [21:15:53] PROBLEM - Parsoid on wtp1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.030 second response time [21:16:01] RECOVERY - salt-minion processes on mw2087 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:16:01] PROBLEM - Parsoid on wtp2008 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.099 second response time [21:16:08] it seems we have hit a parsoid bug there [21:16:11] <_joe_> subbu: you here? [21:16:11] PROBLEM - Parsoid on wtp1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.004 second response time [21:16:12] PROBLEM - Parsoid on wtp1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.003 second response time [21:16:12] PROBLEM - Parsoid on wtp2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.108 second response time [21:16:12] PROBLEM - Parsoid on wtp1024 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.008 second response time [21:16:12] PROBLEM - Parsoid on wtp1022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.014 second response time [21:16:12] PROBLEM - Parsoid on wtp2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.120 second response time [21:16:12] PROBLEM - Parsoid on wtp2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.093 second response time [21:16:13] PROBLEM - Parsoid on wtp2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.077 second response time [21:16:18] subbu: ? [21:16:21] <_joe_> mobrovac: oh? [21:16:21] PROBLEM - Parsoid on wtp1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1499 bytes in 0.006 second response time [21:16:21] RECOVERY - salt-minion processes on cp2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:16:24] yes. restarting. [21:16:37] let me see if that fixes it. [21:16:41] RECOVERY - Parsoid on wtp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.009 second response time [21:16:42] RECOVERY - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.025 second response time [21:16:42] <_joe_> ok, you are handling this? 
[21:16:50] RECOVERY - salt-minion processes on db2047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:16:51] <_joe_> if possible leave one machine non-restarted [21:16:51] yes. [21:16:59] <_joe_> so that we can inspect it [21:17:09] was this a product of intentional maint or unknown? [21:17:11] i deployed new code. but this shouldn't have happened. [21:17:15] LVS is back to ok [21:17:16] <_joe_> oh ok [21:17:22] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.021 second response time [21:17:28] was there user impact? [21:17:31] <_joe_> mobrovac: I wouldn't trust that to be fully ok [21:17:34] <_joe_> jynus: yes [21:17:35] no sal entry of new deployment either [21:17:42] subbu: please log deployments [21:17:43] RECOVERY - Parsoid on wtp1007 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.015 second response time [21:17:43] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.026 second response time [21:17:51] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.013 second response time [21:17:57] chasemp, i do .. after deployments. [21:18:01] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.010 second response time [21:18:01] i guess i should log start/end? [21:18:05] please yes [21:18:09] ok. will do in future. [21:18:16] thx =] [21:18:18] I don't see significant parsoid errors or latency changes in the restbase metrics [21:18:22] RECOVERY - Parsoid on wtp1014 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.043 second response time [21:18:23] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.053 second response time [21:18:24] 6operations, 10Beta-Cluster-Infrastructure, 5Patch-For-Review: mwdeploy user has shell /bin/bash in labs LDAP and /bin/false in production/Puppet - https://phabricator.wikimedia.org/T67591#1775460 (10hashar) [21:18:26] 6operations, 10Beta-Cluster-Infrastructure, 5Patch-For-Review: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#1775458 (10hashar) 5Open>3Resolved Seems this one has been fixed ages ago. [21:18:31] RECOVERY - salt-minion processes on dataset1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:18:32] RECOVERY - Parsoid on wtp1013 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.024 second response time [21:18:32] RECOVERY - Parsoid on wtp1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.027 second response time [21:18:32] RECOVERY - Parsoid on wtp1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.021 second response time [21:18:33] _joe__everythingmight have restarted. but check wtp1024 [21:18:42] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.034 second response time [21:18:44] i tried ^Z before it could restart but not sure if i caught it. [21:18:51] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.030 second response time [21:18:51] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.021 second response time [21:18:53] RECOVERY - salt-minion processes on cp4019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:18:58] subbu: for clarification… was that flock of errors the result of restarts, or are you restarting in order to fix the errors? 
[21:19:02] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.021 second response time [21:19:04] <_joe_> subbu: if this was effect of a deploy, it's ok, there is not much to see I think [21:19:10] <_joe_> andrewbogott: both [21:19:10] 6operations, 7Shinken: Make the Shinken IRC alert bot use colors - https://phabricator.wikimedia.org/T113785#1775463 (10hashar) [21:19:17] andrewbogott, restart fixed it. [21:19:22] we reorged the repo. [21:19:31] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.023 second response time [21:19:33] and i had added a symlink so that old server.js would still work. [21:19:41] RECOVERY - salt-minion processes on mw1208 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:19:43] Ah, ok. Just making sure someone knows why it went off in the first place, thanks. [21:19:45] but not sure what exactly broke there until i restarted. [21:19:51] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.016 second response time [21:19:51] RECOVERY - Parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.027 second response time [21:20:02] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.016 second response time [21:20:02] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.022 second response time [21:20:02] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.010 second response time [21:20:02] RECOVERY - Parsoid on wtp1024 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.012 second response time [21:20:03] <_joe_> subbu: you did change the code without restarting and that caused this? [21:20:31] RECOVERY - salt-minion processes on db1024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:20:53] _joe_, https://phabricator.wikimedia.org/T115665 is the task in question. [21:20:56] what I do not understand is salt, is it related? [21:20:58] le tme verify deploy and we can continue discussion. [21:21:19] <_joe_> jynus: has to do with deployment which goes through salt [21:21:21] that's the trebuchet restart command [21:21:27] ok, gotcha [21:21:32] RECOVERY - salt-minion processes on cp1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:21:48] from restbase's perspective not much seems to have happened [21:22:22] RECOVERY - salt-minion processes on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:22:24] <_joe_> gwicke: well parsoid was effectively down for a few minutes [21:22:36] definitely [21:23:03] perhaps some stayed up? [21:23:12] https://grafana.wikimedia.org/dashboard/db/restbase?panelId=13&fullscreen [21:23:24] yes, 1013 stayed up since i restarted it as canary. [21:23:32] not much change in latencies either: https://grafana.wikimedia.org/dashboard/db/restbase?panelId=9&fullscreen [21:24:04] although, those graphs are a bit smoothed [21:24:05] why did parsoid.svc.eqiad.wmnet page on failure from icinga? 
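On Krenair's question about the parsoid.svc.eqiad.wmnet page: the LVS/icinga check simply requests the service's root URL (as _joe_ and subbu work out near the end of this log), so once GET / started returning 500s the whole pool went CRITICAL. A sketch of reproducing the check by hand; port 8000 is an assumption about where parsoid listens, not something shown in the log:

    # Against the load-balanced service address:
    curl -sI http://parsoid.svc.eqiad.wmnet:8000/ | head -1
    # Against an individual backend, e.g. the canary that stayed up:
    curl -sI http://wtp1013.eqiad.wmnet:8000/ | head -1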
[21:24:09] do not trust much, downtime was 2 minutes at most and granularity is 1 minute [21:24:41] RECOVERY - salt-minion processes on mw2141 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:24:58] yeah, there is a short blip when I remove the movingMedian [21:25:10] !log deployed parsoid f0d77afc [21:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:25:25] a heads up for that kind of thing would be good gwicke [21:25:38] arlolra is separately verifying the deploy .. but, it is looking good on my end. but, let us investigate why it did go down after a new deploy. [21:25:48] chasemp: I'm not part of this deploy, just saw the errors [21:25:48] i will look at logs. [21:26:09] chasemp, gwicke is not responsible there. [21:26:58] are 5xx still high or is it just me? [21:27:02] RECOVERY - salt-minion processes on mw1172 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:27:07] _joe_, when I ^C-ed the restarts, parsoid on codfw cluster servers have been restarted. is there a simple way to do that? [21:27:10] ah, maybe I misunderstood, I thought: "why did parsoid.svc.eqiad.wmnet page on failure from icinga?" and "yeah, there is a short blip when I remove the movingMedian" were related [21:27:21] *have not* [21:27:21] RECOVERY - salt-minion processes on mw2095 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:27:28] <_joe_> subbu: to restart parsoid in codfw? [21:27:33] yes. [21:28:00] <_joe_> subbu: salt -C 'G@cluster:parsoid and G@site:codfw' or salt 'wtp2*' should do the trick [21:28:02] no, thet are down again [21:28:38] jynus, what is down again? i assume you are not talking about parsoid. i see parsoid up [21:28:52] the failures are down [21:28:53] from restbase's perspective, parsoid is okay [21:28:56] oh. ok. [21:28:56] which is good [21:29:05] subbu: we track 500's etc he is saying they are going down [21:29:06] (request failures) [21:29:13] _joe_, run that on tin? [21:29:21] RECOVERY - salt-minion processes on db2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:29:24] <_joe_> subbu: uhm not sure that would work [21:29:28] <_joe_> lemme do it for you [21:29:32] thanks. :) [21:29:33] <_joe_> subbu: just restart parsoid? [21:29:38] only on codfw [21:29:46] eqiad is done. 
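For reference, the grain-targeted salt command _joe_ quotes above needs an execution function appended before it does anything; a completed form, run from the salt master, might look like the following — the exact restart invocation is an assumption, since the log only shows the targeting expressions:

    # Restart parsoid on the codfw parsoid hosts, a few at a time.
    salt -b 5 -C 'G@cluster:parsoid and G@site:codfw' cmd.run 'service parsoid restart'
    # Equivalent hostname-glob form also mentioned above:
    salt -b 5 'wtp2*' cmd.run 'service parsoid restart'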
[21:30:41] RECOVERY - salt-minion processes on mw1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:31:16] <_joe_> !log restarting parsoids in codfw [21:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:31:22] RECOVERY - Parsoid on wtp2017 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.100 second response time [21:31:22] RECOVERY - Parsoid on wtp2014 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.113 second response time [21:31:22] RECOVERY - Parsoid on wtp2004 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.098 second response time [21:31:22] RECOVERY - Parsoid on wtp2009 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.108 second response time [21:31:42] RECOVERY - Parsoid on wtp2012 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.122 second response time [21:31:42] RECOVERY - Parsoid on wtp2018 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.110 second response time [21:31:43] <_joe_> subbu: ^^ seems to work [21:31:51] RECOVERY - Parsoid on wtp2010 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.109 second response time [21:31:53] RECOVERY - Parsoid on wtp2011 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.120 second response time [21:32:00] thanks. [21:32:11] RECOVERY - Parsoid on wtp2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.122 second response time [21:32:11] RECOVERY - Parsoid on wtp2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.118 second response time [21:32:12] RECOVERY - Parsoid on wtp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.109 second response time [21:32:22] RECOVERY - Parsoid on wtp2016 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.119 second response time [21:32:32] RECOVERY - Parsoid on wtp2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.120 second response time [21:32:32] RECOVERY - salt-minion processes on restbase2004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:32:32] RECOVERY - Parsoid on wtp2019 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.121 second response time [21:32:33] RECOVERY - Parsoid on wtp2005 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.109 second response time [21:32:33] RECOVERY - Parsoid on wtp2007 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.117 second response time [21:32:33] RECOVERY - Parsoid on wtp2020 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.130 second response time [21:32:51] RECOVERY - Parsoid on wtp2006 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.112 second response time [21:32:52] RECOVERY - salt-minion processes on db2053 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:33:02] RECOVERY - Parsoid on wtp2015 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.126 second response time [21:33:03] RECOVERY - Parsoid on wtp2008 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.108 second response time [21:33:43] * subbu still doesn't understand why on new code deploy the previously running processes crashed. [21:33:54] _joe_, gwicke so .. we reorged the repo. [21:34:15] previous server was in api/server.js .. but we moved it around to bin/server.js .. but, I added a symlink from api/server.js to bin/server.js [21:34:23] so that existing puppet code path links won't break. [21:34:49] so, restarts would have picked up the new code just fine. but not sure why the running processes crashed and required a restart. 
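Why a file move on disk should not, by itself, disturb an already-running node process: modules are resolved and read once at startup and then served from the in-memory require cache (gwicke makes the same point a few minutes later, at 21:36). A minimal, generic sketch of that behaviour, not Parsoid code; `./config` is a hypothetical local module:

```javascript
// Node reads a module from disk on the first require() and then serves it
// from require.cache; later require() calls do not touch the filesystem,
// so files moved or replaced on disk only matter after a restart.
const assert = require('assert');

const first = require('./config');   // read from disk, then cached
const second = require('./config');  // same object, served from the cache
assert.strictEqual(first, second);

// Only explicitly evicting the entry forces a fresh read on the next require():
delete require.cache[require.resolve('./config')];
const third = require('./config');   // re-read from disk; a typical object
                                      // export will now be a fresh copy
```

Which is consistent with the crash turning out to be in the Express view-lookup path (worked out below), not in module loading.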
[21:35:19] (03PS1) 10Ricordisamoa: Redirect pietrodn's intersect-contribs to Tool Labs [puppet] - 10https://gerrit.wikimedia.org/r/250516 [21:35:23] RECOVERY - salt-minion processes on oxygen is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:35:26] <_joe_> subbu: maybe it did a stat() on the file to see if it was changed, and the fact it was now a symlink made it crash [21:35:51] RECOVERY - salt-minion processes on mw1236 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:35:58] <_joe_> I dunno enough of nodejs internals, if it wasn't 10:30 PM I'd look into it [21:36:06] <_joe_> I might do that tomorrow [21:36:08] <_joe_> :) [21:36:09] ok. no problem. thanks. [21:36:13] there is no stat() at runtime normally [21:36:23] RECOVERY - salt-minion processes on es1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:36:24] it's all loaded once & then kept in memory, unless [21:36:30] you clear the require cache explicitly [21:36:35] <_joe_> gwicke: this was the most obvious explanation [21:37:03] *nod* [21:37:06] <_joe_> gwicke: it's the main file, not the required ones, I wouldn't be surprised if that was treated differently somehow [21:37:16] subbu: you expected the old process to stay up, but the new to start up in parallel? [21:37:21] <_joe_> but I have to look at node's internals [21:37:24] gwicke, no. [21:37:39] i mean: till i issued an explicit restart, i expected the old processes to continue running. [21:37:43] <_joe_> gwicke: I think he expected the old process not to crash before he stopped/started it [21:37:46] right. [21:38:03] <_joe_> which is... what I'd expect as well [21:38:22] RECOVERY - salt-minion processes on db2042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:38:38] <_joe_> subbu: what I don't get is why crash... I guess we have some logs in the systemd log for parsoid [21:38:46] <_joe_> lemme check [21:38:48] ok. [21:39:08] so, we are 100% sure that only the code was updated at that point? [21:39:39] code and unrelated conf.. but in either case, conf is not reloaded for every req. [21:39:54] https://www.mediawiki.org/wiki/Parsoid/Deployments#Monday.2C_Nov_2.2C_2015_around_1:15_pm_PT:_f0d77afc_to_be_deployed [21:39:54] RECOVERY - salt-minion processes on mw2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:40:16] "Non-functional changes: Reorganization of the Parsoid code repo + code cleanup." so much for non-functional :-) [21:41:27] <_joe_> subbu: Error: Failed to lookup view "home" in views directory "/srv/deployment/parsoid/deploy/src/api/views" [21:41:51] <_joe_> so there was some code that was relying on the old layout of directories after all :) [21:41:58] oh hmm ... [21:42:44] ok, thanks. will take a look what that was .. not that it matters now. [21:43:01] _joe_, curious what url end point does icinga hit? [21:43:05] oh, maybe view caching isn't on in express :( [21:43:13] it's too bad that trebuchet just blindly keeps going [21:43:22] subbu: we probably need to set a "production" flag somewhere [21:44:27] <_joe_> subbu: I guess the root url? [21:44:31] if it hits it home .. ironically that is probably what requires the view home. [21:44:34] yes, it does. [21:45:43] fun .. yes .. our / url renders a static page that requires those views .. so, service up monitoring crashed it. [21:46:14] and arlolra yes .. view caching is probably not turned on .. 
that is what we learn from this as well. [21:46:26] fixing [21:46:37] also, updating code on all nodes at once is a bad idea [21:46:56] yes, another lesson we need to learn [21:47:11] looks like it ... we deploy code on all servers and then restart on 1 server to verify everything being up. [21:47:24] if not, we revert. [21:47:32] we changed that with ansible; not sure about scap3 [21:48:36] auto-abort deploy if more than x% of nodes fail the health check after restart [22:00:05] RoanKattouw: Respected human, time to deploy Flow (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151102T2200). Please do the needful. [22:04:42] (03PS2) 10Dzahn: Update MAC address for ms-be2020 Bug:T114712 [puppet] - 10https://gerrit.wikimedia.org/r/250500 (https://phabricator.wikimedia.org/T114712) (owner: 10Papaul) [22:04:50] (03CR) 10Dzahn: [C: 032] Update MAC address for ms-be2020 Bug:T114712 [puppet] - 10https://gerrit.wikimedia.org/r/250500 (https://phabricator.wikimedia.org/T114712) (owner: 10Papaul) [22:06:13] ACKNOWLEDGEMENT - puppet last run on ms-be2016 is CRITICAL: CRITICAL: Puppet has 4 failures daniel_zahn T117097 [22:06:14] ACKNOWLEDGEMENT - puppet last run on ms-be2017 is CRITICAL: CRITICAL: Puppet has 2 failures daniel_zahn T117097 [22:07:15] _joe_, if you are still around .. curious if you see references to pegTokenizer.pegjs.txt in the syslog. otherwise tomorrow is good. [22:09:14] <_joe_> subbu: nope I don't see that, but the complete trace I see is here: https://phabricator.wikimedia.org/P2268 [22:09:21] arlolra, ^ [22:10:10] home doesn't init the tokenizer [22:10:15] subbu: ^ [22:10:17] right. [22:10:38] but, some request somewhere should have hit a worker before icinga did [22:11:06] <_joe_> subbu: I was specifically grepping parsoid logs on codfw [22:11:11] ah! [22:11:21] <_joe_> so yeah in eqiad you probably would see more errors [22:11:22] codfw is not the active prod cluster. [22:11:31] <_joe_> nope, that's why i used it [22:11:39] <_joe_> less logs to go through [22:11:42] got it. :) [22:12:59] makes sense. [22:13:04] (03CR) 10Dzahn: iridium system-wide gitconfig needs http.proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/250370 (owner: 1020after4) [22:13:31] twentyafterfour: you probably need https_proxy, in addition to http_proxy ^ [22:20:11] 6operations, 10Beta-Cluster-Infrastructure, 5Patch-For-Review: mwdeploy user has shell /bin/bash in labs LDAP and /bin/false in production/Puppet - https://phabricator.wikimedia.org/T67591#1775741 (10yuvipanda) [22:20:13] 6operations, 10Beta-Cluster-Infrastructure, 5Patch-For-Review: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#1775739 (10yuvipanda) 5Resolved>3Open No I just reopened it see T86668#1704524 [22:22:02] (03CR) 10Dzahn: "20after4: check that you have both, HTTP_PROXY but also HTTPS_PROXY." [puppet] - 10https://gerrit.wikimedia.org/r/250370 (owner: 1020after4) [22:26:43] (03CR) 10Dzahn: "Looking at the patch this refers to that mentions it was a performance optimization. Knowing that gitblit crashes quite a bit, i think it " [puppet] - 10https://gerrit.wikimedia.org/r/250453 (https://phabricator.wikimedia.org/T117393) (owner: 10Paladox) [22:27:14] 6operations, 10Phabricator-Bot-Requests, 10procurement, 5Patch-For-Review: update emailbot to allow cc: for #procurement - https://phabricator.wikimedia.org/T117113#1775800 (10chasemp) 5Open>3Resolved [22:28:55] (03CR) 10Dzahn: "is this just a suspicion or what makes you think this is related to crashes? 
i know nothing about the implications of "git.enableGitServle" [puppet] - 10https://gerrit.wikimedia.org/r/250450 (owner: 10Paladox) [22:30:26] (03PS1) 10Catrope: Re-enable Flow on ptwikibooks, with a hack for the namespace conflict [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250576 (https://phabricator.wikimedia.org/T114540) [22:30:33] (03CR) 10Dzahn: "sorry, i don't think i'm qualified to review this. i simply don't know" [puppet] - 10https://gerrit.wikimedia.org/r/250449 (owner: 10Paladox) [22:33:12] (03CR) 10Mattflaschen: "I think you want to put:" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250576 (https://phabricator.wikimedia.org/T114540) (owner: 10Catrope) [22:33:28] (03CR) 10Pietrodn: [C: 031] "Didn't think that the Toolserver URL was still around by now." [puppet] - 10https://gerrit.wikimedia.org/r/250516 (owner: 10Ricordisamoa) [22:34:05] (03PS2) 10Catrope: Re-enable Flow on ptwikibooks, with a hack for the namespace conflict [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250576 (https://phabricator.wikimedia.org/T114540) [22:34:33] (03CR) 10Dzahn: "i don't know about the reasoning to remove it. @ori is that significant for the stable-ness of git.wm.org? @paladox do you miss .bz2 spe" [puppet] - 10https://gerrit.wikimedia.org/r/250447 (owner: 10Paladox) [22:34:57] (03PS1) 10Chad: Fetch scap from Phabricator instead of Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/250578 [22:35:43] (03CR) 10Mattflaschen: [C: 032] Re-enable Flow on ptwikibooks, with a hack for the namespace conflict [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250576 (https://phabricator.wikimedia.org/T114540) (owner: 10Catrope) [22:36:30] (03Merged) 10jenkins-bot: Re-enable Flow on ptwikibooks, with a hack for the namespace conflict [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250576 (https://phabricator.wikimedia.org/T114540) (owner: 10Catrope) [22:37:12] (03CR) 10Dzahn: [C: 031] "i tend to say +1 here, these shortcuts should stay alive and not be broken. that said, i don't know how much performance it costs." [puppet] - 10https://gerrit.wikimedia.org/r/250444 (owner: 10Paladox) [22:37:33] (03CR) 10Chad: "T83702 / git.wm.o aren't in question here, as the commit summary says we're using Phabricator/Diffusion." [puppet] - 10https://gerrit.wikimedia.org/r/224214 (owner: 10Alex Monk) [22:38:20] !log catrope@tin Synchronized wmf-config/CommonSettings.php: Plumbing for wmgFlowEnglishNamespaceOnly (duration: 00m 18s) [22:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:38:39] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Re-enable Flow on ptwikibooks with English-only Topic namespace name (duration: 00m 18s) [22:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:40:19] (03PS1) 10Andrew Bogott: Revert "Add labvirt1005 to the nova scheduler pool" [puppet] - 10https://gerrit.wikimedia.org/r/250580 [22:40:27] (03PS2) 10Andrew Bogott: Revert "Add labvirt1005 to the nova scheduler pool" [puppet] - 10https://gerrit.wikimedia.org/r/250580 [22:42:23] (03CR) 10Andrew Bogott: [C: 032] "I don't like how 1005 is acting... need to do a bit more testing." [puppet] - 10https://gerrit.wikimedia.org/r/250580 (owner: 10Andrew Bogott) [22:43:30] (03CR) 10Rush: [C: 031] "I mean I'm down with this but I'm not sure I'm the deciding vote here. 
But....nice" [puppet] - 10https://gerrit.wikimedia.org/r/250578 (owner: 10Chad) [22:45:21] (03CR) 10Paladox: "@Dzahn I never used it but should be there in case other users would like it. It wasn't causing harm for it to be there." [puppet] - 10https://gerrit.wikimedia.org/r/250447 (owner: 10Paladox) [22:45:46] (03PS5) 10Dzahn: swift: no "if $hostname" in node blocks, use role [puppet] - 10https://gerrit.wikimedia.org/r/250072 [22:46:54] (03CR) 10Paladox: "I think the performance fixes where disabling showing the repo size. But it shows all the branches so doint think its oils cause performan" [puppet] - 10https://gerrit.wikimedia.org/r/250453 (https://phabricator.wikimedia.org/T117393) (owner: 10Paladox) [22:47:03] (03CR) 10Dzahn: swift: no "if $hostname" in node blocks, use role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/250072 (owner: 10Dzahn) [22:48:14] (03PS6) 10Dzahn: swift: no "if $hostname" in node blocks, use role [puppet] - 10https://gerrit.wikimedia.org/r/250072 [22:49:56] (03CR) 10Chad: tune gitblit settings to improve performance (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/250369 (owner: 10Ori.livneh) [22:50:10] (03PS7) 10Dzahn: swift: no "if $hostname" in node blocks, use role [puppet] - 10https://gerrit.wikimedia.org/r/250072 [22:51:08] (03CR) 10Dzahn: "feels to me that ops should merge this once bd808/greg/releng give a GO for it" [puppet] - 10https://gerrit.wikimedia.org/r/250578 (owner: 10Chad) [22:51:32] (03CR) 10Chad: [C: 031] Re enable bzip2 for gitblit downloads [puppet] - 10https://gerrit.wikimedia.org/r/250447 (owner: 10Paladox) [22:54:01] (03CR) 10BryanDavis: [C: 031] "The manual cleanup should only be needed on Trebuchet deploy hosts (tin, deployment-bastion)." [puppet] - 10https://gerrit.wikimedia.org/r/250578 (owner: 10Chad) [22:54:05] (03CR) 10Chad: tune gitblit settings to improve performance (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/250369 (owner: 10Ori.livneh) [22:54:40] (03CR) 10Paladox: tune gitblit settings to improve performance (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/250369 (owner: 10Ori.livneh) [22:57:37] (03PS1) 10Dzahn: gitblit: re-disable gravatar support [puppet] - 10https://gerrit.wikimedia.org/r/250583 [22:57:39] (03CR) 10Paladox: "I keep getting internal error for certain repos like this one here." [puppet] - 10https://gerrit.wikimedia.org/r/250450 (owner: 10Paladox) [22:58:06] (03CR) 10Dzahn: tune gitblit settings to improve performance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/250369 (owner: 10Ori.livneh) [22:58:47] (03CR) 10Dzahn: [C: 032] gitblit: re-disable gravatar support [puppet] - 10https://gerrit.wikimedia.org/r/250583 (owner: 10Dzahn) [22:59:16] mutante: thx [23:02:09] ostriches: yw. i dont really know about bzip2, but the gravatar thing, yea [23:02:28] (03CR) 10Dzahn: tune gitblit settings to improve performance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/250369 (owner: 10Ori.livneh) [23:03:09] Fwiw, his last release was a year ago and he's barely making commits anymore. It's looking to be pretty abandoned now :( [23:03:43] ostriches: who? [23:03:47] gitblit. 
[23:04:17] ah, ok, wasn't sure if you meant gravatar [23:04:32] jouncebot: next [23:04:32] In 0 hour(s) and 55 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151103T0000) [23:07:31] (03CR) 10Legoktm: [C: 032] "beta-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248602 (https://phabricator.wikimedia.org/T116444) (owner: 10Legoktm) [23:08:14] (03Merged) 10jenkins-bot: Set up UrlShortener extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248602 (https://phabricator.wikimedia.org/T116444) (owner: 10Legoktm) [23:08:56] (03CR) 10Peachey88: "@Paladox: When you make a commit related to phabricator task, Could you please link to it in the commit message?" [puppet] - 10https://gerrit.wikimedia.org/r/250450 (owner: 10Paladox) [23:08:58] (03CR) 10BryanDavis: [C: 04-1] "Let's not do this quite yet. Discussing freeing space by dropping some noisy logs first (which was the opposite of my reaction to the gene" [puppet] - 10https://gerrit.wikimedia.org/r/250501 (https://phabricator.wikimedia.org/T117438) (owner: 10BryanDavis) [23:11:39] !log Running convertAllLqtPages.php on ptwikibooks [23:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Mr. Obvious [23:13:05] (03PS2) 10Dzahn: analytics,impala/mysql/refinery: small lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/241318 [23:13:43] RoanKattouw: After that finishes, can LQT go away on said wikis? [23:14:17] (03CR) 10jenkins-bot: [V: 04-1] analytics,impala/mysql/refinery: small lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/241318 (owner: 10Dzahn) [23:14:24] ostriches: No :( because we have to keep being able to render log entries and stuff [23:14:28] So it won't be able to be uninstalled [23:14:33] But it will go away from a user pov [23:14:36] Ah [23:15:14] I wonder if moving the log i18n would work long term [23:15:27] (03PS3) 10Dzahn: analytics,impala/mysql/refinery: small lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/241318 [23:15:38] To WikimediaMessages [23:16:13] (03CR) 10jenkins-bot: [V: 04-1] analytics,impala/mysql/refinery: small lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/241318 (owner: 10Dzahn) [23:19:04] (03CR) 10EBernhardson: "ps4 removed a few variables from the test, they were applied globally in I951d43fe3 so we can do testing with just the query string parame" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241226 (https://phabricator.wikimedia.org/T3837) (owner: 10Smalyshev) [23:19:27] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1775961 (10Eevans) [23:21:11] (03CR) 10Paladox: "I would but I am not sure if it fixes the problem. It looks like it will but I am not sure." [puppet] - 10https://gerrit.wikimedia.org/r/250450 (owner: 10Paladox) [23:22:04] (03PS1) 10Dzahn: analytics/mysql: fix arrow alignments [puppet] - 10https://gerrit.wikimedia.org/r/250586 [23:22:25] (03CR) 10Paladox: "Also it's raw file I've filed a bug for and I doint know if this patche fixes problem. 
But if you look at gitblit project you will find th" [puppet] - 10https://gerrit.wikimedia.org/r/250450 (owner: 10Paladox) [23:24:15] (03CR) 10Dzahn: [C: 032] "only applied on analytics1015, and i want to know where the fail on https://gerrit.wikimedia.org/r/#/c/241318/ even comes from, so reduce " [puppet] - 10https://gerrit.wikimedia.org/r/250586 (owner: 10Dzahn) [23:31:44] (03PS4) 10Dzahn: analytics: impala,refinery: small lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/241318 [23:32:51] (03CR) 10jenkins-bot: [V: 04-1] analytics: impala,refinery: small lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/241318 (owner: 10Dzahn) [23:33:41] (03CR) 10Dzahn: [C: 04-1] "yea, the refinery class still has:" [puppet] - 10https://gerrit.wikimedia.org/r/241318 (owner: 10Dzahn) [23:40:27] (03PS1) 10Alex Monk: Add new VE RESTBase URL config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250590 [23:52:22] (03PS3) 10Dereckson: Add *.unesco.org to server-side upload whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246167 (https://phabricator.wikimedia.org/T115338) [23:52:47] (03PS2) 10Dereckson: Enable WikidataPageBanner on fr.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246169 (https://phabricator.wikimedia.org/T115023) [23:53:13] (03CR) 10Dereckson: "PS2: rebased" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246169 (https://phabricator.wikimedia.org/T115023) (owner: 10Dereckson) [23:53:24] (03CR) 10Dereckson: "PS2: rebased" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246167 (https://phabricator.wikimedia.org/T115338) (owner: 10Dereckson) [23:57:21] (03PS1) 10RobH: adding cyrusone to allowed procurement domains [puppet] - 10https://gerrit.wikimedia.org/r/250594 [23:57:48] 6operations: Make ops-l a list for humans again (no cheating) - https://phabricator.wikimedia.org/T117508#1776107 (10faidon) 3NEW [23:57:55] (03CR) 10RobH: [C: 032] adding cyrusone to allowed procurement domains [puppet] - 10https://gerrit.wikimedia.org/r/250594 (owner: 10RobH) [23:58:11] (03PS1) 10Dereckson: Add www.webarchive.org.uk to server-side upload whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250595 (https://phabricator.wikimedia.org/T116179) [23:58:56] paravoid, I don't think ops-l receives gerrit emails anymore? [23:58:57] 6operations: Make ops-l a list for humans again (no cheating) - https://phabricator.wikimedia.org/T117508#1776126 (10yuvipanda) I think Gerrit ones have been killed for a while and I want to kill the catchpoint ones unless people explicitly object. I think the watchmouse one can go too! That leaves us with priv... [23:59:12] (03CR) 10Jforrester: [C: 031] Add new VE RESTBase URL config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250590 (owner: 10Alex Monk)
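A closing sketch of the Parsoid failure mode identified earlier (the "Failed to lookup view" trace pasted at 22:09 and the view-caching discussion around 21:43–21:46): with Express's `view cache` setting off, which is the default unless `NODE_ENV=production`, every `res.render()` resolves the template file on disk again, so a deploy that moves the views directory out from under a running worker breaks the `/` page that the monitoring probes. Illustrative only, with a hypothetical view engine; this is not Parsoid's actual code:

```javascript
const express = require('express');
const app = express();

app.set('views', __dirname + '/api/views');  // the path the reorg moved away
app.set('view engine', 'ejs');               // hypothetical engine for the sketch

// 'view cache' is off by default outside production, so every render re-resolves
// the template on disk. Setting NODE_ENV=production (or enabling it explicitly,
// as below) makes renders after the first successful one come from memory.
app.set('view cache', true);

app.get('/', (req, res, next) => {
  // Without the cache, this lookup happens per request; once the directory has
  // been moved, it throws "Failed to lookup view 'home' in views directory ...".
  res.render('home', (err, html) => (err ? next(err) : res.send(html)));
});

app.listen(8000);
```

Enabling the cache, or the "production" flag subbu mentions at 21:43, is the kind of change the "fixing" at 21:46 points at.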