[01:13:15] PROBLEM - Host db1061 is DOWN: PING CRITICAL - Packet loss = 100%
[02:23:42] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.18) (duration: 10m 39s)
[02:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:28:47] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Sep 19 02:28:46 UTC 2016 (duration 5m 4s)
[02:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:32:35] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[02:37:23] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[02:39:06] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[02:41:36] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[04:30:29] PROBLEM - Host wtp2019 is DOWN: PING CRITICAL - Packet loss = 100%
[04:37:56] PROBLEM - puppet last run on db2028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:04:56] RECOVERY - puppet last run on db2028 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:58:47] <_joe_> 2 hosts down; thank you weekend
[06:05:07] Operations, ops-codfw: wtp2019 has faulty memory - https://phabricator.wikimedia.org/T146009#2647413 (Joe)
[06:05:16] Operations, ops-codfw: wtp2019 has faulty memory - https://phabricator.wikimedia.org/T146009#2647425 (Joe) p:Triage>Low
[06:25:07] (PS1) Marostegui: db-eqiad.php: Temporarily replace db1062 with db1034 in order to be able to execute an ALTER table. It normally takes around 2 hours to execute this ALTER [mediawiki-config] - https://gerrit.wikimedia.org/r/311375
[06:44:44] Operations, DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647450 (Marostegui) The sysctl settings error looks gone now, and I can read them actually: ``` root@db1082:/proc/sys/net# sysctl -a | wc -l 1702 ``` The offset error looks weird: ``` root@db1082:/proc/sys/net# nt...
[06:57:14] (PS3) Muehlenhoff: deployment_server: Daemonise redis when running on systemd [puppet] - https://gerrit.wikimedia.org/r/311108 (https://phabricator.wikimedia.org/T144578)
[06:59:11] Operations: Integrate jessie 8.6 point release - https://phabricator.wikimedia.org/T146011#2647455 (MoritzMuehlenhoff)
[07:20:03] Operations, DBA: Drop database table "email_capture" from Wikimedia wikis - https://phabricator.wikimedia.org/T57676#2647480 (Marostegui) I have renamed the table in the following codfw hosts: ``` db2034.codfw.wmnet db2042.codfw.wmnet db2048.codfw.wmnet db2055.codfw.wmnet db2062.codfw.wmnet db2069.codfw...
[07:29:10] (PS1) DCausse: Use 30g of heap for relforge jvms [puppet] - https://gerrit.wikimedia.org/r/311377
[07:29:24] (PS1) Giuseppe Lavagetto: puppetdb: expose dashboard via cache-misc [puppet] - https://gerrit.wikimedia.org/r/311378
[07:29:41] <_joe_> dcausse: 30 gb? wow java 8 has got better I guess
[07:40:14] :)
[07:42:56] !log reimaging mw1255-mw1257 to jessie
[07:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:49:32] !log installing updates for file/libmagic from jessie 8.6 point update
[07:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:50:09] !log reimaging mw1191.eqiad.wmnet to jessie
[07:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:03:57] goood morning
[08:04:27] !log renaming tables in S1, S4 and S4 in eqiad before dropping them T54924
[08:04:29] T54924: Drop PovWatch extension-related database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54924
[08:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:04:33] akosiaris: morning happy monday
[08:04:59] marostegui: :-)
[08:06:10] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:16:49] (CR) Alexandros Kosiaris: [C: -1] "couple of inline questions. I basically see 2 race conditions we need to make sure won't happen before we consider this safe" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/311223 (https://phabricator.wikimedia.org/T85459) (owner: Giuseppe Lavagetto)
[08:20:17] (CR) Ema: [C: -1] puppetdb: expose dashboard via cache-misc (2 comments) [puppet] - https://gerrit.wikimedia.org/r/311378 (owner: Giuseppe Lavagetto)
[08:21:12] (CR) Alexandros Kosiaris: "An IP that embeds the "Answer to the Ultimate Question of Life, the Universe, and Everything"[1] ?" [puppet] - https://gerrit.wikimedia.org/r/297315 (https://phabricator.wikimedia.org/T78342) (owner: Hashar)
[08:23:22] Operations, DBA: Drop PovWatch extension-related database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54924#2647521 (Marostegui) I have renamed the table in eqiad hosts (the already exists errors are because those hosts were used as canary: S1 - enwiki: ``` root@neodymium:/home/mar...
[08:28:38] (CR) Giuseppe Lavagetto: "Those race conditions won't happen AFAICT" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/311223 (https://phabricator.wikimedia.org/T85459) (owner: Giuseppe Lavagetto)
[08:28:44] Operations: DegradedArray event on /dev/md/0:copper - https://phabricator.wikimedia.org/T146013#2647530 (ema)
[08:29:34] yuvipanda: did we decide to sync up on the grafana thing today?
[08:33:42] Operations, DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647554 (Marostegui) NTP is now cleared - might be worth a reboot and let's see if it comes back up all fine
[08:36:23] Operations: Integrate jessie 8.6 point release - https://phabricator.wikimedia.org/T146011#2647561 (ema)
[08:37:04] (PS2) Giuseppe Lavagetto: puppetdb: expose dashboard via cache-misc [puppet] - https://gerrit.wikimedia.org/r/311378
[08:37:46] (CR) Giuseppe Lavagetto: puppetdb: expose dashboard via cache-misc (2 comments) [puppet] - https://gerrit.wikimedia.org/r/311378 (owner: Giuseppe Lavagetto)
[08:39:52] PROBLEM - puppet last run on mw2078 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:41:53] <_joe_> ema: ^^ fixed
[08:43:42] (CR) Ema: [C: 1] puppetdb: expose dashboard via cache-misc [puppet] - https://gerrit.wikimedia.org/r/311378 (owner: Giuseppe Lavagetto)
[08:43:48] _joe_: ship it!
[08:43:51] (CR) Addshore: [C: 2] Inline doc for $wgMaxShell* [mediawiki-config] - https://gerrit.wikimedia.org/r/311118 (https://phabricator.wikimedia.org/T145819) (owner: Hashar)
[08:45:13] (PS2) Addshore: Inline doc for $wgMaxShell* [mediawiki-config] - https://gerrit.wikimedia.org/r/311118 (https://phabricator.wikimedia.org/T145819) (owner: Hashar)
[08:45:20] (CR) Addshore: [C: 2] Inline doc for $wgMaxShell* [mediawiki-config] - https://gerrit.wikimedia.org/r/311118 (https://phabricator.wikimedia.org/T145819) (owner: Hashar)
[08:45:44] (Merged) jenkins-bot: Inline doc for $wgMaxShell* [mediawiki-config] - https://gerrit.wikimedia.org/r/311118 (https://phabricator.wikimedia.org/T145819) (owner: Hashar)
[08:48:52] !log addshore@tin Synchronized wmf-config/CommonSettings.php: {{gerrit|311118}} NOOP Some inline comments added (duration: 00m 58s)
[08:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:49:21] !log increase /var/lib/puppet to 50GB on puppetmaster1002, puppetmaster2001, puppetmaster2002
[08:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:49:46] (PS1) Giuseppe Lavagetto: naggen2: order resources by title in puppetDB as well [puppet] - https://gerrit.wikimedia.org/r/311382
[08:52:46] Operations: DegradedArray event on /dev/md/0:copper - https://phabricator.wikimedia.org/T146013#2647646 (faidon)
[08:52:48] Operations, ops-eqiad: Broken disk on copper - https://phabricator.wikimedia.org/T144261#2647649 (faidon)
[08:53:48] (CR) Alexandros Kosiaris: [C: 1] naggen2: order resources by title in puppetDB as well [puppet] - https://gerrit.wikimedia.org/r/311382 (owner: Giuseppe Lavagetto)
[09:00:02] (CR) Alexandros Kosiaris: [C: 1] puppetdb: expose dashboard via cache-misc [puppet] - https://gerrit.wikimedia.org/r/311378 (owner: Giuseppe Lavagetto)
[09:03:49] (CR) Alexandros Kosiaris: [C: 1] "Ok, I suppose it's fine then. But let's run a cross fleet PCC to make sure before we merge it." [puppet] - https://gerrit.wikimedia.org/r/311223 (https://phabricator.wikimedia.org/T85459) (owner: Giuseppe Lavagetto)
[09:04:29] Operations: PROBLEM - Disk space on puppetmaster1001 is CRITICAL: DISK CRITICAL - free space: /var/lib/puppet 1115 MB (3% inode=97%) - https://phabricator.wikimedia.org/T145924#2647686 (akosiaris) I've increased the disk space for /var/lib/puppet on the other puppetmasters as well
[09:07:06] RECOVERY - puppet last run on mw2078 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:07:09] (PS1) MarcoAurelio: Enable ShortURL on Wikimedia Bangladesh chapter wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/311383 (https://phabricator.wikimedia.org/T146014)
[09:07:16] (PS1) Faidon Liambotis: mirrors: disable ETag for Tails, at their request [puppet] - https://gerrit.wikimedia.org/r/311384
[09:07:46] (CR) Faidon Liambotis: [C: 2 V: 2] mirrors: disable ETag for Tails, at their request [puppet] - https://gerrit.wikimedia.org/r/311384 (owner: Faidon Liambotis)
[09:10:19] (CR) MarcoAurelio: "Please consider running 'populateShortUrlTable.php' maintenance script after installation." [mediawiki-config] - https://gerrit.wikimedia.org/r/311383 (https://phabricator.wikimedia.org/T146014) (owner: MarcoAurelio)
[09:20:46] !log invalidated squid cache on carbon
[09:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:26:52] Operations, Traffic, netops: Telkom/8ta (South Africa) users cannot connect to wikimedia sites - https://phabricator.wikimedia.org/T145270#2647722 (faidon) Open>Resolved a:faidon Telkom did not respond despite my repeated emails, but I've just removed the explicit route preference and tra...
[09:27:00] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:32:57] Operations, ops-eqiad, fundraising-tech-ops: rack/setup beryllium replacement frauth1001 - https://phabricator.wikimedia.org/T143902#2647745 (faidon)
[09:32:59] Operations, ops-eqiad, fundraising-tech-ops: Rack/Setup pay-lvs1003 and pay-lvs1004 - https://phabricator.wikimedia.org/T143900#2647746 (faidon)
[09:36:46] Operations, DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647755 (jcrespo) >>! In T145533#2647554, @Marostegui wrote: > NTP is now cleared - might be worth a reboot and let's see if it comes back up all fine In fact, I would reboot it several times to see if it happens again...
[09:38:20] Operations, DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647757 (Marostegui) Makes clear, thanks for giving me context on past issues. I will do that for a few times and by the end of the day I will give it another final reboot and leave it like that for a few days, just...
[09:39:15] db1061 is down, is anyone doing maintenance on it? marostegui?
[09:39:35] jynus: not me
[09:40:19] Operations, LDAP-Access-Requests, TCB-Team, WMDE-Analytics-Engineering, and 2 others: Update wmde LDAP group - https://phabricator.wikimedia.org/T145384#2647758 (ArielGlenn) Open>Resolved a:ArielGlenn This is now done. Closing.
[09:40:28] <_joe_> jynus: I saw it down, but it's not pooled so I guessed you guys would look
[09:40:44] _joe_, thanks, you did exactly the correct thing
[09:40:53] (PS1) Ema: varnish: add varnish-fe restart script [puppet] - https://gerrit.wikimedia.org/r/311387
[09:41:08] I got up at 4am, but did not have the will to comment it seeing it was depooled
[09:41:23] You guys received a text for it?
[09:41:48] us people probably did
[09:42:09] * volans not
[09:42:25] I can take care of that server
[09:42:56] jynus: How did you get up at 4am? I was wondering if I should have gotten an sms for it and didn't
[09:43:17] no, we shouldn't
[09:43:21] I didn't
[09:43:56] sadly, this is one of the few servers that should not page, but it is difficult to control that because mediawiki vs infrastructure
[09:45:59] I have disabled alerting on db1061
[09:46:08] console looks dead or equivalent
[09:46:15] no trace of kernel panic
[09:46:25] (that doesn't mean it wasn't)
[09:46:37] I will hard-reboot it
[09:49:13] I was wondering past week if enabling sar would give us more visibility for this kind of crashes
[09:49:23] Have you guys evaluated it in this environment?
[09:52:06] marostegui: which metric are you missing?
[09:52:18] something something disk space graphite
[09:52:31] (or prometheus)
[09:54:31] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 6 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:54:39] volans: mostly how granular sar is and in the past I have been able to catch stuff via sar that the graphs were not showing, just because of this granularity. Just wondering if it was ever considered :)
[09:55:46] I think we should improve the granularity on our monitoring systems ;)
[10:01:26] Operations, LDAP-Access-Requests, TCB-Team, WMDE-Analytics-Engineering, and 2 others: Update wmde LDAP group - https://phabricator.wikimedia.org/T145384#2647770 (Addshore) Thanks @ArielGlenn!
[10:01:31] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnish]
[10:14:01] (PS1) Muehlenhoff: Always refresh Package/Release/Translation files [puppet] - https://gerrit.wikimedia.org/r/311392
[10:15:46] (CR) Jcrespo: [C: 1] "db1034 should be pooled in the first place at the same time than db1062. When reverting this patch, just add it for the same roles with eq" [mediawiki-config] - https://gerrit.wikimedia.org/r/311375 (owner: Marostegui)
[10:16:12] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:18:37] Operations, Ops-Access-Requests: Access to people.wikimedia.org for Volker_E - https://phabricator.wikimedia.org/T143465#2569112 (ArielGlenn) @Volker_E Just waiting on your signature on https://phabricator.wikimedia.org/L3 done via phabricator, just click on the link and there are instructions. Thanks!
[10:20:47] Operations, Ops-Access-Requests: root access on security-tools instances for Darian Patrick - https://phabricator.wikimedia.org/T138873#2647789 (ArielGlenn) @dpatrick maybe we can get the scripts into puppet or a git repo at any rate as a first step?
[10:21:06] (PS1) Aude: Update aude's ssh key [puppet] - https://gerrit.wikimedia.org/r/311393
[10:23:51] !log powercycle db1061, unresponsive since ~1am
[10:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:24:25] Operations, Ops-Access-Requests: Request for access to stat1003 for Sam Walton - https://phabricator.wikimedia.org/T145788#2640942 (ArielGlenn) @Samwalton9 (Hmm you have two accounts in here, which should we be using?), we need to get manager signoff here on the ticket. I guess this is your first acces...
[10:25:10] volans, I think prometheus will allow that; 1 minute resolution, plus stored data > plotted data
[10:26:48] Operations, Ops-Access-Requests: access request for debt on stat1003, stat1002, and fluorine - https://phabricator.wikimedia.org/T145914#2644948 (ArielGlenn) @debt, we need to get your manager approval, have you sign the L3 document in phabricator, (https://phabricator.wikimedia.org/L3) and the other st...
[10:26:50] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:27:37] Operations, Ops-Access-Requests: Requesting access to stat1003, stat1002, and fluorine for MelodyKramer - https://phabricator.wikimedia.org/T145387#2647797 (elukey) This request looks good to me, I can approve on behalf of the Analytics team (will also notify them). Next step is to wait for ops meeting...
[10:29:06] (CR) NahidSultan: [C: 1] Enable ShortURL on Wikimedia Bangladesh chapter wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/311383 (https://phabricator.wikimedia.org/T146014) (owner: MarcoAurelio)
[10:30:37] jynus: I think marostegui was looking for sub-minutes stuff...
[10:31:42] oh
[10:33:15] Operations, LDAP-Access-Requests, TCB-Team, WMDE-Analytics-Engineering, and 2 others: Update wmde LDAP group - https://phabricator.wikimedia.org/T145384#2647802 (Tobi_WMDE_SW) Cool, thx!
[10:34:12] (PS4) ArielGlenn: set up static nfs lock manager ports for dataset hosts [puppet] - https://gerrit.wikimedia.org/r/310059
[10:35:21] PROBLEM - mediawiki-installation DSH group on mw1256 is CRITICAL: Host mw1256 is not in mediawiki-installation dsh group
[10:35:40] PROBLEM - mediawiki-installation DSH group on mw1255 is CRITICAL: Host mw1255 is not in mediawiki-installation dsh group
[10:36:13] PROBLEM - puppet last run on mw1257 is CRITICAL: Timeout while attempting connection
[10:36:25] (CR) ArielGlenn: [C: 2] set up static nfs lock manager ports for dataset hosts [puppet] - https://gerrit.wikimedia.org/r/310059 (owner: ArielGlenn)
[10:36:39] PROBLEM - mediawiki-installation DSH group on mw1191 is CRITICAL: Host mw1191 is not in mediawiki-installation dsh group
[10:37:21] PROBLEM - puppet last run on mw1191 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 4 minutes ago with 8 failures. Failed resources (up to 3 shown): File[/etc/apache2/mods-available/setenvif.conf],File[/etc/apache2/mods-available/userdir.conf],File[/etc/apache2/mods-available/autoindex.conf],Package[fonts-noto-cjk]
[10:40:35] mw1191 is mine!
[10:40:36] (CR) Faidon Liambotis: [C: 2] Always refresh Package/Release/Translation files [puppet] - https://gerrit.wikimedia.org/r/311392 (owner: Muehlenhoff)
[10:42:02] PROBLEM - Apache HTTP on mw1257 is CRITICAL: Connection refused
[10:42:09] elukey: mw125[5-7] just silenced
[10:42:52] moritzm: elukey, did you hit T145192 while re-imaging? sorry I forgot to mention it before
[10:42:53] T145192: icinga-downtime script waiting forever if host already in downtime - https://phabricator.wikimedia.org/T145192
[10:43:30] RECOVERY - puppet last run on mw1257 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[10:44:17] Operations, Ops-Access-Requests: Request for access to stat1003 for Sam Walton - https://phabricator.wikimedia.org/T145788#2647833 (Samwalton9) @ArielGlenn I'm pretty sure I just have this one, though I may have made an account linked to my WMF account at some point. Either way, this is the one I use. Ap...
[10:44:31] RECOVERY - Apache HTTP on mw1257 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.113 second response time
[10:46:33] volans: probably not, but not sure. at least it didn't have a notable effect on running wmf_auto_reimage
[10:46:49] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:48:04] moritzm: then probably not, atm the hosts for which setting icinga downtime fails will not be reimaged to avoid unnecessary paging
[10:48:57] (PS2) Marostegui: db-eqiad.php: Temporarily replace db1062 with db1034 [mediawiki-config] - https://gerrit.wikimedia.org/r/311375 (https://phabricator.wikimedia.org/T141951)
[10:49:24] !log reimaging mw1249, mw1250, mw1258 to jessie
[10:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:50:17] looking into scb1001
[10:51:41] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[10:52:08] Operations, DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647837 (jcrespo) Just one thing, rebooting would be a great way to test https://gerrit.wikimedia.org/r/#/c/310564/ In fact, I am going to test it on db1061 now, too.
[10:57:10] Operations, ops-eqiad, DBA: Investigate db1061 crash - https://phabricator.wikimedia.org/T146018#2647838 (jcrespo)
[10:57:22] Operations, ops-eqiad, DBA: Investigate db1061 crash - https://phabricator.wikimedia.org/T146018#2647853 (jcrespo)
[10:58:21] (CR) Odder: "There now appears to be enough community consensus on Commons to move forward with this change." [mediawiki-config] - https://gerrit.wikimedia.org/r/309911 (https://phabricator.wikimedia.org/T145010) (owner: Odder)
[10:58:26] (CR) Odder: "There now appears to be enough community consensus on Commons to move forward with this change." [mediawiki-config] - https://gerrit.wikimedia.org/r/309912 (https://phabricator.wikimedia.org/T144689) (owner: Odder)
[10:58:44] PROBLEM - puppet last run on db2011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:58:44] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:59:53] Operations, DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647854 (Marostegui) Sounds good - I have rebooted it twice already and expect to do a few more before the end of the day.
[11:02:23] (CR) MarcoAurelio: [C: -1] "Per I6b0ce297. Do not merge." [mediawiki-config] - https://gerrit.wikimedia.org/r/309062 (https://phabricator.wikimedia.org/T144927) (owner: محمد شعیب)
[11:05:50] (CR) Jcrespo: [C: 1] db-eqiad.php: Temporarily replace db1062 with db1034 [mediawiki-config] - https://gerrit.wikimedia.org/r/311375 (https://phabricator.wikimedia.org/T141951) (owner: Marostegui)
[11:09:36] (CR) Marostegui: [C: 2] db-eqiad.php: Temporarily replace db1062 with db1034 [mediawiki-config] - https://gerrit.wikimedia.org/r/311375 (https://phabricator.wikimedia.org/T141951) (owner: Marostegui)
[11:10:09] (Merged) jenkins-bot: db-eqiad.php: Temporarily replace db1062 with db1034 [mediawiki-config] - https://gerrit.wikimedia.org/r/311375 (https://phabricator.wikimedia.org/T141951) (owner: Marostegui)
[11:11:45] (PS1) Phuedx: Zero: Make remote config explicit [mediawiki-config] - https://gerrit.wikimedia.org/r/311398 (https://phabricator.wikimedia.org/T145227)
[11:13:07] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Temporarily depool db1062 and repool db1034, in order to be able to ALTER a large table. T141951 (duration: 00m 48s)
[11:13:08] T141951: Add local_user_id and global_user_id fields to localuser table in centralauth database - https://phabricator.wikimedia.org/T141951
[11:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:14:29] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[11:23:20] (PS2) Aude: Update aude's ssh key [puppet] - https://gerrit.wikimedia.org/r/311393
[11:23:28] RECOVERY - puppet last run on db2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:23:30] (PS1) ArielGlenn: access to stats1002/3 and fluorine for Melody Kramer [puppet] - https://gerrit.wikimedia.org/r/311400 (https://phabricator.wikimedia.org/T145387)
[11:24:42] (CR) jenkins-bot: [V: -1] access to stats1002/3 and fluorine for Melody Kramer [puppet] - https://gerrit.wikimedia.org/r/311400 (https://phabricator.wikimedia.org/T145387) (owner: ArielGlenn)
[11:24:42] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to stat1003, stat1002, and fluorine for MelodyKramer - https://phabricator.wikimedia.org/T145387#2628342 (ArielGlenn) Is this a sudo request? I thought not; if not we don't need to wait for the meeting. I've got a patchset there bu...
[11:26:16] Operations, ops-eqiad, DBA: Investigate db1061 crash - https://phabricator.wikimedia.org/T146018#2647878 (jcrespo) a:jcrespo
[11:26:20] (PS2) ArielGlenn: access to stats1002/3 and fluorine for Melody Kramer [puppet] - https://gerrit.wikimedia.org/r/311400 (https://phabricator.wikimedia.org/T145387)
[11:26:23] Operations, ops-eqiad, DBA: Investigate db1061 crash - https://phabricator.wikimedia.org/T146018#2647838 (jcrespo) p:Triage>Normal
[11:28:40] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to stat1003, stat1002, and fluorine for MelodyKramer - https://phabricator.wikimedia.org/T145387#2628342 (MoritzMuehlenhoff) The restricted group to which she is added in https://gerrit.wikimedia.org/r/#/c/311400/ enables a sudo rul...
[11:30:53] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to stat1003, stat1002, and fluorine for MelodyKramer - https://phabricator.wikimedia.org/T145387#2647883 (ArielGlenn) All righty, adding it to the meeting agenda.
[11:31:13] Operations, Traffic, netops: Telkom/8ta (South Africa) users cannot connect to wikimedia sites - https://phabricator.wikimedia.org/T145270#2647884 (Cadar) Hi Faidon, thanks for the support on the access question. We seem to be up and connected to Wikipedia fine on this side. I've noticed similar issu...
[11:32:17] !log rebooting again db1061 for upgrade
[11:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:34:48] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to stat1003, stat1002, and fluorine for MelodyKramer - https://phabricator.wikimedia.org/T145387#2647885 (elukey) The restricted group is probably not the best one since we'd grant privileges: ['ALL = (www-data,apache) NOPASSWD: ALL...
[11:35:16] PROBLEM - salt-minion processes on mw1258 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[11:36:45] RECOVERY - mediawiki-installation DSH group on mw1256 is OK: OK
[11:37:06] RECOVERY - mediawiki-installation DSH group on mw1255 is OK: OK
[11:43:04] !log restbase cassandra truncating local_group_wikipedia_T_feed_aggregated.data
[11:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:43:41] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy
[11:43:57] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[11:43:57] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy
[11:43:57] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[11:43:57] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[11:43:58] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[11:43:58] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy
[11:44:06] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy
[11:44:15] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy
[11:44:16] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy
[11:44:26] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy
[11:44:27] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy
[11:44:35] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy
[11:44:46] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy
[11:44:46] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy
[11:44:59] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy
[11:45:31] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy
[11:45:31] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy
[11:45:31] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy
[11:45:31] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy
[12:08:05] RECOVERY - salt-minion processes on mw1258 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[12:09:16] RECOVERY - puppet last run on mw1191 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[12:12:22] (PS1) ArielGlenn: remove nfs mount of labstore host from datasets [puppet] - https://gerrit.wikimedia.org/r/311403
[12:14:17] (CR) Mobrovac: [C: -1] "This patch allows one to use a different key for the same user. I am not sure this is a smart thing to allow. We can easily get into probl" [puppet] - https://gerrit.wikimedia.org/r/311139 (owner: Elukey)
[12:18:44] (CR) ArielGlenn: [C: 2] remove nfs mount of labstore host from datasets [puppet] - https://gerrit.wikimedia.org/r/311403 (owner: ArielGlenn)
[12:26:40] Operations, DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647949 (Marostegui) Every time the server gets restarted NTP alerts until I run the ntp sync manually. I have rebooted again to see how it comes back and what happens if I do not touch it.
[12:31:17] (CR) Muehlenhoff: [C: 2] deployment_server: Daemonise redis when running on systemd [puppet] - https://gerrit.wikimedia.org/r/311108 (https://phabricator.wikimedia.org/T144578) (owner: Muehlenhoff)
[12:31:22] (PS4) Muehlenhoff: deployment_server: Daemonise redis when running on systemd [puppet] - https://gerrit.wikimedia.org/r/311108 (https://phabricator.wikimedia.org/T144578)
[12:32:33] Operations, DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647953 (MoritzMuehlenhoff) How long of a time frame are we talking here? All servers have "Unknown offset" alerts for about 10-20 minutes after a reboot or a service restart of NTP-
[12:32:36] Operations, DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647954 (jcrespo) @Marostegui NTP being off for some minutes is "normal" (Known limitation with low priority) What it was an issue/strange is it being off for hours/days.
[12:33:43] (CR) Elukey: "Reporting some details of the conversation with Marko in IRC:" [puppet] - https://gerrit.wikimedia.org/r/311139 (owner: Elukey)
[12:33:46] apergos: ok to puppet-merge your patch along?
[12:33:48] Operations, DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647961 (jcrespo) @MoritzMuehlenhoff see above: >>! In T145533#2645185, @jcrespo wrote: >> Check size of conntrack table >> >> Notifications for this service have been disabled >> WARNING 2016-09-17 03:19:29 1d 13...
[12:33:54] oh heck
[12:33:55] yes please
[12:33:58] ok
[12:34:20] I ran puppet on the hosts and forgot the actual merge :-/
[12:34:38] now merged
[12:34:55] Operations, DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2647964 (Marostegui) >>! In T145533#2647953, @MoritzMuehlenhoff wrote: > How long of a time frame are we talking here? All servers have "Unknown offset" alerts for about 10-20 minutes after a reboot or a service restart...
[12:40:08] thank you
[12:40:12] going to go rerun puppet :-/
[12:51:06] hashar: ready for swat? ;) what's the plan for today?
[12:51:45] !log adding mw1191 back to serving traffic after reimage
[12:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:52:43] (Abandoned) Marostegui: Install timelimit package in the database servers. It is useful to limit the execution time of a script or external scripts, ie: tcpdump to capture traffic for a given time and not for a given amount of packages or mb [puppet/mariadb] - https://gerrit.wikimedia.org/r/311093 (owner: Marostegui)
[12:52:46] PROBLEM - NFS on ms1001 is CRITICAL: Connection refused
[12:53:01] zeljkof: push the stuff as usual :D
[12:53:16] PROBLEM - puppet last run on ms-be2027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:54:33] hashar: what I wanted to ask: are you doing the swat, should I, do we pair?
[12:54:53] (CR) BBlack: [C: -1] puppetdb: expose dashboard via cache-misc (1 comment) [puppet] - https://gerrit.wikimedia.org/r/311378 (owner: Giuseppe Lavagetto)
[12:56:28] zeljkof: doing some data collection for last week incident
[12:56:31] so I guess you can ?
[12:58:57] zeljkof: I have CR+2 the couple mediawiki changes for mobrovac
[12:59:18] !log installing wget updates from jessie 8.6 point update
[12:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:59:25] eg https://gerrit.wikimedia.org/r/311405 and https://gerrit.wikimedia.org/r/311406
[13:00:01] hashar: ok, will do my best!
[13:00:04] hashar, Dereckson, addshore, and aude: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160919T1300).
[13:00:04] Urbanecm, yurik, and mobrovac: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:00:33] Present
[13:00:59] hashar: already syncing my patches?
[13:01:09] mobrovac: they are in ci
[13:01:12] k
[13:01:46] (CR) Hashar: [C: 1] Throttling rule for RCL [mediawiki-config] - https://gerrit.wikimedia.org/r/311086 (https://phabricator.wikimedia.org/T145838) (owner: Urbanecm)
[13:01:58] (CR) Hashar: [C: 1] [throttle] Allow the same number of accounts [mediawiki-config] - https://gerrit.wikimedia.org/r/311087 (owner: Urbanecm)
[13:02:12] (CR) Hashar: [C: 1] Throttle for RCL [mediawiki-config] - https://gerrit.wikimedia.org/r/311088 (https://phabricator.wikimedia.org/T145838) (owner: Urbanecm)
[13:02:24] zeljkof: you can do all three Urbanecm patches in one go ^ :]
[13:02:25] here
[13:02:47] yurik: https://gerrit.wikimedia.org/r/#/c/311373/ is against wmf.19
[13:03:02] yurik: we do not run wmf.19 yet, the train got halted last week. Is that needed on wmf.18?
[13:03:06] RECOVERY - NFS on ms1001 is OK: TCP OK - 0.007 second response time on port 2049
[13:03:16] I can SWAT today!
[13:03:20] hashar, oops, it should be 18, rebasing...
[13:03:27] yurik: actually both
[13:03:44] (CR) Mobrovac: "I think you are confusing the service user with the deployment user, Elukey. Your problem can be solved by either:" [puppet] - https://gerrit.wikimedia.org/r/311139 (owner: Elukey)
[13:03:47] yurik: we currently run wmf.18. Will hopefully get wmf.19 this week. So the patch would have to be in both :]
[13:03:52] hashar, its needed in the latest
[13:04:09] ok
[13:04:21] i can repick it to 18 as well, just to cover all bases
[13:04:34] hashar: ok, should I start with patches from Urbanecm?
[13:04:44] but there is 5 of them, not 3
[13:04:56] not sure what you wanted to say
[13:05:02] hashar, 311410
[13:05:10] Please tell me when I'll be needed. Thanks in advance.
[13:05:12] i will updated depl page [13:05:16] 06Operations: Rebuild apache2 against new version from jessie point update - https://phabricator.wikimedia.org/T146023#2648032 (10MoritzMuehlenhoff) [13:05:32] Urbanecm: looking at your patches now [13:05:45] is there puppet swat today? [13:06:02] Okay. [13:06:09] Urbanecm: can you test your patches at mw1099? [13:07:51] I'll test all patches I can. [13:08:29] zeljkof: all the throttling ones are good to go :] [13:08:55] Ehm. Are they live on mw1099 now? I can't see any change. [13:08:57] the throttling ones, we can not test them [13:09:15] But I can test 309823 for example. [13:09:21] (03CR) 10Hashar: [C: 031] Enable WikidataPageBanner on itwikiwoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309823 (https://phabricator.wikimedia.org/T145328) (owner: 10Urbanecm) [13:09:22] Or what patches are you processing? [13:09:28] Config one or throttling one? [13:09:34] Urbanecm: sorry, I am new to deployments, and slow... working on it [13:09:35] I can't test throttling patches anyway. [13:09:48] Okay, maybe I'm faster than I should be ;) [13:09:52] 06Operations, 10Traffic, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#2648055 (10BBlack) Where we're at on basic bug-mechanics explanation: 1. `LRU_Fail` has nothing to do with the nuke_limit. `LRU_Fail` means that within an atte... [13:09:54] (03CR) 10Hashar: [C: 031] Add WT namespace alias to NS_PROJECT in mywiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311327 (https://phabricator.wikimedia.org/T140998) (owner: 10Urbanecm) [13:10:10] hashar: Why are you marking my patches as +1? 
[13:10:15] (03CR) 10Yurik: [C: 031] "looks good and can be deployed at any time (will be a noop at first)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311398 (https://phabricator.wikimedia.org/T145227) (owner: 10Phuedx) [13:10:36] PROBLEM - NFS on ms1001 is CRITICAL: Connection refused [13:10:44] zeljkof: I am pushing the couple mediawiki/core changes mobrovac proposed while you are handling the mediawiki-config changes :) [13:11:12] hashar and zeljkof: Who is deploying? I'm confused. [13:11:27] Urbanecm: looks like both of us [13:12:04] Urbanecm: merging your throttling changes, will ping you in a few minutes [13:12:09] Okay. [13:12:13] (03CR) 10Giuseppe Lavagetto: puppetdb: expose dashboard via cache-misc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/311378 (owner: 10Giuseppe Lavagetto) [13:12:23] (03PS2) 10Zfilipin: [throttle] Allow the same number of accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311087 (owner: 10Urbanecm) [13:12:42] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1003, stat1002, and fluorine for MelodyKramer - https://phabricator.wikimedia.org/T145387#2648067 (10MelodyKramer) I don't think I need flourine access. Just the Webrequest logs! Sorry, and thanks for clarifying Luca! Mel [13:13:00] mobrovac: your patches are on mw1099 if that is testable there. 
Else I will just sync the whole cluster :) [13:13:12] hashar: sure, lemme test [13:13:47] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311087 (owner: 10Urbanecm) [13:14:19] (03Merged) 10jenkins-bot: [throttle] Allow the same number of accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311087 (owner: 10Urbanecm) [13:14:43] (03PS4) 10Zfilipin: Throttle for RCL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311088 (https://phabricator.wikimedia.org/T145838) (owner: 10Urbanecm) [13:15:39] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311088 (https://phabricator.wikimedia.org/T145838) (owner: 10Urbanecm) [13:15:47] hashar: all good, sync it away! [13:16:05] (03Merged) 10jenkins-bot: Throttle for RCL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311088 (https://phabricator.wikimedia.org/T145838) (owner: 10Urbanecm) [13:17:26] Urbanecm: ok, looks like I have messed up :( [13:17:33] What happened? [13:17:33] what is happoening? [13:17:47] I have merged the first two patches on the list [13:17:48] but the third one now says there is a merge conflict [13:18:03] !log hashar@tin Synchronized php-1.28.0-wmf.18/includes/api/ApiQueryBacklinksprop.php: API: Force straight join for prop=linkshere|transcludedin|fileusage T145079 (duration: 00m 50s) [13:18:03] rebase it ! ? [13:18:04] T145079: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079 [13:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:18:14] I can't see any conflict... [13:18:17] hashar: merge conflict while rebasing [13:18:23] where ? [13:18:27] Number please :) [13:18:27] this one https://gerrit.wikimedia.org/r/#/c/311086/ [13:18:41] hashar: I [13:18:46] RECOVERY - puppet last run on ms-be2027 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [13:18:49] hashar: I'll merge it manually, okay? 
[13:18:57] !log hashar@tin Synchronized php-1.28.0-wmf.19/includes/api/ApiQueryBacklinksprop.php: API: Force straight join for prop=linkshere|transcludedin|fileusage T145079 (duration: 00m 47s) [13:19:02] mobrovac: done [13:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:19:08] this two merged fine https://gerrit.wikimedia.org/r/#/c/311087/ https://gerrit.wikimedia.org/r/#/c/311088/ [13:19:09] Urbanecm: yeah [13:19:19] Okay, working on it hashar [13:19:53] 06Operations, 10ChangeProp, 10DBA, 10MediaWiki-API, and 7 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2619222 (10hashar) Deployed on current wmf.18 as well as next wmf.19. [13:20:07] git-review -x 311086 && git-review [13:20:09] magic :] [13:20:17] (03PS2) 10Giuseppe Lavagetto: naggen2: order resources by title in puppetDB as well [puppet] - 10https://gerrit.wikimedia.org/r/311382 [13:20:23] Urbanecm, hashar: I have assumed that the patches should be merged according to the order on the deployments page [13:20:33] well [13:20:39] what I do usually [13:20:54] on my local machine I update the repo / get to tip of branch [13:21:06] then cherry pick each of the listed patches in order with git-review -x [13:21:16] then send that back to Gerrit and CR+2 each of them [13:21:31] This isn't possible. There is merge conflict which is needed to be solved manually. [13:22:18] So you must rebase it with git rebase master, then open throttle.php and fix conflicts [13:22:42] 06Operations, 10Traffic, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#2648085 (10BBlack) In the meta sense, it seems like it's a "known issue" that the `file` storage backend simply doesn't scale for certain combinations of workloa... 
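The cherry-pick routine described above (update the local repo to the tip of the branch, cherry-pick each listed patch in order with git-review -x, then send the result back to Gerrit) can be sketched as a dry run. This is only an illustration under assumptions: the function name swat_plan is hypothetical, the branch name master is taken from the "git rebase master" remark, and the script prints the commands rather than running them. A merge conflict such as the one on 311086 would still need to be resolved by hand.

```shell
#!/bin/sh
# Dry-run sketch of the routine described in the chat.
# swat_plan is a hypothetical helper; it only prints commands, it runs nothing.
swat_plan() {
  branch="$1"
  shift
  # Get to the tip of the branch first.
  echo "git checkout $branch"
  echo "git pull --ff-only"
  # Cherry-pick each listed Gerrit change, in the order given.
  for change in "$@"; do
    echo "git-review -x $change"
  done
  # Send the resulting commits back to Gerrit for CR+2.
  echo "git-review"
}

# The three throttling changes from this SWAT window, in deployment-page order.
swat_plan master 311086 311087 311088
```

Printing the plan first makes it easy to check the order against the deployments page before anything touches the repository.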
[13:22:48] Urbanecm: let me know when you have fixed merge conflict for 311086 [13:22:52] Okay [13:23:24] hashar: you have deployed mobrovac's couple of patches? [13:23:37] yes [13:23:53] ok, great, and thanks [13:24:46] (03PS4) 10Urbanecm: Throttling rule for RCL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311086 (https://phabricator.wikimedia.org/T145838) [13:25:16] zeljkof: It should be done in PS4 [13:25:26] Urbanecm: great, on it [13:25:27] Please check it twice :) [13:25:32] (03PS3) 10Giuseppe Lavagetto: puppetdb: expose dashboard via cache-misc [puppet] - 10https://gerrit.wikimedia.org/r/311378 [13:25:40] Urbanecm: anything specific to check for? [13:26:10] Please compare the PS3 version and PS4. It shouldn't differ. [13:26:30] Because PS4 is a rebase even it isn't marked as it. [13:26:33] thnx hashar! [13:26:53] Urbanecm: this? https://gerrit.wikimedia.org/r/#/c/311086/3..4/wmf-config/throttle.php [13:26:57] but there is difference [13:27:41] mobrovac: off topic, I will be in Rijeka Thursday-Sunday, are you around? ;) [13:28:10] Yes... [13:28:34] Urbanecm: I am confused, should I merge 311086? or is there a problem? [13:28:44] Looking at it, please wait a moment. [13:28:56] Urbanecm: sure [13:29:27] zeljkof: i'm in morocco :) [13:29:39] mobrovac: nevermind then :D [13:29:44] (03PS1) 10Elukey: Improve resilience during varnish restarts [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/311415 (https://phabricator.wikimedia.org/T138747) [13:30:01] zeljkof: Look at https://gerrit.wikimedia.org/r/#/c/311086/4/wmf-config/throttle.php. Why it is changing the rule which already exist? This shouldn't be the correct state... [13:30:03] Going to fix it. [13:30:33] Urbanecm: not sure what you did during rebase :) [13:30:46] RECOVERY - NFS on ms1001 is OK: TCP OK - 0.000 second response time on port 2049 [13:30:50] I know... [13:30:55] But I'm only thinking loudly. [13:31:36] Urbanecm: the best way to think during deployment! 
:) [13:33:08] Not only during it :) [13:33:48] (03PS2) 10Elukey: Improve resilience during varnish restarts [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/311415 (https://phabricator.wikimedia.org/T138747) [13:34:02] I think 311086 can be deployed. I've checked throttle.php with tasks and it fits... [13:34:21] Urbanecm: ok, merging it then [13:34:27] Okay, thanks. [13:35:05] hashar, if you pinged, my net went down [13:35:18] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311086 (https://phabricator.wikimedia.org/T145838) (owner: 10Urbanecm) [13:35:32] it must be accident :) [13:35:42] (03Merged) 10jenkins-bot: Throttling rule for RCL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311086 (https://phabricator.wikimedia.org/T145838) (owner: 10Urbanecm) [13:35:52] Urbanecm: will be at mw1099 in a minute, can you test there? [13:36:09] zeljkof: Throttling patches can't be tested at mw1099 [13:36:15] It must be synced to the whole cluster. [13:36:31] Urbanecm: ok, in that case pushing to the entire known universe [13:36:34] Okay. [13:36:35] can you test there? ;) [13:36:52] No, throttling patches aren't testable. [13:37:16] (03PS3) 10Elukey: Improve resilience during varnish restarts [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/311415 (https://phabricator.wikimedia.org/T138747) [13:37:20] (03PS4) 10Giuseppe Lavagetto: puppetdb: expose dashboard via cache-misc [puppet] - 10https://gerrit.wikimedia.org/r/311378 [13:38:44] (03PS1) 10Giuseppe Lavagetto: Add records for puppetdb-$site.w.o [dns] - 10https://gerrit.wikimedia.org/r/311416 [13:39:36] 06Operations, 10Traffic, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#2648172 (10BBlack) Currently we're surviving by scheduling a very conservative (in the sense of avoiding LRU_Fail) daily restart of all the cache_upload backend... 
[13:41:05] RECOVERY - mediawiki-installation DSH group on mw1191 is OK: OK [13:44:47] (03PS1) 10Urbanecm: Throttle for RCL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311417 (https://phabricator.wikimedia.org/T145838) [13:47:01] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:311086|Throttling rule for RCL (T145838)]] [[gerrit:311087|[throttle] Allow the same number accounts]] [[gerrit:311088|Throttle for RCL]] (duration: 00m 47s) [13:47:03] T145838: IP lift cap for Research Commons Librarian - 19/09/16 and 21/09/16 - https://phabricator.wikimedia.org/T145838 [13:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:47:33] Urbanecm: throttling commits deployed [13:47:45] moving to your config changes [13:47:48] zeljkof: Okay. [13:47:54] (03Abandoned) 10Giuseppe Lavagetto: puppetdb: expose dashboard via cache-misc [puppet] - 10https://gerrit.wikimedia.org/r/311378 (owner: 10Giuseppe Lavagetto) [13:48:04] apologies for being slow, I am new to deployments [13:48:27] Nothing happened. [13:48:45] Urbanecm: what do you mean? what should happen? [13:48:59] Nothing happened due to your slowness. [13:49:06] Urbanecm: oh, I see [13:49:06] Is it more clear? [13:49:13] it is [13:49:17] Great [13:49:40] yurik: have you got your patch cherry picked/rebased? [13:49:41] I thought you were testing something and did not see a change you were expecting [13:49:54] hashar, both patches are ready [13:50:05] Urbanecm: can you test your config changes at mw1099? [13:50:16] Are they live? 
[13:50:19] (03PS1) 10KartikMistry: apertium-br-fr: Rebuild for Jessie [debs/contenttranslation/apertium-br-fr] - 10https://gerrit.wikimedia.org/r/311418 [13:50:20] yurik: will do them while zeljko baby sit the mw config ones [13:50:22] (I mean now) [13:50:33] kk [13:50:42] (03PS3) 10Zfilipin: Enable WikidataPageBanner on itwikiwoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309823 (https://phabricator.wikimedia.org/T145328) (owner: 10Urbanecm) [13:50:59] zeljkof: But yes, I can test all of my config changes at mw1099. [13:51:04] Urbanecm: not yet, just asking [13:51:11] Okay. Yes, I'm able to do it. [13:51:14] great, as soon as CI is done, I will ping you [13:51:18] Okay. [13:51:38] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309823 (https://phabricator.wikimedia.org/T145328) (owner: 10Urbanecm) [13:52:03] (03Merged) 10jenkins-bot: Enable WikidataPageBanner on itwikiwoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309823 (https://phabricator.wikimedia.org/T145328) (owner: 10Urbanecm) [13:53:34] yurik: ok they are on mw1099 :) [13:53:48] yurik: please test there and remember wikis run wmf.18. [13:53:58] if that is ok I will sync the whole cluster [13:54:02] (03CR) 10Elukey: "So what I wanted to do was to have pivot/pivot as user/group of these files but with correct permissions I might also use something like d" [puppet] - 10https://gerrit.wikimedia.org/r/311139 (owner: 10Elukey) [13:54:02] testing... [13:54:07] Urbanecm: 309823 is on mw1099, please test and let me know if I can push further [13:54:15] Okay, testing... 
[13:54:43] (03PS2) 10Zfilipin: Add WT namespace alias to NS_PROJECT in mywiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311327 (https://phabricator.wikimedia.org/T140998) (owner: 10Urbanecm) [13:55:12] hashar, good to go [13:55:15] !log Europe SWAT extended as we still have some patches to process [13:55:18] yurik: neat :] [13:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:55:39] zeljkof: WikidataPageBanner appeared in Special:Version so probably all is ok. [13:55:54] Urbanecm: great, deploying [13:56:10] Okay, thanks! [13:56:29] yurik: I am letting them deploy the other changes. Will sync Graph after and poke you back once done :] [13:56:50] ok [13:56:52] thx [13:57:08] hashar, i also plan to do a graphoid depl via scap3, but that shouldn't affect anyone [13:57:34] cc: gehel [13:58:36] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:309823|Enable WikidataPageBanner on itwikiwoyage (T145328)]] (duration: 00m 48s) [13:58:38] T145328: Activation of WikidataPageBanner extension on Italian Wikivoyage - https://phabricator.wikimedia.org/T145328 [13:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:58:52] Urbanecm: deployed [13:58:55] (03CR) 10Mobrovac: "Yes, this is exactly my proposition. Moreover, there is consensus that using the same user for deploying and running the service is a bad " [puppet] - 10https://gerrit.wikimedia.org/r/311139 (owner: 10Elukey) [13:58:56] moving to the last commit [13:59:00] Thanks, going to close. 
[13:59:35] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311327 (https://phabricator.wikimedia.org/T140998) (owner: 10Urbanecm) [14:00:12] (03Merged) 10jenkins-bot: Add WT namespace alias to NS_PROJECT in mywiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311327 (https://phabricator.wikimedia.org/T140998) (owner: 10Urbanecm) [14:00:22] Urbanecm: sorry, did not understand, what are you going to close? [14:00:50] !log hashar@tin Synchronized php-1.28.0-wmf.18/extensions/Graph: Fixed wikiraw: protocol bug T146010 (duration: 00m 48s) [14:00:52] T146010: Graphs with wikiraw: protocol no longer work - https://phabricator.wikimedia.org/T146010 [14:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:00:56] (03CR) 10Elukey: "Yes definitely this is a good point" [puppet] - 10https://gerrit.wikimedia.org/r/311139 (owner: 10Elukey) [14:00:57] zeljkof: The task! [14:01:04] Urbanecm: oh :) [14:01:06] makes sense [14:01:09] (03Abandoned) 10Elukey: Add the deployment_key class variable to service::node [puppet] - 10https://gerrit.wikimedia.org/r/311139 (owner: 10Elukey) [14:01:33] Yeah :) [14:01:45] !log hashar@tin Synchronized php-1.28.0-wmf.19/extensions/Graph: Fixed wikiraw: protocol bug T146010 (duration: 00m 47s) [14:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:02:20] Urbanecm: 311327 is at mw1099, please test [14:02:28] testing... [14:03:01] zeljkof: take as much extra time as needed :] [14:03:03] we can overlap [14:03:09] It works zeljkof [14:03:14] hashar: finishing :) [14:03:19] Urbanecm: great, deploying... 
[14:03:24] Thanks [14:04:32] (03PS1) 10KartikMistry: apertium-nno-nob: Rebuild for Jessie [debs/contenttranslation/apertium-nno-nob] - 10https://gerrit.wikimedia.org/r/311422 [14:05:07] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:311327|Add WT namespace alias to NS_PROJECT in mywiktionary (T140998)]] (duration: 00m 47s) [14:05:08] T140998: set 'WT' namespace alias to NS_PROJECT in my.wiktionary - https://phabricator.wikimedia.org/T140998 [14:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:05:22] Urbanecm, hashar: everything deployed [14:05:34] zeljkof: Thanks! [14:05:45] !log EU SWAT finished [14:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:06:00] thanks morebots [14:06:06] (03PS2) 10KartikMistry: apertium-br-fr: Rebuild for Jessie [debs/contenttranslation/apertium-br-fr] - 10https://gerrit.wikimedia.org/r/311418 [14:06:15] Urbanecm: thanks for the patience, things will get faster with time :) [14:06:22] morebots is a bot, it can't read :) [14:06:22] I am a logbot running on tools-exec-1212. [14:06:23] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [14:06:23] To log a message, type !log <msg>. [14:06:43] thanks for the help hashar, we would not deploy half if I was the only one deploying :) [14:06:54] Urbanecm: I know, it's a joke between the two of us ;) [14:07:00] :) [14:07:06] * zeljkof winks at morebots [14:07:28] Thanks for all deploys anyway and thanks all interested people! [14:07:56] I am checking the logs to see if anything spikes [14:08:07] and taking a short break [14:09:09] urandom: all done? [14:09:45] (03PS4) 10Elukey: Improve resilience during varnish restarts [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/311415 (https://phabricator.wikimedia.org/T138747) [14:10:16] !log European SWAT is complete.
[14:10:21] assuming it is complete :D [14:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:10:47] !log reimaging mw1246-mw1248 to jessie [14:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:18:28] 06Operations, 10Analytics-Cluster, 13Patch-For-Review: decom titanium - https://phabricator.wikimedia.org/T145666#2648309 (10Ottomata) :O :) Thank you for everybody’s help on this! [14:21:01] !log adding aqs1004 to live traffic - aqs.svc.eqiad.wmnet - T144497 [14:21:03] T144497: Switch AQS to new cluster - https://phabricator.wikimedia.org/T144497 [14:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:21:50] (03CR) 10Ottomata: "K sounds like Filippo is out today, so hopefully we can do this tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/300548 (https://phabricator.wikimedia.org/T116035) (owner: 10Ottomata) [14:21:54] hashar, did you depl graph stuff? [14:22:27] (03Abandoned) 10Ottomata: Attempt to move eventbus::eventbus role back to just eventbus where it belongs [puppet] - 10https://gerrit.wikimedia.org/r/260038 (owner: 10Ottomata) [14:22:34] (03PS3) 10Giuseppe Lavagetto: naggen2: order resources by title in puppetDB as well [puppet] - 10https://gerrit.wikimedia.org/r/311382 [14:22:36] (03PS1) 10Giuseppe Lavagetto: puppetdbquery: add ability to order resources [puppet] - 10https://gerrit.wikimedia.org/r/311423 [14:22:38] (03PS1) 10Giuseppe Lavagetto: ssh::client: order by host name with puppetdb as well [puppet] - 10https://gerrit.wikimedia.org/r/311424 [14:23:03] gehel, would i step on your toes if i deploy graphoid via scap3? 
[14:23:26] (03Abandoned) 10Ottomata: [WIP] Proof of concept for reqstats overhaul using varnishncsa instances per metric [puppet] - 10https://gerrit.wikimedia.org/r/208192 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [14:23:35] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] naggen2: order resources by title in puppetDB as well [puppet] - 10https://gerrit.wikimedia.org/r/311382 (owner: 10Giuseppe Lavagetto) [14:24:58] (03CR) 10Ottomata: "OH, somehow I missed this review long ago. Yes 100% ready to go." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287741 (owner: 10Ottomata) [14:25:40] (03Abandoned) 10Ottomata: Update eventlogging kafka consumer args to match with python-confluent-kafka consumer [puppet] - 10https://gerrit.wikimedia.org/r/298000 (https://phabricator.wikimedia.org/T133779) (owner: 10Ottomata) [14:25:46] yurik: should [14:25:59] hashar, looks good, just tested [14:26:43] (03CR) 10Daniel Kinzler: "How confident are we about the conversion factors given here? Changing conversion factors later is a major pain, right?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311206 (owner: 10Smalyshev) [14:26:55] yurik: yeah confirmed your patch is on both .18 and .19 [14:27:05] thanks! [14:27:12] 06Operations, 06Services, 13Patch-For-Review, 07Service-deployment-requests, 15User-mobrovac: New service request - PDF Render - https://phabricator.wikimedia.org/T143129#2648325 (10mobrovac) p:05Triage>03Normal [14:28:27] !log uploaded apache 2.4.10-10+deb8u7+wmf1 for jessie-wikimedia to carbon [14:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:30:56] 06Operations: Rebuild apache2 against new version from jessie point update - https://phabricator.wikimedia.org/T146023#2648341 (10MoritzMuehlenhoff) 05Open>03Resolved Rebuilt and uploaded [14:31:33] thanks moritzm! [14:31:44] yurik: sorry for the delay (me on vacation today), no, please go ahead! 
[14:32:04] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetdbquery: add ability to order resources [puppet] - 10https://gerrit.wikimedia.org/r/311423 (owner: 10Giuseppe Lavagetto) [14:32:08] gehel, ah, enjoy! you are in the deployment schedule :) i will take over it then :) [14:34:33] (03CR) 10Giuseppe Lavagetto: [C: 032] ssh::client: order by host name with puppetdb as well [puppet] - 10https://gerrit.wikimedia.org/r/311424 (owner: 10Giuseppe Lavagetto) [14:35:02] (03CR) 10Ottomata: Improve resilience during varnish restarts (032 comments) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/311415 (https://phabricator.wikimedia.org/T138747) (owner: 10Elukey) [14:35:40] !log depl graphoid https://gerrit.wikimedia.org/r/#/c/311374/ [14:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:37:25] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.203 second response time [14:38:13] (03CR) 10Elukey: Improve resilience during varnish restarts (032 comments) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/311415 (https://phabricator.wikimedia.org/T138747) (owner: 10Elukey) [14:40:37] (03PS5) 10Elukey: Improve resilience during varnish restarts [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/311415 (https://phabricator.wikimedia.org/T138747) [14:40:48] !log testing nfs export performance on labstore1004/1005 cluster [14:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:42:16] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.321 second response time [14:49:26] (03CR) 10Alex Monk: "Nope, just wondered if it meant anything. Was looking at IPs in the repo similar to labs ones." 
[puppet] - 10https://gerrit.wikimedia.org/r/297315 (https://phabricator.wikimedia.org/T78342) (owner: 10Hashar) [14:52:05] (03CR) 10Alex Monk: "Why add full restricted rights to allow fluorine login when you could just add mw-log-readers?" [puppet] - 10https://gerrit.wikimedia.org/r/311400 (https://phabricator.wikimedia.org/T145387) (owner: 10ArielGlenn) [14:52:48] (03CR) 10Alex Monk: "There's also no wait for sudo discussion that way" [puppet] - 10https://gerrit.wikimedia.org/r/311400 (https://phabricator.wikimedia.org/T145387) (owner: 10ArielGlenn) [14:53:47] 06Operations, 10Ops-Access-Requests: access request for debt on stat1003, stat1002, and fluorine - https://phabricator.wikimedia.org/T145914#2648423 (10Deskana) Approved. [14:56:26] PROBLEM - Varnishkafka log producer on cp3034 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [14:57:45] PROBLEM - Varnishkafka log producer on cp3035 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:01:39] (03PS1) 10Giuseppe Lavagetto: puppetdbquery: fix argument ordering [puppet] - 10https://gerrit.wikimedia.org/r/311429 [15:05:24] PROBLEM - mediawiki-installation DSH group on mw1248 is CRITICAL: Host mw1248 is not in mediawiki-installation dsh group [15:05:53] PROBLEM - mediawiki-installation DSH group on mw1247 is CRITICAL: Host mw1247 is not in mediawiki-installation dsh group [15:06:23] PROBLEM - mediawiki-installation DSH group on mw1246 is CRITICAL: Host mw1246 is not in mediawiki-installation dsh group [15:06:25] 06Operations: Have Diamond collect Linux KSM metrics on Ganeti hosts - https://phabricator.wikimedia.org/T146038#2648454 (10hashar) [15:06:47] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetdbquery: fix argument ordering [puppet] - 10https://gerrit.wikimedia.org/r/311429 (owner: 10Giuseppe Lavagetto) [15:07:12] PROBLEM - puppet last run on mw1247 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:07:23] PROBLEM - salt-minion processes on mw1247 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:10:42] PROBLEM - Apache HTTP on mw1246 is CRITICAL: Connection refused [15:11:23] PROBLEM - DPKG on mw1246 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:11:42] RECOVERY - Varnishkafka log producer on cp3034 is OK: PROCS OK: 1 process with command name varnishkafka [15:12:32] PROBLEM - Apache HTTP on mw1248 is CRITICAL: Connection refused [15:13:06] RECOVERY - Varnishkafka log producer on cp3035 is OK: PROCS OK: 1 process with command name varnishkafka [15:13:23] RECOVERY - Apache HTTP on mw1246 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.677 second response time [15:13:46] RECOVERY - DPKG on mw1246 is OK: All packages OK [15:15:02] RECOVERY - Apache HTTP on mw1248 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.076 second response time [15:17:23] hashar: with one? [15:17:31] gotta find a jobrunner :( [15:17:36] the dsh groups are gone apparently [15:17:40] mw1001 [15:17:52] PROBLEM - Apache HTTP on mw1247 is CRITICAL: Connection refused [15:17:54] that and mw1016, mw1010 are my favs [15:17:56] 06Operations, 10Mail: vfowler@wikimedia.org sending bounceback - https://phabricator.wikimedia.org/T146036#2648503 (10Krenair) [15:18:01] bah [15:18:06] apparently mw1001-16 have been decommissioned [15:18:32] https://phabricator.wikimedia.org/T139353 [15:18:33] for real?
[15:18:40] no favourite allowed :O [15:18:48] * hashar digs in puppet [15:19:01] yeah just look at site.pp [15:19:08] so we have mw1161-1169 mw1299 mw3000-mw3006 [15:19:12] I am going to pick mw1299 :D [15:19:27] I remember ori added mw1161-1169 [15:19:34] I guess the 3k ones are new [15:19:47] another thing I noticed is that the jobrunner services only write to /var/log/mediawiki/ on each server [15:20:00] and it does not seem their logs are sent up to the central syslog / logstash :( [15:20:23] $ tail jobrunner.log [15:20:23] tail: cannot open ‘jobrunner.log’ for reading: Permission denied [15:20:36] ok [15:20:43] AaronSchulz: if I do $services->getDBLoadBalancerFactory()->getExternalLB( 'clusterName', 'dbname' )->getConnection()->select(/**/) would that be running a query on the db with the name 'dbname' ? I'm getting turned around a bit by the wiki vs dbname areas! [15:20:55] that is enough. I am switching career, buying a field and going to grow those tomatoes with addshore [15:21:16] hashar: :D Can be grow olives and grapes too? ;) [15:21:26] *we, man I need a drink.... [15:21:40] addshore: once we have a second round of investor and lot of VC money to "waste" on new fields and olives/grapes. Yeah surely [15:22:26] (03Abandoned) 10Giuseppe Lavagetto: Add records for puppetdb-$site.w.o [dns] - 10https://gerrit.wikimedia.org/r/311416 (owner: 10Giuseppe Lavagetto) [15:22:55] addshore: also found the jobrunner runs the curl commands using sh ... [15:26:40] Hi, do you think someone can help me out with my phab ticket? It's a little urgent :( https://phabricator.wikimedia.org/T146036#2648503 [15:26:52] (03CR) 10Ottomata: [C: 031] "Cool, ja I don't like wait_for_10s because it sounds more like a function or action name, rather than a static variable that just represen" [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/311415 (https://phabricator.wikimedia.org/T138747) (owner: 10Elukey) [15:26:59] addshore: yes [15:27:06] AaronSchulz: great!
:) [15:27:20] addshore: e.g. or external store servers have dbnames like enwiki, commonswiki,..., each with blobs tables [15:27:26] *our [15:28:46] AaronSchulz: I cant remember the cluster / dbname for the shared extension dbs, but I would access them using that call basically? passing in the shared cluster and extension db dbname? [15:29:05] yeah [15:29:10] for wmf? we use extension1 [15:29:24] thats the one! :) great! [15:29:48] (03PS1) 10Giuseppe Lavagetto: Switch codfw and ulsfo to puppetmaster2001/puppetdb [dns] - 10https://gerrit.wikimedia.org/r/311435 [15:30:26] 06Operations, 10MediaWiki-JobRunner: wikidev people cant read /var/log/mediawiki/jobrunner.log - https://phabricator.wikimedia.org/T146040#2648543 (10hashar) [15:30:36] addshore: I fixed an OOM bug in https://gerrit.wikimedia.org/r/#/c/259660/7/src/RedisJobService.php [15:30:40] * AaronSchulz keeps looking [15:31:03] ahh, nevermind [15:31:06] yeh, I just spotted your edit! good catch (its been a long while since I have looked at that) [15:31:08] it already checked :) [15:31:23] still good for sanity though [15:31:23] josephine: if no one responds, try pinging the person on "ops clinic duty" in channel topic (today it's apergos) [15:31:45] in theory it's someone else as of monday but they might be in sf or so [15:33:14] PROBLEM - Varnishkafka log producer on cp1050 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:33:17] (03PS1) 10Giuseppe Lavagetto: puppetmaster: use puppetdb everywhere, configure accordingly [puppet] - 10https://gerrit.wikimedia.org/r/311436 [15:33:37] hm you guys do maintain these addresses, we basically have nothing to do with them. 
let me see if I can figure anything out from the error messages [15:33:51] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-br-fr: Rebuild for Jessie [debs/contenttranslation/apertium-br-fr] - 10https://gerrit.wikimedia.org/r/311418 (owner: 10KartikMistry) [15:36:14] (03CR) 10Alexandros Kosiaris: [C: 04-1] "LGTM, just blocking this for a day to write an report handler that will also get the servermon DB updated, since it will no longer be upda" [puppet] - 10https://gerrit.wikimedia.org/r/311436 (owner: 10Giuseppe Lavagetto) [15:36:50] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-nno-nob: Rebuild for Jessie [debs/contenttranslation/apertium-nno-nob] - 10https://gerrit.wikimedia.org/r/311422 (owner: 10KartikMistry) [15:36:58] (03PS1) 10RobH: robh back from vacation [puppet] - 10https://gerrit.wikimedia.org/r/311437 [15:37:51] MatmaRex: Ok, thank you! [15:38:01] 06Operations, 10Ops-Access-Requests: access request for debt on stat1003, stat1002, and fluorine - https://phabricator.wikimedia.org/T145914#2648575 (10debt) Hi @ArielGlenn - I think I've got everything signed and done. Please let me know if I missed anything! [15:38:05] <_joe_> akosiaris: do you think we should block on that? [15:38:12] RECOVERY - Apache HTTP on mw1247 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 7.439 second response time [15:38:13] RECOVERY - Varnishkafka log producer on cp1050 is OK: PROCS OK: 1 process with command name varnishkafka [15:38:35] _joe_: yes, couple of hours up to a day, not more [15:38:36] apergos: Do you know who it might be? [15:39:07] josephine: who might be... what? 
sorry [15:39:29] apergos: Oh sorry, I just saw your other message about looking into the error messages [15:39:37] mw1246-mw1248 have been reimaged, just silenced them [15:39:38] sorry [15:39:53] josephine: yes, I am looking at it but this is basically a black box for ops, OIT handles it all [15:40:02] I was going to see if I could get any hints from the errors [15:40:04] (03CR) 10RobH: [C: 032] robh back from vacation [puppet] - 10https://gerrit.wikimedia.org/r/311437 (owner: 10RobH) [15:40:28] (03PS2) 10RobH: robh back from vacation [puppet] - 10https://gerrit.wikimedia.org/r/311437 [15:40:34] josephine: https://phabricator.wikimedia.org/T145800 [15:40:38] have a look at that [15:40:41] <_joe_> akosiaris: ok cool [15:40:45] I am betting it is that [15:40:56] byron: ^ [15:43:06] 06Operations, 10Mail: vfowler@wikimedia.org sending bounceback - https://phabricator.wikimedia.org/T146036#2648610 (10akosiaris) [15:43:08] 06Operations, 10Mail: Delivery failed to eng-admin - https://phabricator.wikimedia.org/T145800#2648611 (10akosiaris) [15:43:19] akosiaris: good catch [15:44:41] 06Operations, 10Mail: vfowler@wikimedia.org sending bounceback - https://phabricator.wikimedia.org/T146036#2648404 (10akosiaris) I am blocking this on T146036. The account vfowler@wikimedia.org does not exist indeed on the OIT LDAP mirrors in production, which explains the behavior on described on this task an... [15:47:29] 06Operations, 10Traffic, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#2648639 (10BBlack) I've done some analysis on our 1/1000 sampled data on oxygen yesterday, so the next couple of posts go into the methodology, data, and results... 
[15:47:36] 06Operations, 10Analytics: Remove cronspam from stat1002 to root@ - https://phabricator.wikimedia.org/T145606#2635815 (10Milimetric) p:05Normal>03Low [15:48:21] (03PS1) 10Faidon Liambotis: mirrors: fix autoindex for tails [puppet] - 10https://gerrit.wikimedia.org/r/311444 [15:48:46] (03CR) 10Faidon Liambotis: [C: 032 V: 032] mirrors: fix autoindex for tails [puppet] - 10https://gerrit.wikimedia.org/r/311444 (owner: 10Faidon Liambotis) [15:50:02] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-br-fr_0.5.0~r61325-1+wmf1 [15:50:02] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-nno-nob_1.1.0~r66076-1+wmf1 [15:50:04] T107306: Package apertium (and dependencies) for Jessie - https://phabricator.wikimedia.org/T107306 [15:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:52:33] AaronSchulz: any clue how the daemon is made to log to /var/log/mediawiki/jobrunner.log ? I cant find anything :( [15:52:34] RECOVERY - puppet last run on mw1247 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:52:42] RECOVERY - salt-minion processes on mw1247 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:54:25] (03CR) 10Alexandros Kosiaris: [C: 04-1] "This is quite fine, comments inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/308679 (https://phabricator.wikimedia.org/T144588) (owner: 10KartikMistry) [15:57:16] hashar: the upstart config [15:57:28] 06Operations, 10MediaWiki-JobRunner: wikidev people cant read /var/log/mediawiki/jobrunner.log - https://phabricator.wikimedia.org/T146040#2648749 (10hashar) The jobrunner services writes to stdout/stderr. Apparently with systemd that is caught and send to syslog, then we have some rsyslog configuration from... [15:57:46] AaronSchulz: it is no more inupstart but systemd. 
it catch stdout/stderr and send them to syslog :/ [15:58:26] AaronSchulz: we then have a rule in rsyslog to catch the lines and write them to /var/log/mediawiki/ details at https://phabricator.wikimedia.org/T146040#2648749 [15:58:55] AaronSchulz: long story short, I havent deployed the three patches you merged in :] I am now in a meeting [15:59:03] RECOVERY - Host ms-be1022 is UP: PING OK - Packet loss = 0%, RTA = 1.72 ms [15:59:11] AaronSchulz: thanks [15:59:41] i'm backporting a few things to our currently deployed wikibase, just in case new version of core is deployed but not new wikibase [16:00:19] (we do intend to deploy new code, though, this week) [16:05:39] (03CR) 10Lydia Pintscher: [C: 04-1] "As I said in the ticket I think we should start with only a few units for one dimension and see how that goes. This could be length for ex" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311206 (owner: 10Smalyshev) [16:06:50] RECOVERY - mediawiki-installation DSH group on mw1248 is OK: OK [16:07:19] RECOVERY - mediawiki-installation DSH group on mw1247 is OK: OK [16:07:52] RECOVERY - mediawiki-installation DSH group on mw1246 is OK: OK [16:08:33] (03PS1) 10Andrew Bogott: Puppet Horizon Panel: Deploy on Labs [puppet] - 10https://gerrit.wikimedia.org/r/311449 (https://phabricator.wikimedia.org/T91990) [16:15:45] (03CR) 10KartikMistry: WIP: Apertium: Update new packages for Jessie migration (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/308679 (https://phabricator.wikimedia.org/T144588) (owner: 10KartikMistry) [16:15:55] (03PS2) 10KartikMistry: Apertium: Update new packages for Jessie migration [puppet] - 10https://gerrit.wikimedia.org/r/308679 (https://phabricator.wikimedia.org/T144588) [16:19:21] PROBLEM - puppet last run on elastic2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:23:21] PROBLEM - puppet last run on elastic2008 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:24:28] Operations, ops-eqiad: Broken disk on copper - https://phabricator.wikimedia.org/T144261#2648943 (Cmjohnson) The status is we are waiting on @fgiunchedi to give users time to migrate anything from their home dir. [16:24:59] Operations, ops-eqiad, media-storage: diagnose failed(?) sda on ms-be1022 - https://phabricator.wikimedia.org/T140597#2648944 (Cmjohnson) HP Tech came today and replaced the backplane. [16:36:03] Operations, Puppet, ORES, Revision-Scoring-As-A-Service-Backlog: Clean up puppet & configs for ORES - https://phabricator.wikimedia.org/T142002#2648990 (Halfak) [16:42:53] Operations, Mail: mx1001/2001 - Exim SMTP - Certificate expires Sep 22 2016 - https://phabricator.wikimedia.org/T144568#2649010 (RobH) Order approval for these is on the #procurement S4 sub-task. Implementation details will be tracked here. [16:43:07] (CR) Lydia Pintscher: [C: 1] "Seems ok now based on discussion in ticket."
[mediawiki-config] - https://gerrit.wikimedia.org/r/308430 (https://phabricator.wikimedia.org/T144687) (owner: Urbanecm) [16:44:40] RECOVERY - puppet last run on elastic2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:47:58] (CR) Andrew Bogott: [C: 2] Puppet Horizon Panel: Deploy on Labs [puppet] - https://gerrit.wikimedia.org/r/311449 (https://phabricator.wikimedia.org/T91990) (owner: Andrew Bogott) [16:48:31] RECOVERY - puppet last run on elastic2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:54:41] (CR) Paladox: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/280624 (https://phabricator.wikimedia.org/T131340) (owner: Catrope) [16:54:53] (CR) jenkins-bot: [V: -1] De-deploy the MoodBar extension [mediawiki-config] - https://gerrit.wikimedia.org/r/280624 (https://phabricator.wikimedia.org/T131340) (owner: Catrope) [16:57:14] (PS1) Alexandros Kosiaris: puppetmaster: Ping before sending requests to backend [puppet] - https://gerrit.wikimedia.org/r/311457 [16:57:51] (PS3) Jforrester: De-deploy the MoodBar extension [mediawiki-config] - https://gerrit.wikimedia.org/r/280624 (https://phabricator.wikimedia.org/T131340) (owner: Catrope) [17:00:04] gehel: Dear anthropoid, the time has come. Please deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160919T1700). [17:04:47] (CR) Urbanecm: "@Lydia Okay. Is it correct to have this deployed ASAP? I mean in the nearest EU/Morning SWAT window.
If so, notice me and I'll schedule it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308430 (https://phabricator.wikimedia.org/T144687) (owner: 10Urbanecm) [17:04:54] (03PS2) 10Urbanecm: Change $wgArticleCountMethod in Wikidata from default ('link') to 'any' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308430 (https://phabricator.wikimedia.org/T144687) [17:09:32] (03PS1) 10Andrew Bogott: Puppet Panel: check for empty hiera before we start enumerating keys. [puppet] - 10https://gerrit.wikimedia.org/r/311461 [17:11:30] (03CR) 10Andrew Bogott: [C: 032] Puppet Panel: check for empty hiera before we start enumerating keys. [puppet] - 10https://gerrit.wikimedia.org/r/311461 (owner: 10Andrew Bogott) [17:13:14] (03CR) 10Dzahn: [C: 04-1] "for some reason it's always CRIT for other non-temperature reasons, like power supply or case instrusion (when testing on lead), even when" [puppet] - 10https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205) (owner: 10Dzahn) [17:14:30] (03CR) 10Dzahn: "root@lead:~# /usr/local/lib/nagios/plugins/check_ipmi_sensor -T temperature" [puppet] - 10https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205) (owner: 10Dzahn) [17:15:35] (03CR) 10Dzahn: "-x is to exclude certain sensors, but already tried, gotta try more" [puppet] - 10https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205) (owner: 10Dzahn) [17:17:30] (03CR) 10Dzahn: "-vvv to show sensor IDs. example "2" is "Intrusion". but with -x 2 we still see System Board Intrusion = Critical" [puppet] - 10https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205) (owner: 10Dzahn) [17:20:46] (03CR) 10Smalyshev: "Daniel, see https://phabricator.wikimedia.org/T117032 I've described differences with GNU units on conversion factors there. 
Mostly for ex" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311206 (owner: 10Smalyshev) [17:21:29] (03PS1) 10Jcrespo: mariadb: Add additional node (db1061) for mysqld_safe testing [puppet] - 10https://gerrit.wikimedia.org/r/311465 (https://phabricator.wikimedia.org/T145378) [17:22:40] is it just me or gerrit has lower latency? [17:23:12] (03PS2) 10Smalyshev: Add config for units on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311206 (https://phabricator.wikimedia.org/T117032) [17:24:04] (03CR) 10Jcrespo: [C: 032] mariadb: Add additional node (db1061) for mysqld_safe testing [puppet] - 10https://gerrit.wikimedia.org/r/311465 (https://phabricator.wikimedia.org/T145378) (owner: 10Jcrespo) [17:32:33] 06Operations, 06Commons, 06Multimedia: Deploy a PHP and HHVM patch (Exif values retrieved incorrectly if they appear before IFD) - https://phabricator.wikimedia.org/T140419#2649299 (10matmarex) [17:35:38] 06Operations, 10Mail: Delivery failed to eng-admin - https://phabricator.wikimedia.org/T145800#2649315 (10bbogaert) Hi @MoritzMuehlenhoff, I added the yubikey attribute for two-factor -vpn. The following attribute has been added to core.schema: attributetype ( 1.0 NAME 'yubikey' DESC 'Serial number... [17:36:08] !log Reset wikitech/horizon 2fa for Greg per request [17:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:36:21] twentyafterfour: are you doing phab maint? it just ate my task [17:36:41] yuvipanda: no [17:36:46] ate your task? [17:37:02] I hit 'save' and then I got a 500 [17:37:11] equest from 50.1.84.14 via cp4002 cp4002, Varnish XID 301678640 [17:37:12] Error: 503, Backend fetch failed at Mon, 19 Sep 2016 17:35:40 GMT [17:37:14] try refreshing [17:37:27] if you hit back you'll lose it but it should resubmit? [17:37:29] ah, uneaten. 
nice [17:37:46] must have been a fluke, I don't see any problems on iridium [17:38:13] 06Operations, 06Labs, 10PAWS, 06Research-and-Data, 10hardware-requests: Purchase new labsdbs for PAWS / Quarry - https://phabricator.wikimedia.org/T146061#2649334 (10chasemp) p:05Triage>03Normal [17:46:45] 06Operations, 06Labs, 06Research-and-Data, 10hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2649413 (10yuvipanda) [17:46:59] 06Operations, 06Labs, 06Research-and-Data, 10hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2649428 (10yuvipanda) [17:47:03] 06Operations, 06Labs, 10PAWS, 06Research-and-Data, 10hardware-requests: Purchase new labsdbs for PAWS / Quarry - https://phabricator.wikimedia.org/T146061#2649430 (10yuvipanda) [17:52:37] (03PS1) 10Dzahn: admin: create shell account for Sam Walton [puppet] - 10https://gerrit.wikimedia.org/r/311473 (https://phabricator.wikimedia.org/T145788) [17:52:46] (03CR) 10Alex Monk: Remove mediawiki02 from deployment prep (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/310796 (https://phabricator.wikimedia.org/T144006) (owner: 10Elukey) [17:54:50] (03PS9) 10Yuvipanda: labs: Add a standalone puppetmaster role [puppet] - 10https://gerrit.wikimedia.org/r/311163 [17:55:02] 06Operations, 06Labs, 06Research-and-Data, 10hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2649465 (10chasemp) p:05Triage>03Normal [17:55:26] (03PS2) 10Dzahn: admin: create shell account for Sam Walton [puppet] - 10https://gerrit.wikimedia.org/r/311473 (https://phabricator.wikimedia.org/T145788) [17:55:47] 06Operations, 10Traffic, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#2649474 (10BBlack) Diving a bit deeper on interpreting the above into useful planning: first, again, a few 
salient higher-level points: * **Bin4 (64M-1G) Issues... [18:00:04] anomie, ostriches, thcipriani, hashar, and twentyafterfour: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160919T1800). [18:00:05] Urbanecm: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [18:00:13] Present [18:00:38] I can SWAT today [18:00:50] No problem with it :) [18:01:06] thcipriani: I'm in a bit of a hurry, can you please swat mine first? [18:01:10] if at all possible [18:01:36] mafk: sure, np [18:01:42] ty [18:02:06] (03PS2) 10Thcipriani: Enable ShortURL on Wikimedia Bangladesh chapter wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311383 (https://phabricator.wikimedia.org/T146014) (owner: 10MarcoAurelio) [18:02:13] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311383 (https://phabricator.wikimedia.org/T146014) (owner: 10MarcoAurelio) [18:02:44] (03Merged) 10jenkins-bot: Enable ShortURL on Wikimedia Bangladesh chapter wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311383 (https://phabricator.wikimedia.org/T146014) (owner: 10MarcoAurelio) [18:03:17] * mafk enables x-wikimedia-debug [18:03:20] mafk: live on mw1099, check please [18:04:39] thcipriani: MediaWiki internal error. [18:04:41] Exception caught inside exception handler. [18:04:42] Set $wgShowExceptionDetails = true; at the bottom of LocalSettings.php to show detailed debugging information. [18:04:52] not sure why I got this [18:04:57] on mw1099 [18:04:58] 06Operations, 10Domains, 10Traffic, 06WMF-Legal: Use .wiki domains instead of .org on wiki sites owned by wikimedia foundation - https://phabricator.wikimedia.org/T145907#2649521 (10Aklapper) Clarification: If anyone wants to discuss the use of .wiki TLDs, please use T88873 - Thanks. (I merely closed T145... 
[18:05:19] PROBLEM - puppet last run on ms-be1022 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 11 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[xfs_label-/dev/sda3],Exec[xfs_label-/dev/sdb3],Exec[parted-/dev/sdn] [18:06:01] mafk: update.php wasn't executed. [18:06:22] Urbanecm: and maybe populateShortURLtable either [18:06:30] The extension requires shorturls tables which doesn't exist so it can't run... [18:06:37] 'cause I don't see the shorturl link in the sidebar [18:06:53] yep, the docs say "optional", heh [18:07:03] thcipriani: can't check without the tables it seems [18:07:06] The table doesn't exist so how can I populate it? [18:07:19] (not only me, everybody of course) [18:07:20] Urbanecm: a script is used for that [18:07:21] mafk: yup. I can create. [18:07:30] on terbium? [18:07:33] IDK [18:07:36] too many servers [18:07:44] thcipriani: I added a patch, not sure why jouncebot didn't get it [18:07:59] A database query error has occurred. This may indicate a bug in the software. [18:08:01] Function: ShortUrlUtils::decodeURL [18:08:01] Amir1: yeah, I saw it. [18:08:02] Error: 1146 Table 'bdwikimedia.shorturls' doesn't exist (10.64.16.191) [18:08:29] 06Operations, 06Labs, 06Research-and-Data-Backlog, 10hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2649537 (10ggellerman) [18:11:05] mafk: table created [18:11:12] I'm checking [18:12:19] mafk: do I need to run populateShortUrlTable? [18:12:40] thcipriani: I'd say yes because I still don't see the links in the sidebar [18:12:56] however special:shorturl does not throw exceptions anymore, nor wiki pages [18:13:39] mafk: done. 
[18:14:01] !log ran on terbium: mwscript extensions/ShortUrl/populateShortUrlTable.php --wiki=bdwikimedia [18:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:14:41] https://bd.wikimedia.org/s/9z --> 404 [18:14:44] :S [18:15:16] rewrite rules maybe? [18:15:20] oh well, maybe because it's 1099? [18:15:38] last time I did shorturl there was no mw1099 :) [18:17:25] hrm, the 404 has the server: mw1099 header...so I'm not clear if that's it. [18:17:26] thcipriani: I'd say to get this live in production and test [18:17:38] oh uhm [18:18:45] ahd it's giving strange shortulrs, usualy they have more numbers and letters [18:18:52] and* [18:19:49] thcipriani: not sure what to say [18:20:30] mafk: I'm going to roll-back the commit for now. I will comment on the task that I've created tables and ran maintenance but we couldn't verify working. [18:20:40] acceptable? [18:20:42] thcipriani: I agree it's better [18:20:50] mafk: kk, sorry :(( [18:21:00] thcipriani: not your fault [18:21:07] something strange is happening [18:22:04] mafk: kk, just reverted on mw1099, want to make sure everything looks normal there? [18:22:34] Maybe roll it to production and be able to revert fastly if needed, maybe it's only because mw1099... [18:23:01] thcipriani: shorturl link still on sidebar [18:23:24] maybe the shorturl table can be nuked? [18:25:39] mafk: done [18:25:51] 06Operations, 10Monitoring, 06Release-Engineering-Team, 13Patch-For-Review, 07Wikimedia-Incident: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2649707 (10greg) This is really a follow-up item from a wikimedia incident. [18:26:11] /away [18:26:17] no, actually back :p [18:26:20] LOL [18:26:25] xD [18:26:40] thcipriani: still seing the short url link on sidebar at mw1099 [18:27:19] thcipriani: gone now [18:27:29] ctrl+shift+r solved it [18:27:34] mafk: heh, that's good. [18:27:48] thcipriani: shall I create a revert patch? 
[18:28:07] mafk: nah, I just need to push this up to gerrit and it'll be done [18:28:18] okay [18:28:38] anything else I'm required for? [18:28:38] thcipriani: Could my patch be processed as the last one? [18:29:00] (03PS1) 10Dzahn: admin: add samwalton9 to researchers [puppet] - 10https://gerrit.wikimedia.org/r/311480 (https://phabricator.wikimedia.org/T145788) [18:29:10] (03PS1) 10Thcipriani: Revert "Enable ShortURL on Wikimedia Bangladesh chapter wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311481 [18:29:30] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311481 (owner: 10Thcipriani) [18:29:53] Well, thanks all and sorry for the fuss. This errors were totally unexpected :( [18:29:56] (03Merged) 10jenkins-bot: Revert "Enable ShortURL on Wikimedia Bangladesh chapter wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311481 (owner: 10Thcipriani) [18:30:01] Urbanecm: you mean you want your patch to be last? [18:30:10] (03PS2) 10Dzahn: admin: add samwalton9 to researchers [puppet] - 10https://gerrit.wikimedia.org/r/311480 (https://phabricator.wikimedia.org/T145788) [18:30:21] /away [18:32:15] mode -t plz [18:32:52] !log emergency/unscheduled restart of mariadb @ labsdb1003 - close to OOM, unusable [18:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:33:33] (03PS2) 10Thcipriani: ORES default threshold to high for wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311229 (https://phabricator.wikimedia.org/T144784) (owner: 10Ladsgroup) [18:33:55] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311229 (https://phabricator.wikimedia.org/T144784) (owner: 10Ladsgroup) [18:34:22] (03Merged) 10jenkins-bot: ORES default threshold to high for wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311229 (https://phabricator.wikimedia.org/T144784) (owner: 10Ladsgroup) [18:35:03] Amir1: your patch is 
live on mw1099, check please [18:35:30] thanks [18:35:33] I check it now [18:36:54] I'm back... [18:37:34] (03CR) 10Lydia Pintscher: "> Lydia, for me it looks a bit strange to have conversions work for" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311206 (https://phabricator.wikimedia.org/T117032) (owner: 10Smalyshev) [18:38:20] Urbanecm: kk, haven't merged yours yet. Still checking previous patch. [18:38:27] (03CR) 10Lydia Pintscher: "> @Lydia Okay. Is it correct to have this deployed ASAP? I mean in" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308430 (https://phabricator.wikimedia.org/T144687) (owner: 10Urbanecm) [18:38:32] thcipriani: Okay. [18:38:40] BTW what kk means? [18:38:57] (03CR) 10Urbanecm: "Okay, thanks." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308430 (https://phabricator.wikimedia.org/T144687) (owner: 10Urbanecm) [18:39:03] 06Operations, 10Ops-Access-Requests: access request for debt on stat1003, stat1002, and fluorine - https://phabricator.wikimedia.org/T145914#2644948 (10Dzahn) Hi @debt Looking at your SSH key, i assume the "/" at the beginning of the lines are not part of the actual key but got in there via copy/paste from an... [18:39:34] Urbanecm: heh, "ok ok", I guess :) Someone mentioned that me saying, "ack" for "acknowledge" make them thing something bad was happening. [18:39:42] *made them think [18:40:15] I will make the SWAT-time adjustment to just: "ok" :) [18:41:15] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor memory limits for main process and subprocesses - https://phabricator.wikimedia.org/T145623#2649804 (10Gilles) @ori suggested looking into tmpreaper, which might be something useful in general to avoid ancient tmp files laying around,... [18:41:18] thcipriani: Okay. I guessed it means something like "Message was accepted" or ok but I didn't know from what it comes. Thanks for your explanation. You can use everything you want. 
[18:41:28] :D [18:41:52] thcipriani: It works as expected [18:42:06] Amir1: ok, going live everywhere. [18:42:25] heh [18:42:37] mutante: ^ [18:42:55] I guess +t got added to the mlock of this channel...a real op is going to have to change it. [18:43:02] (03PS1) 10Dzahn: admin: create shell account for Deborah Tankersley [puppet] - 10https://gerrit.wikimedia.org/r/311482 (https://phabricator.wikimedia.org/T145914) [18:43:38] grrrit-wm: you missed one [18:43:49] legoktm: oh, thanks! [18:43:56] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:311229|ORES default threshold to high for wikidatawiki (T144784)]] (duration: 00m 47s) [18:43:58] T144784: Change default threshold for Wikidata to high - https://phabricator.wikimedia.org/T144784 [18:43:58] ^ Amir1 live everywhere [18:43:59] mutante which one? [18:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:44:29] paladox: i take it back, it was lag on my part [18:44:34] got a bunch of updates at once [18:44:35] LOL, ok [18:44:40] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [18:44:47] ^that's me [18:44:54] (03PS2) 10Thcipriani: Throttle for RCL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311417 (https://phabricator.wikimedia.org/T145838) (owner: 10Urbanecm) [18:45:04] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311417 (https://phabricator.wikimedia.org/T145838) (owner: 10Urbanecm) [18:45:25] thcipriani: Tested in live too, works fine [18:45:28] thanks [18:45:31] (03Merged) 10jenkins-bot: Throttle for RCL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311417 (https://phabricator.wikimedia.org/T145838) (owner: 10Urbanecm) [18:45:37] Amir1: thank you for checking! 
Glad it's working :) [18:45:50] :) [18:46:19] Urbanecm: change is live on mw1099 if there's anything you'd like to check there [18:46:27] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: access request for debt on stat1003, stat1002, and fluorine for Deborah Tankersley - https://phabricator.wikimedia.org/T145914#2649848 (10Dzahn) [18:46:39] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [18:47:01] thcipriani: I don't know what should I test with throttling rules. Please deploy it everywhere. [18:47:10] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [18:47:27] Urbanecm: no explosions, I suppose. Deploying everywhere. [18:47:50] (03PS3) 10Smalyshev: Add config for units on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311206 (https://phabricator.wikimedia.org/T117032) [18:48:09] thcipriani: WMF's servers are far from me :D . So I can't see any explosions there :D [18:48:34] (03CR) 10Smalyshev: "OK, here's config only for units of length" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311206 (https://phabricator.wikimedia.org/T117032) (owner: 10Smalyshev) [18:48:42] But thanks a lot for your deployment thcipriani ! [18:48:47] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: access request for debt on stat1003, stat1002, and fluorine for Deborah Tankersley - https://phabricator.wikimedia.org/T145914#2649866 (10debt) Hi @Dzahn - I just checked the SSH key in my file and it does include many "/" and the double "//" on the la... 
[18:48:59] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4337403 keys - replication_delay is 0 [18:49:18] !log thcipriani@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:311417|Throttle for RCL (T145838)]] (duration: 00m 47s) [18:49:19] T145838: IP lift cap for Research Commons Librarian - 19/09/16 and 21/09/16 - https://phabricator.wikimedia.org/T145838 [18:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:49:27] Urbanecm: you're welcome ^ should be live everywhere now, FYI. Thanks for the patch :) [18:49:39] You're welcome too. [18:50:18] okie doke: SWAT is complete [18:52:32] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: access request for debt on stat1003, stat1002, and fluorine for Deborah Tankersley - https://phabricator.wikimedia.org/T145914#2649880 (10Dzahn) Hi @debt, thank you for checking. No, it's fine. Nevermind then, leave it as it is. I'll handle this reques... [18:58:27] A little question, why we won't be deploying non-emergency deploys next week? [18:59:29] It's ops's offsite next week. We won't have anyone with root around to fix things should deploys go catastrophically wrong. [19:00:02] "No non-emergency deploys. Operations offsite all week." ;) [19:01:45] thanks you both. [19:01:58] I misunderstood the notice about operations. [19:04:05] (03PS1) 10Dzahn: admin: create shell account for Melody Kramer [puppet] - 10https://gerrit.wikimedia.org/r/311485 (https://phabricator.wikimedia.org/T145387) [19:06:41] I should add "team" to that sentence fragment [19:07:20] (03PS1) 10Andrew Bogott: Pupept panel: Use "{}" to represent an empty hiera config [puppet] - 10https://gerrit.wikimedia.org/r/311486 (https://phabricator.wikimedia.org/T91990) [19:07:33] :) But thanks, now I understand. 
[19:12:25] (PS2) Andrew Bogott: Puppet panel: Use "{}" to represent an empty hiera config [puppet] - https://gerrit.wikimedia.org/r/311486 (https://phabricator.wikimedia.org/T91990) [19:12:27] (PS1) Andrew Bogott: Puppet Panel: Don't choke on a (user-provided) "" hiera string. [puppet] - https://gerrit.wikimedia.org/r/311488 (https://phabricator.wikimedia.org/T91990) [19:16:19] (CR) Andrew Bogott: [C: 2] Puppet panel: Use "{}" to represent an empty hiera config [puppet] - https://gerrit.wikimedia.org/r/311486 (https://phabricator.wikimedia.org/T91990) (owner: Andrew Bogott) [19:17:04] (PS2) Andrew Bogott: Puppet Panel: Don't choke on a (user-provided) "" hiera string. [puppet] - https://gerrit.wikimedia.org/r/311488 (https://phabricator.wikimedia.org/T91990) [19:18:39] (CR) Andrew Bogott: [C: 2] Puppet Panel: Don't choke on a (user-provided) "" hiera string. [puppet] - https://gerrit.wikimedia.org/r/311488 (https://phabricator.wikimedia.org/T91990) (owner: Andrew Bogott) [19:25:11] Operations, Labs, Research-and-Data-Backlog, hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2649945 (chasemp) hi @ggellerman thanks! I believe this has been specially budgeted for in Q2 of 2016 and should work within that budg... [19:39:52] Operations, Beta-Cluster-Infrastructure, HHVM, Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2649992 (chasemp) [19:43:00] for the people that do deployments, is #wikimedia-log-errors enough to do a heads up about possible ongoing issues?
[19:43:07] (PS10) Yuvipanda: labs: Add a standalone puppetmaster role [puppet] - https://gerrit.wikimedia.org/r/311163 (https://phabricator.wikimedia.org/T120159) [19:45:26] Operations, Release-Engineering-Team, HHVM, Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2650020 (thcipriani) >>! In T144578#2639586, @hashar wrote: > @mmodell @thcipriani @demon @dduvall can you check mira02 on beta is all fine ?... [19:45:56] (PS11) Yuvipanda: labs: Add a standalone puppetmaster role [puppet] - https://gerrit.wikimedia.org/r/311163 (https://phabricator.wikimedia.org/T120159) [19:46:02] (CR) Yuvipanda: [C: 2 V: 2] labs: Add a standalone puppetmaster role [puppet] - https://gerrit.wikimedia.org/r/311163 (https://phabricator.wikimedia.org/T120159) (owner: Yuvipanda) [19:48:31] (CR) Rush: "I don't have any issue w/ the idea of the patch, I'm not sure how helpful it will be though. I also haven't reviewed this in the context " [debs/nodepool] (patch-queue/debian) - https://gerrit.wikimedia.org/r/309406 (https://phabricator.wikimedia.org/T143943) (owner: Hashar) [19:50:30] PROBLEM - puppet last run on cp2011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnish] [20:00:04] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160919T2000). [20:00:25] no parsoid deploy today [20:08:10] !log restbase deploy 4829630f staging [20:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:09:38] i don't believe we have a mobileapps deployment planned /cc bearND [20:10:46] correct.
We'll probably do one on Wednesday [20:13:36] Operations, Mail: Delivery failed to eng-admin - https://phabricator.wikimedia.org/T145800#2641471 (Dzahn) @bbogaert reports he has deleted "yubikey" on their side. i went to dubnium to check logs, i still see the: Sep 19 19:12:49 dubnium slapd[24513]: syncrepl_message_to_entry: rid=001 mods check (yub... [20:14:42] !log reboot labstore1004 [20:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:16:40] RECOVERY - puppet last run on cp2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:18:45] (PS1) Andrew Bogott: labspuppetbackend: fix a bug with detecting project-wide prefixes [puppet] - https://gerrit.wikimedia.org/r/311500 [20:19:03] Operations, Mail: Delivery failed to eng-admin - https://phabricator.wikimedia.org/T145800#2650106 (Dzahn) It still failed at 20:12 Sep 19 20:12:50 dubnium slapd[24513]: syncrepl_message_to_entry: rid=001 mods check (YUBIKEY: attribute type undefined) but notice this difference: Sep 19 14:12:47 dubni... [20:21:23] (CR) Andrew Bogott: [C: 2] labspuppetbackend: fix a bug with detecting project-wide prefixes [puppet] - https://gerrit.wikimedia.org/r/311500 (owner: Andrew Bogott) [20:23:00] anomie: https://gerrit.wikimedia.org/r/#/c/311492/ [20:40:01] !log Removed today's Wikidata json dumps: All shards succeeded, but final dump composition apparently failed.
[20:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:40:08] aude: apergos: FYI ^ [20:49:46] Operations, Editing-Department, Monitoring, Release-Engineering-Team, Wikimedia-Incident: High failure rate of account creation should trigger an alarm / page people - https://phabricator.wikimedia.org/T146090#2650309 (hashar) [20:50:30] Operations, Monitoring, Release-Engineering-Team, Patch-For-Review, Wikimedia-Incident: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2481685 (hashar) Account creation got broken entirely for 18 hours last week despite metrics being available. I have... [20:54:33] (PS1) Hoo man: More error logging for dumpwikidata [puppet] - https://gerrit.wikimedia.org/r/311551 [21:00:05] dapatrick and bawolff: Respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160919T2100). Please do the needful. [21:01:09] ack [21:10:30] (PS1) Andrew Bogott: Puppet Panel: Use the right template for the remove-role dialog [puppet] - https://gerrit.wikimedia.org/r/311590 [21:13:30] (CR) Andrew Bogott: [C: 2] Puppet Panel: Use the right template for the remove-role dialog [puppet] - https://gerrit.wikimedia.org/r/311590 (owner: Andrew Bogott) [21:13:53] (PS2) Hoo man: More error logging/ sanity checks for dumpwikidata [puppet] - https://gerrit.wikimedia.org/r/311551 [21:15:37] (PS3) Hoo man: More error logging/ sanity checks for dumpwikidata [puppet] - https://gerrit.wikimedia.org/r/311551 [21:17:06] (CR) Hoo man: "Once this has been merged, I'll start another dump run." [puppet] - https://gerrit.wikimedia.org/r/311551 (owner: Hoo man) [21:18:30] (PS1) Ppchelko: RESTBase: Specify the topic for transclusions.
[puppet] - https://gerrit.wikimedia.org/r/311594 (https://phabricator.wikimedia.org/T145804) [21:28:42] OK, going to try getting train a bit closer to back on track since it looks like blockers have been resolved. [21:29:47] grrrit-wm: :(( [21:30:01] group0 to wmf.19 https://gerrit.wikimedia.org/r/#/c/311599/ [21:30:01] (Merged) jenkins-bot: group0 wikis to 1.28.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/311599 (owner: Thcipriani) [21:30:37] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 wikis to 1.28.0-wmf.19 [21:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:31:18] (PS1) BBlack: upload storage size class experiment [puppet] - https://gerrit.wikimedia.org/r/311600 (https://phabricator.wikimedia.org/T145661) [21:32:41] account creation still seems to work for group0 wikis, so that's good. [21:33:23] (CR) jenkins-bot: [V: -1] upload storage size class experiment [puppet] - https://gerrit.wikimedia.org/r/311600 (https://phabricator.wikimedia.org/T145661) (owner: BBlack) [21:35:07] (PS2) BBlack: upload storage size class experiment [puppet] - https://gerrit.wikimedia.org/r/311600 (https://phabricator.wikimedia.org/T145661) [21:35:42] PROBLEM - puppet last run on mw2235 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [21:36:54] (CR) jenkins-bot: [V: -1] upload storage size class experiment [puppet] - https://gerrit.wikimedia.org/r/311600 (https://phabricator.wikimedia.org/T145661) (owner: BBlack) [21:38:24] (PS3) BBlack: upload storage size class experiment [puppet] - https://gerrit.wikimedia.org/r/311600 (https://phabricator.wikimedia.org/T145661) [21:40:32] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] [21:41:53] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: Revert group0 wikis to 1.28.0-wmf.19 [21:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:43:36] different set of errors on wmf.19 than were reported during the deployments last week. Will file blockers. [21:45:24] thcipriani: ugh, thanks :/ [21:47:47] (PS4) BBlack: upload storage size class experiment [puppet] - https://gerrit.wikimedia.org/r/311600 (https://phabricator.wikimedia.org/T145661) [21:49:03] (PS1) Volans: Reimage: improve output in case of errors [puppet] - https://gerrit.wikimedia.org/r/311605 (https://phabricator.wikimedia.org/T143536) [21:49:19] Operations, Mail: Delivery failed to eng-admin - https://phabricator.wikimedia.org/T145800#2650574 (bbogaert) Update: The sync was getting stuck on person's whom had an ldap attribute. I added the attribute back, deleted the attribute from the people that had it, and then removed the attribute from the s... [21:49:41] wth: There are lots of machines that don't have cdb files for wmf.19. I wonder if that's the root cause of all the error blow-up? [21:50:12] ^ greg-g I'm going to try a full scap that just moves testwiki to rebuild cdbs for wmf.19 [21:50:20] ...if you're fine with that.
[21:50:38] :) yeah [21:50:44] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [21:50:51] could be, that top one being in Language.php [21:51:26] yeah, the second one explicitly saying: no localization for english [21:52:18] (CR) Volans: [C: 2] Reimage: improve output in case of errors [puppet] - https://gerrit.wikimedia.org/r/311605 (https://phabricator.wikimedia.org/T143536) (owner: Volans) [21:52:20] greg-g: thcipriani FYI, I opened https://phabricator.wikimedia.org/T146094 i am not sure if it is an issue, but might be related to wmf18 [21:53:28] (PS1) Thcipriani: Revert "group0 wikis to 1.28.0-wmf.19" [mediawiki-config] - https://gerrit.wikimedia.org/r/311606 [21:53:44] (CR) Thcipriani: [C: 2] Revert "group0 wikis to 1.28.0-wmf.19" [mediawiki-config] - https://gerrit.wikimedia.org/r/311606 (owner: Thcipriani) [21:54:10] (PS5) BBlack: upload storage size class experiment [puppet] - https://gerrit.wikimedia.org/r/311600 (https://phabricator.wikimedia.org/T145661) [21:54:12] (Merged) jenkins-bot: Revert "group0 wikis to 1.28.0-wmf.19" [mediawiki-config] - https://gerrit.wikimedia.org/r/311606 (owner: Thcipriani) [21:54:19] Operations, Wikimedia-Mailing-lists: Remove luis@lu.is from travel@ list - https://phabricator.wikimedia.org/T146095#2650583 (SMalikWMF) [21:54:25] (PS6) Yuvipanda: wmflib: Return default dir when role::puppet::self isn't used [puppet] - https://gerrit.wikimedia.org/r/307656 [21:56:06] !log thcipriani@tin Started scap: testwiki to php-1.28.0-wmf.19 and rebuild l10n cache [21:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:56:27] matanya: hrm that's odd. [21:56:35] PROBLEM - restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:56:52] thcipriani: that is what AaronSchulz said [21:56:56] (PS7) Yuvipanda: wmflib: Return default dir when role::puppet::self isn't used [puppet] - https://gerrit.wikimedia.org/r/307656 (https://phabricator.wikimedia.org/T120159) [21:57:19] (CR) Yuvipanda: [C: 2 V: 2] wmflib: Return default dir when role::puppet::self isn't used [puppet] - https://gerrit.wikimedia.org/r/307656 (https://phabricator.wikimedia.org/T120159) (owner: Yuvipanda) [21:59:21] Operations, Wikimedia-Mailing-lists: Remove luis@lu.is from travel@ list - https://phabricator.wikimedia.org/T146095#2650583 (Krenair) travel@lists.wikimedia.org? [22:00:51] !log cp1099: depooling varnish backends for storage size experimentation [22:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:01:16] RECOVERY - puppet last run on mw2235 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:02:21] (PS6) BBlack: upload storage size class experiment [puppet] - https://gerrit.wikimedia.org/r/311600 (https://phabricator.wikimedia.org/T145661) [22:02:30] (CR) BBlack: [C: 2 V: 2] upload storage size class experiment [puppet] - https://gerrit.wikimedia.org/r/311600 (https://phabricator.wikimedia.org/T145661) (owner: BBlack) [22:05:19] PROBLEM - Restbase root url on cerium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:08:43] thcipriani: also the latency grew: https://grafana.wikimedia.org/dashboard/db/navigation-timing-by-platform [22:09:40] 95% and 99% of first paint jumped [22:11:03] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp1099 is CRITICAL: Connection refused [22:12:28] thcipriani: there was - 19:02 logmsgbot: demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.18 [22:12:29] <_joe_> bblack: I guess this is you?
^ [22:12:35] just before that [22:13:37] _joe_: yes, it's depooled for experimentation above in SAL [22:13:51] (PS1) BBlack: upload storage: fix CL comparisons [puppet] - https://gerrit.wikimedia.org/r/311611 (https://phabricator.wikimedia.org/T145661) [22:14:34] Operations, Labs, Research-and-Data-Backlog, hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2650663 (ggellerman) @chasemp Hi! This was on the Research & Data workboard. Because it looks like Yuvi is doing the work, we moved i... [22:14:40] (CR) BBlack: [C: 2 V: 2] upload storage: fix CL comparisons [puppet] - https://gerrit.wikimedia.org/r/311611 (https://phabricator.wikimedia.org/T145661) (owner: BBlack) [22:15:30] _joe_ documenting the puppetmaster self alternative :) https://wikitech.wikimedia.org/wiki/Standalone_puppetmaster [22:15:32] needs more testing [22:15:37] uses the puppetmaster module [22:16:17] <_joe_> yuvipanda: oh I'll take a look [22:16:48] matanya: hrm, yeah looking at mediawiki load over the past 30 days shows a regression starting on the 8th as well https://grafana.wikimedia.org/dashboard/db/performance-metrics?panelId=7&fullscreen [22:17:18] PROBLEM - puppet last run on cp1099 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 42 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [22:17:28] yes thcipriani was just looking at that [22:17:55] (Abandoned) Yuvipanda: puppet: Use puppetmaster hiera variable directly [puppet] - https://gerrit.wikimedia.org/r/256625 (owner: Yuvipanda) [22:18:05] Operations, Mail: vfowler@wikimedia.org sending bounceback - https://phabricator.wikimedia.org/T146036#2650671 (Dzahn) [22:18:07] Operations, Mail: Delivery failed to eng-admin - https://phabricator.wikimedia.org/T145800#2650668 (Dzahn) Open>Resolved a:Dzahn confirmed. it's syncing now.
also mx1001 knows eng-admin@ root@mx1001:~# exim4 -bt eng-admin@wikimedia.org eng-admin@wikimedia.org router = ldap_group, transpor... [22:18:18] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp1099 is OK: HTTP OK: HTTP/1.1 200 OK - 177 bytes in 0.002 second response time [22:18:20] funny part, i felt things were a bit slower, but couldn't get any proof, cause only 600 ms [22:18:47] so went to grafana to confirm [22:19:02] (Abandoned) Yuvipanda: labs: Separate nfs_mode from nfs_opts [puppet] - https://gerrit.wikimedia.org/r/271920 (https://phabricator.wikimedia.org/T127561) (owner: Yuvipanda) [22:19:09] matanya: could you create a load-time regression task? [22:19:34] will probably want a place to discuss what this means for branching/deployments. [22:19:48] thcipriani: in addition to https://phabricator.wikimedia.org/T146094 ? [22:19:48] might add as a blocker for wmf.19 :(( [22:19:49] RECOVERY - puppet last run on cp1099 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:20:03] matanya: yes please [22:20:14] thcipriani: sorry for finding 3 blockers for .19 [22:20:28] lemme find that task [22:20:50] wmf.19 blockers: https://phabricator.wikimedia.org/T143328 [22:20:54] https://phabricator.wikimedia.org/T143328 [22:20:59] heh, yeah, that one. [22:21:03] matanya: don't apologize, it's what we do :) [22:21:33] yes, linking to that [22:21:43] indeed. I am going to let this scap finish to rebuild l10n for wmf.19, then sync wikiversions to get testwiki back on wmf.18. [22:22:23] :/ [22:23:16] greg-g: does that not sound like the right plan? [22:24:05] oh, don't take my ":/" as second guessing, just a general "ugh" emoticon [22:24:14] (PS1) Yuvipanda: labspuppetbackend: Set charset explicitly [puppet] - https://gerrit.wikimedia.org/r/311614 [22:24:18] maybe :| [22:24:33] heh, yes, agreed. It is an emotional :\ [22:24:54] greg-g: OK, what was today's rollback for?
[22:25:00] (PS1) BBlack: upload storage: no restart cron w/ experiment [puppet] - https://gerrit.wikimedia.org/r/311615 (https://phabricator.wikimedia.org/T145661) [22:25:49] AaronSchulz: I rolled back because lots of boxes were missing l10n for wmf.19 [22:26:06] I'm not clear why or how those cdb files were missing. [22:26:10] oh [22:26:28] I'm running a scap now for testwiki to rebuild l10n [22:26:41] (CR) BBlack: [C: 2] upload storage: no restart cron w/ experiment [puppet] - https://gerrit.wikimedia.org/r/311615 (https://phabricator.wikimedia.org/T145661) (owner: BBlack) [22:26:59] but it now it seems like there are some performance regressions first-paint and load time. [22:27:55] so I'm going to leave wmf.18 in place until there is some investigation of those issues. [22:29:19] heh, I only red the part until the first . to begin with and was surprised [22:31:01] !log cache_upload: pooling cp1099 (storage experiment - T145661) [22:31:02] T145661: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661 [22:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:32:39] (CR) Andrew Bogott: [C: 1] labspuppetbackend: Set charset explicitly [puppet] - https://gerrit.wikimedia.org/r/311614 (owner: Yuvipanda) [22:33:00] (PS2) Yuvipanda: labspuppetbackend: Set charset explicitly [puppet] - https://gerrit.wikimedia.org/r/311614 [22:33:05] (CR) Yuvipanda: [C: 2 V: 2] labspuppetbackend: Set charset explicitly [puppet] - https://gerrit.wikimedia.org/r/311614 (owner: Yuvipanda) [22:34:03] (PS1) Andrew Bogott: Puppet Panel: Remove the ?-in-circle [puppet] - https://gerrit.wikimedia.org/r/311618 [22:35:19] (CR) Alex Monk: [C: -1] Puppet Panel: Remove the ?-in-circle (1 comment) [puppet] - https://gerrit.wikimedia.org/r/311618 (owner: Andrew Bogott) [22:36:21] (CR) jenkins-bot: [V: -1] Puppet Panel: Remove the ?-in-circle
[puppet] - https://gerrit.wikimedia.org/r/311618 (owner: Andrew Bogott) [22:36:37] Operations, Editing-Department, Monitoring, Release-Engineering-Team, Wikimedia-Incident: High failure rate of account creation should trigger an alarm / page people - https://phabricator.wikimedia.org/T146090#2650309 (Tgr) We might want separate api and non-api metrics since they have differ... [22:36:54] (PS2) Andrew Bogott: Puppet Panel: Remove the ?-in-circle [puppet] - https://gerrit.wikimedia.org/r/311618 [22:38:35] (PS1) BBlack: upload frontend: limit at 256KB [puppet] - https://gerrit.wikimedia.org/r/311619 [22:39:43] (CR) BBlack: [C: 2] upload frontend: limit at 256KB [puppet] - https://gerrit.wikimedia.org/r/311619 (owner: BBlack) [22:40:24] Operations, Monitoring, Release-Engineering-Team, Patch-For-Review, Wikimedia-Incident: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2650747 (greg) [22:42:10] greg-g: i'd add RTL breakage to this ^ [22:43:55] Operations, Monitoring, Release-Engineering-Team, Wikimedia-Incident: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2650768 (greg) [22:46:49] (PS3) Andrew Bogott: Puppet Panel: Remove the ?-in-circle [puppet] - https://gerrit.wikimedia.org/r/311618 [22:48:07] Operations, Labs, Research-and-Data-Backlog, hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2650788 (DarTar) @chasemp @ggellerman yes, this is part of dedicated capex budget for FY16-17. If there are separate tickets where app...
[22:48:09] !log thcipriani@tin Finished scap: testwiki to php-1.28.0-wmf.19 and rebuild l10n cache (duration: 52m 03s) [22:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:49:50] (CR) BBlack: [C: -1] "Should depool the nginx service, too" [puppet] - https://gerrit.wikimedia.org/r/311387 (owner: Ema) [22:50:40] (CR) Andrew Bogott: [C: 2] Puppet Panel: Remove the ?-in-circle [puppet] - https://gerrit.wikimedia.org/r/311618 (owner: Andrew Bogott) [22:53:18] !log Deployed patch for T144573 to wmf18 and wmf19 [22:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:57:09] PROBLEM - Varnishkafka log producer on cp1048 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [22:58:12] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: Restore testwiki to 1.28.0-wmf.18 [22:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:59:52] RECOVERY - Varnishkafka log producer on cp1048 is OK: PROCS OK: 1 process with command name varnishkafka [23:00:04] RoanKattouw, ostriches, MaxSem, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160919T2300).
[23:06:12] (PS1) BBlack: varnish-backend-restart: increase sleep times [puppet] - https://gerrit.wikimedia.org/r/311623 [23:06:29] (CR) BBlack: [C: 2 V: 2] varnish-backend-restart: increase sleep times [puppet] - https://gerrit.wikimedia.org/r/311623 (owner: BBlack) [23:07:59] Operations, Mail: Add yubikey attribute to production ldap - https://phabricator.wikimedia.org/T146102#2650843 (bbogaert) [23:22:51] !log restart restbase in staging [23:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:26:26] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [23:27:26] RECOVERY - Restbase root url on cerium is OK: HTTP OK: HTTP/1.1 200 - 15293 bytes in 0.016 second response time [23:32:50] PROBLEM - Varnishkafka log producer on cp3047 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [23:45:15] RECOVERY - Varnishkafka log producer on cp3047 is OK: PROCS OK: 1 process with command name varnishkafka [23:46:20] Operations, Wikimedia-Mailing-lists: Remove luis@lu.is from travel@ list - https://phabricator.wikimedia.org/T146095#2650583 (Dzahn) I removed Luis. Successfully Removed: luis@lu.is For the future please note that travel@lists is run by Ellie and Doreen , not Operations Travel list run by eyoun... [23:46:39] Operations, Wikimedia-Mailing-lists: Remove luis@lu.is from travel@ list - https://phabricator.wikimedia.org/T146095#2651008 (Dzahn) Open>Resolved a:Dzahn [23:57:13] (PS1) BBlack: add VCL variable "uptime" [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/311632 [23:57:46] (CR) BBlack: "Completely untested, even for successful compilation!" [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/311632 (owner: BBlack)