[00:39:06] Krenair: Happening again... [00:41:46] https://www.irccloud.com/pastebin/AUpoVv6H/ [02:01:38] (03PS1) 10Yuvipanda: Figure out proper webservice type before emitting warnings [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/288875 (https://phabricator.wikimedia.org/T135349) [02:05:38] 06Operations, 10MediaWiki-Vagrant, 10Traffic: Make Varnish port configurable using hiera - https://phabricator.wikimedia.org/T124378#2296183 (10Mattflaschen) This is about MediaWiki-Vagrant. [02:22:09] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.1) (duration: 09m 44s) [02:30:49] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon May 16 02:30:49 UTC 2016 (duration 8m 40s) [03:21:53] so morebots died more than a day ago [04:39:06] 06Operations: Morebots be dead - https://phabricator.wikimedia.org/T135355#2296243 (10Peachey88) [05:17:09] (03Abandoned) 10Ladsgroup: wikilabels: enable CORS [puppet] - 10https://gerrit.wikimedia.org/r/287570 (owner: 10Ladsgroup) [06:16:43] PROBLEM - puppet last run on wtp2007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:25:27] (03CR) 10Giuseppe Lavagetto: [C: 031] "We should also add a --xml option then, and add the ability to run an xsl template (via --xsl) on the output for greater flexibility!!" 
[software/conftool] - 10https://gerrit.wikimedia.org/r/288632 (owner: 10Mobrovac) [06:25:34] <_joe_> mobrovac: ^^ [06:26:04] <_joe_> :P [06:26:55] ah _joe_, xml is so b2b and java-esque [06:28:00] <_joe_> mobrovac: I added xsl to the mix to raise the horror [06:28:12] haha [06:28:19] <_joe_> and I must correct you: xml is not truly java-esque unless you have at least 4 nested namespaces [06:28:35] indeed [06:28:43] <_joe_> atm I am working on making conftool use a json-defined schema btw [06:28:48] i often equate b2b2 and java [06:28:56] s/b2b2/b2b/ [06:29:11] <_joe_> java is enterprise [06:30:11] to ease the pain of writing schemas, you should definitely allow them to be written in yaml [06:30:52] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: puppet fail [06:31:43] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:43] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:44] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:22] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:22] PROBLEM - puppet last run on mw2043 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:23] PROBLEM - puppet last run on nobelium is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:02] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:32] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 3 failures [06:35:05] (03PS3) 10Mobrovac: Enable MathML rendering by default on test, wikidata and dewikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286180 (https://phabricator.wikimedia.org/T131177) (owner: 10Physikerwelt) [06:38:11] (03PS1) 10Elukey: Add a space after each memcached extra command line option to ensure proper settings. 
[puppet] - 10https://gerrit.wikimedia.org/r/288880 (https://phabricator.wikimedia.org/T129963) [06:40:53] RECOVERY - puppet last run on wtp2007 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:43:46] (03CR) 10Elukey: "Puppet compiler: https://puppet-compiler.wmflabs.org/2802/" [puppet] - 10https://gerrit.wikimedia.org/r/288880 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [06:49:36] (03CR) 10Giuseppe Lavagetto: [C: 032] Add a space after each memcached extra command line option to ensure proper settings. [puppet] - 10https://gerrit.wikimedia.org/r/288880 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [06:54:55] (03PS1) 10Jcrespo: Take db1027 out of production needed for unracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288882 (https://phabricator.wikimedia.org/T135253) [06:56:10] (03CR) 10Jcrespo: [C: 032] Take db1027 out of production needed for unracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288882 (https://phabricator.wikimedia.org/T135253) (owner: 10Jcrespo) [06:56:14] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:56:14] RECOVERY - puppet last run on mw2043 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:56:14] RECOVERY - puppet last run on nobelium is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:56:28] (03CR) 10Santhosh: [C: 031] "Package name is correct" [puppet] - 10https://gerrit.wikimedia.org/r/287181 (https://phabricator.wikimedia.org/T33950) (owner: 10Muehlenhoff) [06:56:52] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:29] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Take db1027 out of production (duration: 00m 26s) [06:57:34] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago 
with 0 failures [06:57:42] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:43] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:32] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:58:53] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:07:11] <_joe_> !log restarted hhvm on mw1161, stuck in HPHP::Treadmill::getAgeOldestRequest [07:07:22] RECOVERY - HHVM jobrunner on mw1161 is OK: HTTP OK: HTTP/1.1 200 OK - 222 bytes in 0.019 second response time [07:25:24] (03CR) 10Mobrovac: [C: 031] Add a new AQS testing environment to play with Cassandra settings before production. [puppet] - 10https://gerrit.wikimedia.org/r/288373 (https://phabricator.wikimedia.org/T124314) (owner: 10Elukey) [07:25:47] mobrovac: thanks! 
--^ [07:25:57] np [07:26:26] also thanks _joe_ for the review, will merge in a bit with puppet disabled just in case [07:32:13] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures [07:36:42] (03PS1) 10Mobrovac: Change prop: Activate back the service [puppet] - 10https://gerrit.wikimedia.org/r/288883 [07:37:47] elukey: you should also run the puppet compiler on aqs1001 to make sure no changes will happen there [07:38:00] (even though it's clear from the script there won't be any changes) [07:38:07] s/script/patch/ [07:48:41] mobrovac: yep yep sure, I ran it before the last change and it looked good but before merging I'll double check again [07:56:24] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [08:01:38] !log puppet disabled on mc* hosts as prep step for https://gerrit.wikimedia.org/r/#/c/288880 (precaution, not really needed) [08:01:54] (03CR) 10Gehel: [C: 031] add gehel to wdqs icinga contact group [puppet] - 10https://gerrit.wikimedia.org/r/288735 (owner: 10Dzahn) [08:02:52] (03CR) 10Elukey: [C: 032] Add a space after each memcached extra command line option to ensure proper settings. [puppet] - 10https://gerrit.wikimedia.org/r/288880 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [08:08:50] !log restarted memcached on mc1007 to ensure that https://gerrit.wikimedia.org/r/288880 was applied and working correctly. Will not do the same thing with the other mc hosts. [08:09:41] all right now I can see "STAT slab_reassign yes" on mc1007, gooood [08:11:03] ok proceeding with the other mc hosts, all good [08:12:49] (03CR) 10KartikMistry: [C: 031] Add fonts-smc (Malayalam) to image/video scalers [puppet] - 10https://gerrit.wikimedia.org/r/287181 (https://phabricator.wikimedia.org/T33950) (owner: 10Muehlenhoff) [08:22:36] (03PS1) 10Elukey: Re-enable refresh on unit file change for memcached. 
[puppet] - 10https://gerrit.wikimedia.org/r/288886 (https://phabricator.wikimedia.org/T129963) [08:26:29] (03CR) 10Elukey: "Puppet compiler https://puppet-compiler.wmflabs.org/2803/" [puppet] - 10https://gerrit.wikimedia.org/r/288886 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [08:29:30] !log puppet disabled on mc* hosts as prep step for https://gerrit.wikimedia.org/r/#/c/288886 (precaution, not really needed) [08:30:34] (03CR) 10Elukey: [C: 032] Re-enable refresh on unit file change for memcached. [puppet] - 10https://gerrit.wikimedia.org/r/288886 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [08:32:17] (03CR) 10Merlijn van Deen: [C: 032] Figure out proper webservice type before emitting warnings [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/288875 (https://phabricator.wikimedia.org/T135349) (owner: 10Yuvipanda) [08:33:16] (03Merged) 10jenkins-bot: Figure out proper webservice type before emitting warnings [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/288875 (https://phabricator.wikimedia.org/T135349) (owner: 10Yuvipanda) [08:40:58] (03PS1) 10Mobrovac: Graphoid: Do not use the proxy for the allowed domains [puppet] - 10https://gerrit.wikimedia.org/r/288890 (https://phabricator.wikimedia.org/T134241) [08:44:54] (03PS2) 10Volans: MariaDB: update submodule reference [puppet] - 10https://gerrit.wikimedia.org/r/288361 (https://phabricator.wikimedia.org/T133780) [08:45:19] !log memcached on mc101[0123] got restarted because puppet did run gerrit/288880 and gerrit/288886 at the same time (operators fault of course) [08:46:10] _joe_ --^ sorry, my bad, I didn't run puppet on all the hosts and didn't think to wait long enough between merges. [08:46:42] also I just realized that my prev messages didn't get logged [08:47:08] <_joe_> man... 
start with codfw [08:47:19] <_joe_> oh I see [08:47:26] <_joe_> ok, happens :) [08:47:39] yeah totally my fault, and I took all the precautions not to happen [08:47:40] <_joe_> we lost just 20% of our caching [08:48:00] moreover it might also be the same for other 4 hosts, 101[4567], I am checking [08:48:14] I need to think 3 times before doing these things [08:48:29] maybe I can do them staggered throught the day [08:48:39] !log Temporary disabling Puppet on db*, es*, pc*, labsdb*, labservices*, silver, holmium to merge change 288361 T133780 [08:49:03] no log bot?!?!? [08:49:08] I was about to say the same [08:51:14] I'll take a look at morebots [08:51:20] godog: o/ [08:51:43] (03PS3) 10Alexandros Kosiaris: Include 10.196.* in site 'codfw' [puppet] - 10https://gerrit.wikimedia.org/r/288767 (owner: 10Andrew Bogott) [08:51:48] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Include 10.196.* in site 'codfw' [puppet] - 10https://gerrit.wikimedia.org/r/288767 (owner: 10Andrew Bogott) [08:52:51] 06Operations, 10DBA: High replication lag to dewiki - https://phabricator.wikimedia.org/T135100#2296469 (10jcrespo) 05duplicate>03Open [08:53:50] 06Operations, 10DBA: Install, configure and provision recently arrived db core machines - https://phabricator.wikimedia.org/T133398#2296471 (10jcrespo) [08:53:52] 06Operations, 10DBA: High replication lag to dewiki - https://phabricator.wikimedia.org/T135100#2288166 (10jcrespo) [08:56:39] (03CR) 10Mobrovac: "The PCC is happy - https://puppet-compiler.wmflabs.org/2804/" [puppet] - 10https://gerrit.wikimedia.org/r/288890 (https://phabricator.wikimedia.org/T134241) (owner: 10Mobrovac) [08:58:17] (03PS2) 10Alexandros Kosiaris: ores: Use the new service::uwsgi define [puppet] - 10https://gerrit.wikimedia.org/r/288618 [08:58:19] (03PS3) 10Alexandros Kosiaris: Introduce service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/288613 [09:00:06] mobrovac: hi [09:00:12] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave 
IO: Yes Slave SQL: Yes Seconds Behind Master: 685 [09:00:14] no.. sorry that was morebots [09:00:20] morebots: hi [09:00:21] I am a logbot running on tools-exec-1220. [09:00:21] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [09:00:21] To log a message, type !log . [09:00:26] lol [09:00:46] !log Temporary disabling Puppet on db*, es*, pc*, labsdb*, labservices*, silver, holmium to merge change 288361 T133780 [09:00:51] let's see now :D [09:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:01:08] yup, seems to be working [09:01:16] no idea how to backfill https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:18] yep! [09:01:20] thanks a lot [09:01:27] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2296474 (10elukey) I made a mistake and didn't leave enough time between 288880 and 288886 for the mc101* hosts on which I didn't run puppet manually, and some... [09:01:34] np, I don't see stashbot either [09:04:03] I've bounced stashbot, at some point it might rejoin [09:05:45] godog: ceasiest way to backfil is just manually copy irc logs onto the page [09:07:12] 06Operations, 10Graphoid, 06Services, 13Patch-For-Review, 15User-mobrovac: graphoid should not use the http proxy to connect to the mediawiki api and other internal services - https://phabricator.wikimedia.org/T134241#2296477 (10mobrovac) [Gerrit 288890](https://gerrit.wikimedia.org/r/288890) tells Graph... [09:09:04] (03PS3) 10Volans: MariaDB: update submodule reference [puppet] - 10https://gerrit.wikimedia.org/r/288361 (https://phabricator.wikimedia.org/T133780) [09:09:54] 06Operations, 10DBA: High replication lag to dewiki - https://phabricator.wikimedia.org/T135100#2296478 (10doctaxon) Hi! Thanks to jcrespo for reopening: May 15th: 17:50 to 20:30 UTC host db1026 on dewiki was very laggy again. 
@hoo wrote in IRC, that it's because huge amounts of edits on wikidata and he proba... [09:10:12] RECOVERY - check_mysql on lutetium is OK: Uptime: 475398 Threads: 1 Questions: 7894256 Slow queries: 12166 Opens: 63869 Flush tables: 2 Open tables: 64 Queries per second avg: 16.605 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [09:10:13] (03PS2) 10Mobrovac: Graphoid: Do not use the proxy for the allowed domains [puppet] - 10https://gerrit.wikimedia.org/r/288890 (https://phabricator.wikimedia.org/T134241) [09:10:42] (03CR) 10Volans: [C: 032] MariaDB: update submodule reference [puppet] - 10https://gerrit.wikimedia.org/r/288361 (https://phabricator.wikimedia.org/T133780) (owner: 10Volans) [09:12:53] I got a strange error after puppet-merge on the strontium side... git says that the identity is not set... investigating [09:13:13] RECOVERY - configured eth on rutherfordium is OK: OK - interfaces up [09:13:13] RECOVERY - DPKG on rutherfordium is OK: All packages OK [09:14:33] RECOVERY - RAID on rutherfordium is OK: OK: no RAID installed [09:14:33] RECOVERY - dhclient process on rutherfordium is OK: PROCS OK: 0 processes with command name dhclient [09:14:53] RECOVERY - HTTP-peopleweb on rutherfordium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 520 bytes in 0.018 second response time [09:14:53] RECOVERY - Disk space on rutherfordium is OK: DISK OK [09:14:54] RECOVERY - Check size of conntrack table on rutherfordium is OK: OK: nf_conntrack is 0 % full [09:14:54] RECOVERY - salt-minion processes on rutherfordium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:15:03] RECOVERY - NTP on rutherfordium is OK: NTP OK: Offset 0.0122936964 secs [09:15:18] wait, did I do that^ [09:15:40] I only did a "console" [09:15:51] same issue than usual? 
[09:16:32] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0) [09:17:24] iirc yeah, sometimes it gets fixed by doing console [09:18:28] 06Operations, 10Traffic, 07HTTPS: Secure connection failed when attempting to send POST request - https://phabricator.wikimedia.org/T134869#2296486 (10Danny_B) [09:19:05] 06Operations, 10Traffic, 07HTTPS: Secure connection failed when attempting to send POST request - https://phabricator.wikimedia.org/T134869#2280304 (10Danny_B) Submitting Special:Emailuser hit the same issue as well -> title change. [09:19:42] p858snake|L_: indeed, I'll do that, thanks! [09:20:44] {{done}} [09:20:54] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [09:20:54] godog: do you know how to fix strontium by any chance? [09:21:22] I'm not sure about the config there for the submodules [09:21:52] volans: mh I usually "fix" it with "sudo -u gitpuppet ssh strontium.eqiad.wmnet" from palladium whenever puppet-merge fails [09:22:15] basically hello_it_off_and_on_again.gif [09:23:03] great, seemed to work, while my usual workaround didn't, so yours is better :) [09:23:06] thanks a lot [09:23:45] godog, can I add that to the puppet page? [09:24:16] (03PS1) 10Filippo Giunchedi: cassandra: add restbase2008-b [puppet] - 10https://gerrit.wikimedia.org/r/288894 (https://phabricator.wikimedia.org/T132976) [09:24:27] the actual question is, volans, what are the symptoms? 
[09:24:32] for sure jynus [09:25:50] depends jynus, usually fails with this: T128895 today was different, complained about a missing identity (git config user.email/name) [09:25:50] T128895: Randomly failing puppetmaster sync to strontium - https://phabricator.wikimedia.org/T128895 [09:27:38] ah, conflic with filipo [09:27:42] (03PS2) 10Filippo Giunchedi: cassandra: add restbase2008-b [puppet] - 10https://gerrit.wikimedia.org/r/288894 (https://phabricator.wikimedia.org/T132976) [09:27:50] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase2008-b [puppet] - 10https://gerrit.wikimedia.org/r/288894 (https://phabricator.wikimedia.org/T132976) (owner: 10Filippo Giunchedi) [09:28:23] ehhe wiki sniping [09:31:09] !log restarted mysql on db2034 to test merged change 288361 , T133780 [09:31:10] T133780: Multiple Puppet class make MySQL load /etc/my.cnf twice - https://phabricator.wikimedia.org/T133780 [09:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:32:18] 06Operations: rutherfordium (powers people.wikimedia.org) flapping in channel - https://phabricator.wikimedia.org/T135330#2296514 (10Peachey88) 05Open>03Resolved a:03jcrespo ``` RECOVERY - configured eth on rutherfordium is OK: OK - interfaces up RECOVERY - DPKG on rutherfordium is... [09:32:21] 06Operations: Morebots be dead - https://phabricator.wikimedia.org/T135355#2296517 (10Peachey88) 05Open>03Resolved a:03fgiunchedi ``` no log bot?!?!? 
I was about to say the same I'll take a look at morebots godog: o/ * morebots (~morebots@208.80.155.255) has joined !log Slowly re-enabling Puppet on db*, es*, pc*, labsdb*, labservices*, silver, holmium after merged change 288361 T133780 [09:37:58] T133780: Multiple Puppet class make MySQL load /etc/my.cnf twice - https://phabricator.wikimedia.org/T133780 [09:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:42:06] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Inline comments" (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 1020after4) [09:49:36] (03PS1) 10Jcrespo: Set oathauth_users as a private table [puppet] - 10https://gerrit.wikimedia.org/r/288897 (https://phabricator.wikimedia.org/T130700) [09:51:51] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2296612 (10elukey) Tried to re-install Debian on aqs1006 and I was able to boot correctly, but indeed the receipe is not doing what I need: ``` root@aqs1006:~# cat /... 
[09:54:15] (03CR) 10Alexandros Kosiaris: [C: 032] Graphoid: Do not use the proxy for the allowed domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/288890 (https://phabricator.wikimedia.org/T134241) (owner: 10Mobrovac) [09:54:21] (03PS3) 10Alexandros Kosiaris: Graphoid: Do not use the proxy for the allowed domains [puppet] - 10https://gerrit.wikimedia.org/r/288890 (https://phabricator.wikimedia.org/T134241) (owner: 10Mobrovac) [09:54:28] (03CR) 10Alexandros Kosiaris: [V: 032] Graphoid: Do not use the proxy for the allowed domains [puppet] - 10https://gerrit.wikimedia.org/r/288890 (https://phabricator.wikimedia.org/T134241) (owner: 10Mobrovac) [09:56:04] (03PS2) 10Jcrespo: Set oathauth_users as a private table [puppet] - 10https://gerrit.wikimedia.org/r/288897 (https://phabricator.wikimedia.org/T130700) [10:00:33] 06Operations, 06Services, 13Patch-For-Review, 07RESTBase-architecture: Separate /var on restbase - https://phabricator.wikimedia.org/T113714#2296627 (10fgiunchedi) supposedly just moving cassandra's data directory to a different path and use `cassandra.replace_address` option should just work to effective... [10:05:58] (03CR) 10Jcrespo: [C: 032] Set oathauth_users as a private table [puppet] - 10https://gerrit.wikimedia.org/r/288897 (https://phabricator.wikimedia.org/T130700) (owner: 10Jcrespo) [10:08:14] !log applying puppet and restarting sanitarium instances to apply new filter (temporary labs lag) [10:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:08:44] PROBLEM - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is CRITICAL: Connection refused [10:09:08] volans, ok with me to enable puppet on db1069 so I can test your and my change at the same time? [10:09:27] jynus: sure go ahead, I've already re-enabled all codfw and monitoring, all looks good so far [10:11:56] /etc/mysql/my.cnf: No such file or directory [10:12:07] it disappeared! [10:12:48] yep! 
as expected, /etc/my.cnf is the one we keft [10:12:50] *left [10:16:43] s4 and s6 replication, stuck [10:16:49] on sanitarium [10:17:23] have I ever said how much I hate toku? [10:18:07] I think so, yeah :D [10:18:35] it worked in the end, it only took 4 minutes to run stop slave [10:21:49] 06Operations, 10Graphoid, 06Services, 13Patch-For-Review, 15User-mobrovac: graphoid should not use the http proxy to connect to the mediawiki api and other internal services - https://phabricator.wikimedia.org/T134241#2296686 (10fgiunchedi) >>! In T134241#2292678, @mobrovac wrote: >>>! In T134241#2292578... [10:26:53] volans, so that you do not have to fight with me every single time, db1069 init script has a bug in which it uses relative paths (or it does not do a cd), remember to cd to /opt/wmf-mariadb10 [10:27:10] *fight with it like me [10:27:42] ok, good to know! let me see if we have a sanitarium page where to put it [10:27:53] actually, I will try to solve it now [10:28:00] instead of documenting it [10:28:48] I am fighting with another bug (lack of innodb_buffer_pool_load) [10:36:12] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2296700 (10elukey) Had a chat with @Volans and after seeing what fdisk shows it the partman recipe looks wrong. Each disk has the following layout: ``` Device Bo... [10:37:19] (03PS1) 10Jcrespo: Fix paths for mysql init file at sanitarium [puppet] - 10https://gerrit.wikimedia.org/r/288908 [10:37:36] (03PS2) 10Jcrespo: Fix paths for mysql init file at sanitarium [puppet] - 10https://gerrit.wikimedia.org/r/288908 [10:38:06] ACKNOWLEDGEMENT - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is CRITICAL: Connection refused Filippo Giunchedi bootstrap [10:39:13] ha! look at the comment: "Manually specifying --defaults-file because mysqld_multi will detect our /etc/mysql/my.cnf symlink as a second cnf and try to start each instance twice." 
[10:41:24] (03PS3) 10Jcrespo: Fix paths for mysql init file at sanitarium [puppet] - 10https://gerrit.wikimedia.org/r/288908 [10:43:22] rotfl [10:43:28] so someone knew! [10:44:52] (03CR) 10Volans: [C: 04-1] "See inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/288908 (owner: 10Jcrespo) [10:45:19] 06Operations, 10DBA: High replication lag to dewiki - https://phabricator.wikimedia.org/T135100#2296703 (10hoo) The user who caused the db lag yesterday is now using the API's maxlag parameter, so I hope this is no longer a problem. [10:46:56] (03PS4) 10Jcrespo: Fix paths for mysql init file at sanitarium [puppet] - 10https://gerrit.wikimedia.org/r/288908 [10:49:14] (03CR) 10Volans: Fix paths for mysql init file at sanitarium (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/288908 (owner: 10Jcrespo) [10:49:32] 06Operations, 10DBA: High replication lag to dewiki - https://phabricator.wikimedia.org/T135100#2296706 (10jcrespo) If immediate issues are not likely to happen, I would wait for the new servers to be set up and then reevaluate. If wikidata is very prone to bot edits/imports/etc., it may make sense to separate... 
[10:52:52] (03PS5) 10Jcrespo: Fix paths for mysql init file at sanitarium [puppet] - 10https://gerrit.wikimedia.org/r/288908 [11:00:19] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/288908 (owner: 10Jcrespo) [11:01:11] (03CR) 10Jcrespo: [C: 032] Fix paths for mysql init file at sanitarium [puppet] - 10https://gerrit.wikimedia.org/r/288908 (owner: 10Jcrespo) [11:04:58] 06Operations, 10Citoid, 10Graphoid, 06Services, and 3 others: SCB services should not use a proxy for our domains - https://phabricator.wikimedia.org/T97530#2296738 (10mobrovac) [11:05:03] 06Operations, 10Graphoid, 06Services, 13Patch-For-Review, 15User-mobrovac: graphoid should not use the http proxy to connect to the mediawiki api and other internal services - https://phabricator.wikimedia.org/T134241#2296737 (10mobrovac) [11:05:48] volans, it works, but I am not sure if printing the dir is something I want [11:07:32] jynus: > /dev/null or back to cd if you prefer, it's printed because it's usually confusing if a script change dir, in particular if then it doesn't return to the original one (not in this case) [11:08:08] I will leave it as is [11:08:23] mysqld_multi shouldn't require cd in the first place [11:08:53] ofc [11:09:05] 06Operations, 10Graphoid, 06Services, 13Patch-For-Review, 15User-mobrovac: graphoid should not use the http proxy to connect to the mediawiki api and other internal services - https://phabricator.wikimedia.org/T134241#2296753 (10mobrovac) 05Open>03Resolved a:03mobrovac Removed T97530 as a blocker a... 
[11:10:52] and AFAICS /usr/local/bin/mysqld_multi -> /opt/wmf-mariadb10/bin/mysqld_multi unpuppetized [11:11:05] I will directly delete it now [11:14:04] "Removes unnedded defaults-file" I do not know who this jcrespo is, but I cannot understand him [11:15:43] :-) [11:17:30] sudo salt 'db*' cmd.run 'ls -la /etc/mysql/my.cnf 2> /dev/null' [11:17:38] wrong window :D [11:20:05] !log testing create table on s7-master T130700 [11:20:06] T130700: Create central OATHAuth table for CentralAuth wikis - https://phabricator.wikimedia.org/T130700 [11:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:23:43] jynus: do you need puppet to be active and run on all s7 for this? [11:32:43] 06Operations, 10ops-codfw: ms-be2007.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T133517#2296769 (10faidon) [11:32:45] 06Operations, 10ops-codfw: ms-be2008.codfw.wmnet: slot=1 dev=sdl failed - https://phabricator.wikimedia.org/T131147#2296770 (10faidon) [11:42:37] 06Operations, 06Discovery, 10Maps, 10hardware-requests: Maps back end hardware - https://phabricator.wikimedia.org/T131180#2296791 (10faidon) [11:42:42] 06Operations, 06Discovery, 10Maps, 10hardware-requests: Maps back end hardware - https://phabricator.wikimedia.org/T131180#2158431 (10faidon) [11:47:28] 06Operations, 10MediaWiki-General-or-Unknown: Special pages on cswiki have not received updates for 3 days - https://phabricator.wikimedia.org/T135326#2296802 (10Joe) p:05Triage>03High [11:54:22] 06Operations, 10DBA: High replication lag to dewiki - https://phabricator.wikimedia.org/T135100#2296822 (10Joe) 05Open>03stalled p:05Triage>03High [11:59:13] 06Operations, 10Traffic, 13Patch-For-Review: varnish.clients graphite metric spammed with jsessionid - https://phabricator.wikimedia.org/T135227#2296831 (10Joe) @BBlack this is fixed and resolved, I assume; assigning to you for now. 
[11:59:28] 06Operations, 10Traffic, 13Patch-For-Review: varnish.clients graphite metric spammed with jsessionid - https://phabricator.wikimedia.org/T135227#2296834 (10Joe) p:05Triage>03Normal a:03BBlack [12:00:15] 06Operations, 10Ops-Access-Requests, 10fundraising-tech-ops: Frack (boron and bismuth) access for Darian Patrick - https://phabricator.wikimedia.org/T135165#2296836 (10Joe) [12:04:01] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Allow RelEng nova log access - https://phabricator.wikimedia.org/T133992#2251644 (10Joe) I think the people that need to access nova logs should be the same that are admins for contint. So I think we... [12:04:13] 06Operations, 06Project-Admins: Create #IRCecho project - https://phabricator.wikimedia.org/T134961#2296864 (10Danny_B) >>! In T134961#2295574, @Aklapper wrote: > Agreeing but should discuss name and description with its maintainers, e.g. one idea is to merge ircecho with tcpircbot... I've added #operations w... [12:14:45] 06Operations, 10Traffic, 13Patch-For-Review: varnish.clients graphite metric spammed with jsessionid - https://phabricator.wikimedia.org/T135227#2296903 (10BBlack) a:05BBlack>03fgiunchedi The fix is implemented, but I'm not sure if @fgiunchedi has finished cleanup of junk data in graphite itself. [12:16:10] (03PS1) 10Elukey: Revised aqs-cassandra-8ssd-2srv.cfg partman recipe. [puppet] - 10https://gerrit.wikimedia.org/r/288921 (https://phabricator.wikimedia.org/T133785) [12:19:14] (03CR) 10Elukey: [C: 032] Revised aqs-cassandra-8ssd-2srv.cfg partman recipe. 
[puppet] - 10https://gerrit.wikimedia.org/r/288921 (https://phabricator.wikimedia.org/T133785) (owner: 10Elukey) [12:41:09] 06Operations, 10MediaWiki-General-or-Unknown: Special pages on cswiki have not received updates for 3 days - https://phabricator.wikimedia.org/T135326#2296953 (10Dvorapa) [12:42:04] 06Operations, 10MediaWiki-General-or-Unknown: Special pages on cswiki have not received updates for 3 days - https://phabricator.wikimedia.org/T135326#2295392 (10Dvorapa) [12:44:43] 06Operations, 10MediaWiki-General-or-Unknown: Special pages on cswiki have not received updates for 3 days - https://phabricator.wikimedia.org/T135326#2295392 (10Bawolff) Doesn't the cron job only run every 3 days? [13:00:11] 06Operations, 10Ops-Access-Requests: Grant root access for user madhuvishy for servers notebook1001 and 1002 - https://phabricator.wikimedia.org/T134716#2274165 (10chasemp) >>! In T134716#2279676, @yuvipanda wrote: > Thanks @MoritzMuehlenhoff. I'll add a stub role later today. @yuvipanda ping, may want to kno... [13:07:45] (03PS1) 10BBlack: cache_upload: hack around a network load problem... [puppet] - 10https://gerrit.wikimedia.org/r/288926 [13:09:30] (03CR) 10BBlack: [C: 032] cache_upload: hack around a network load problem... [puppet] - 10https://gerrit.wikimedia.org/r/288926 (owner: 10BBlack) [13:13:10] 06Operations, 10MediaWiki-ResourceLoader, 06Performance-Team, 10Traffic: Image urls in CSS remain cached with old $wgResourceBasePath - https://phabricator.wikimedia.org/T134368#2297003 (10Krinkle) [13:17:03] elukey: mutante: https://phabricator.wikimedia.org/T132896 [13:17:17] (03PS1) 10Rush: admin: general service cluster group [puppet] - 10https://gerrit.wikimedia.org/r/288928 (https://phabricator.wikimedia.org/T134251) [13:17:19] (03PS1) 10Rush: admin: add sc-admins to sc(a|b) hosts [puppet] - 10https://gerrit.wikimedia.org/r/288929 (https://phabricator.wikimedia.org/T134251) [13:18:23] Krinkle: will double check thanks! 
[13:19:53] (03CR) 10Volans: [C: 032] Remove temporary certificate with both CAs [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/288419 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [13:27:19] (03PS1) 10Elukey: Fixed aqs-cassandra-8ssd-2srv.cfg partman config. [puppet] - 10https://gerrit.wikimedia.org/r/288930 (https://phabricator.wikimedia.org/T133785) [13:27:45] (03CR) 10Elukey: [C: 032 V: 032] Fixed aqs-cassandra-8ssd-2srv.cfg partman config. [puppet] - 10https://gerrit.wikimedia.org/r/288930 (https://phabricator.wikimedia.org/T133785) (owner: 10Elukey) [13:29:05] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:29:05] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:31:06] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:31:06] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:32:06] (03PS2) 10Volans: MariaDB: remove special SSL option multiple-ca [puppet] - 10https://gerrit.wikimedia.org/r/288420 (https://phabricator.wikimedia.org/T111654) [13:33:02] is that all of citoid going down? mobrovac?^ [13:33:05] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [13:33:05] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [13:33:15] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [13:33:16] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [13:33:21] nrpe again [13:33:23] *sigh* [13:33:44] what the hell is up with those checks lately? 
[13:34:00] citoid's working just fine, fwiw [13:34:34] ok thanks for checking [13:34:44] <_joe_> mobrovac: I know, it's too transient to properly investigate [13:35:03] kk [13:36:08] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2297061 (10elukey) New recipe: - RAID10 between 8 disks, 10GB partitions (~40GB in total) - RAID0 between 4 disks, 5.7TB total - RAID0 between 4 disks, 5.7 TB total... [13:36:27] (03PS4) 10Andrew Bogott: Glance: Fix the glance image backup cron [puppet] - 10https://gerrit.wikimedia.org/r/288621 [13:38:51] (03CR) 10Andrew Bogott: [C: 032] Glance: Fix the glance image backup cron [puppet] - 10https://gerrit.wikimedia.org/r/288621 (owner: 10Andrew Bogott) [13:40:39] 06Operations, 10Traffic: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384#2297076 (10BBlack) [13:40:41] 06Operations, 07Graphite: investigate carbon-c-relay stalls/drops towards graphite2002 - https://phabricator.wikimedia.org/T135385#2297089 (10fgiunchedi) [13:43:18] (03PS1) 10BBlack: cache_upload: raise FE mem from 1/12 to 1/4 total [puppet] - 10https://gerrit.wikimedia.org/r/288935 (https://phabricator.wikimedia.org/T135384) [13:43:20] (03PS1) 10BBlack: cache_text: raise FE mem from 1/8 to 1/4 total [puppet] - 10https://gerrit.wikimedia.org/r/288936 (https://phabricator.wikimedia.org/T135384) [13:44:15] 06Operations, 10Traffic, 13Patch-For-Review: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384#2297112 (10BBlack) p:05Triage>03Normal [13:48:00] (03PS1) 10Andrew Bogott: Explicitly set $hiera_config to 'labtest' in realm labtest. [puppet] - 10https://gerrit.wikimedia.org/r/288937 [13:48:19] 06Operations, 06Project-Admins, 10Traffic: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2283697 (10Krinkle) Agreed. There is no added value from a tag here. 
If there is an issue with HTTP2, it should be filed under the relevant component. Usually Traffic or Operations. If the filer is unsur... [13:49:32] (03PS2) 10Andrew Bogott: Explicitly set $hiera_config to 'labtest' in realm labtest. [puppet] - 10https://gerrit.wikimedia.org/r/288937 [13:51:05] FYI if you are seeing some UNKNOWNs for graphite metrics in icinga it might be related to https://phabricator.wikimedia.org/T135385 [13:52:53] 06Operations, 10MediaWiki-General-or-Unknown: Special pages on cswiki have not received updates for 3 days - https://phabricator.wikimedia.org/T135326#2297118 (10Dvorapa) I don't know, but right now it is 4.5 dne [13:53:38] 06Operations, 10MediaWiki-General-or-Unknown: Special pages on cswiki have not received updates for 4 days - https://phabricator.wikimedia.org/T135326#2297119 (10Dvorapa) [13:55:58] (03PS1) 10Jcrespo: Reinstante db2058, deleted instead of db1058 [puppet] - 10https://gerrit.wikimedia.org/r/288939 [13:57:29] (03PS2) 10Jcrespo: Reinstate db2058, deleted instead of db1058 [puppet] - 10https://gerrit.wikimedia.org/r/288939 [13:58:58] (03PS2) 10Giuseppe Lavagetto: admin: general service cluster group [puppet] - 10https://gerrit.wikimedia.org/r/288928 (https://phabricator.wikimedia.org/T134251) (owner: 10Rush) [13:59:05] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/288939 (owner: 10Jcrespo) [13:59:16] <_joe_> grrrit-wm: were art thou? [13:59:33] <_joe_> uhm just irssi refresh fail [13:59:57] (03CR) 10Jcrespo: [C: 032] Reinstate db2058, deleted instead of db1058 [puppet] - 10https://gerrit.wikimedia.org/r/288939 (owner: 10Jcrespo) [13:59:59] (03CR) 10Yuvipanda: [C: 04-1] "No realm branching in modules!" 
[puppet] - 10https://gerrit.wikimedia.org/r/288937 (owner: 10Andrew Bogott) [14:01:04] (03PS3) 10Giuseppe Lavagetto: admin: general service cluster group [puppet] - 10https://gerrit.wikimedia.org/r/288928 (https://phabricator.wikimedia.org/T134251) (owner: 10Rush) [14:01:18] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] admin: general service cluster group [puppet] - 10https://gerrit.wikimedia.org/r/288928 (https://phabricator.wikimedia.org/T134251) (owner: 10Rush) [14:02:14] (03PS2) 10Giuseppe Lavagetto: admin: add sc-admins to sc(a|b) hosts [puppet] - 10https://gerrit.wikimedia.org/r/288929 (https://phabricator.wikimedia.org/T134251) (owner: 10Rush) [14:03:06] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] admin: add sc-admins to sc(a|b) hosts [puppet] - 10https://gerrit.wikimedia.org/r/288929 (https://phabricator.wikimedia.org/T134251) (owner: 10Rush) [14:05:41] (03PS1) 10Yuvipanda: notebook: Introduce stub role [puppet] - 10https://gerrit.wikimedia.org/r/288941 (https://phabricator.wikimedia.org/T134716) [14:05:49] 06Operations: Cleanup puppet from unneeded and empty single-service "roots" - https://phabricator.wikimedia.org/T135386#2297133 (10Joe) [14:06:51] (03CR) 10jenkins-bot: [V: 04-1] notebook: Introduce stub role [puppet] - 10https://gerrit.wikimedia.org/r/288941 (https://phabricator.wikimedia.org/T134716) (owner: 10Yuvipanda) [14:07:27] (03PS1) 10Jcrespo: Remove (almost) all references to db1027 on production puppet [puppet] - 10https://gerrit.wikimedia.org/r/288943 (https://phabricator.wikimedia.org/T135253) [14:07:29] (03PS2) 10Yuvipanda: notebook: Introduce stub role [puppet] - 10https://gerrit.wikimedia.org/r/288941 (https://phabricator.wikimedia.org/T134716) [14:08:30] (03CR) 10Jcrespo: "Adding Volans as reviewer in case I do it again." 
[puppet] - 10https://gerrit.wikimedia.org/r/288943 (https://phabricator.wikimedia.org/T135253) (owner: 10Jcrespo) [14:09:09] lol :) [14:10:11] (03CR) 10Yuvipanda: [C: 032] notebook: Introduce stub role [puppet] - 10https://gerrit.wikimedia.org/r/288941 (https://phabricator.wikimedia.org/T134716) (owner: 10Yuvipanda) [14:12:10] (03CR) 10Volans: "is coredb::config mapping in manigests/role/coredb.pp still used?" [puppet] - 10https://gerrit.wikimedia.org/r/288943 (https://phabricator.wikimedia.org/T135253) (owner: 10Jcrespo) [14:13:46] (03PS1) 10Jcrespo: Remove all mentions to db1027, db2008 and db2009 from mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288945 (https://phabricator.wikimedia.org/T135253) [14:14:10] * YuviPanda manigests volans [14:14:48] (03CR) 10Jcrespo: "At the end, when we get rid of old s7-master and x1-slave + other miscs" [puppet] - 10https://gerrit.wikimedia.org/r/288943 (https://phabricator.wikimedia.org/T135253) (owner: 10Jcrespo) [14:15:08] (03PS3) 10Andrew Bogott: Refactor the $hiera_config variable to come from hiera. [puppet] - 10https://gerrit.wikimedia.org/r/288937 [14:16:00] (03CR) 10Volans: [C: 04-1] "One line left" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288945 (https://phabricator.wikimedia.org/T135253) (owner: 10Jcrespo) [14:16:02] (03PS1) 10Yuvipanda: notebook: Teach yuvi to use regexes properly [puppet] - 10https://gerrit.wikimedia.org/r/288946 [14:16:02] PROBLEM - puppet last run on wtp2007 is CRITICAL: CRITICAL: puppet fail [14:16:30] (03PS2) 10Jcrespo: Remove all mentions to db1027, db2008 and db2009 from mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288945 (https://phabricator.wikimedia.org/T135253) [14:16:56] (03CR) 10Jcrespo: "Wait, is it 26 or 27?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/288945 (https://phabricator.wikimedia.org/T135253) (owner: 10Jcrespo) [14:17:36] it is 27 [14:18:40] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/288943 (https://phabricator.wikimedia.org/T135253) (owner: 10Jcrespo) [14:19:11] (03CR) 10Jcrespo: [C: 031] Remove all mentions to db1027, db2008 and db2009 from mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288945 (https://phabricator.wikimedia.org/T135253) (owner: 10Jcrespo) [14:21:42] (03PS1) 10Giuseppe Lavagetto: admin: remove unused services-related roots [puppet] - 10https://gerrit.wikimedia.org/r/288950 (https://phabricator.wikimedia.org/T135386) [14:22:37] (03CR) 10Volans: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288945 (https://phabricator.wikimedia.org/T135253) (owner: 10Jcrespo) [14:26:55] (03PS2) 10Giuseppe Lavagetto: Change prop: Activate back the service [puppet] - 10https://gerrit.wikimedia.org/r/288883 (owner: 10Mobrovac) [14:27:11] (03PS1) 10Elukey: Add new suggested memcached settings to mc1009 as part of perf experiment. [puppet] - 10https://gerrit.wikimedia.org/r/288951 (https://phabricator.wikimedia.org/T129963) [14:29:31] (03PS1) 10BBlack: ssl_ciphersuite: drop CAMELLIA [puppet] - 10https://gerrit.wikimedia.org/r/288952 [14:30:48] 06Operations, 07Graphite: investigate carbon-c-relay stalls/drops towards graphite2002 - https://phabricator.wikimedia.org/T135385#2297214 (10fgiunchedi) I've also noticed this seems to correlate with an elevated number of established tcp connections to the graphite machines. I've been dumping with `ss` every... 
[14:32:34] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Grant root access for user madhuvishy for servers notebook1001 and 1002 - https://phabricator.wikimedia.org/T134716#2297217 (10yuvipanda) @chasemp done [14:34:44] (03CR) 10Faidon Liambotis: [C: 04-1] "Like Alex, I don't see the need for three nearly-identical parser functions. Moreover, I don't see the need for providing a function for t" [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 1020after4) [14:35:53] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Allow RelEng nova log access - https://phabricator.wikimedia.org/T133992#2297222 (10chasemp) >>! In T133992#2296862, @Joe wrote: > I think the people that need to access nova logs should be the same th... [14:36:01] (03PS4) 10Andrew Bogott: Refactor the $hiera_config variable to come from hiera. [puppet] - 10https://gerrit.wikimedia.org/r/288937 [14:36:24] apergos: I've merged, https://gerrit.wikimedia.org/r/#/c/288595/ so, you can revisit https://phabricator.wikimedia.org/T127793 [14:36:36] looking [14:36:59] ah it was on my list of things to test today, but ok, if it's already merged :-) [14:38:05] (03CR) 10Giuseppe Lavagetto: [C: 032] Change prop: Activate back the service [puppet] - 10https://gerrit.wikimedia.org/r/288883 (owner: 10Mobrovac) [14:38:23] 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2297238 (10jcrespo) This is not a #DBA ticket, I made sure that replica was working and safe (filtered) literally hours after the wiki was setup: ``` $ mysql -... [14:39:38] apergos: ah, yet to be branch'ed. [14:39:44] !log general package updates on cache clusters (4.4.0 image, libtasn6-1, libjansson4, libidn11) [14:39:46] ah of course [14:39:49] apergos: so you can probably look tomorrow. 
[14:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:39:54] :) [14:40:04] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2297242 (10elukey) Installed successfully aqs 1004/5, but 1005 fails with: ``` Loading Linux 4.4.0-1-amd64 ... Loading initial ramdisk ... [ 0.113680] [Firmware B... [14:40:06] (03Abandoned) 1020after4: scap::target keyholder-managed ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 1020after4) [14:40:19] but anyways if you note on the ticket that it's there and merged I can be writing a tiny script for it and be testing before it hits a deployed branch [14:40:23] kart_: [14:41:05] 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2297245 (10Krenair) >>! In T135029#2297238, @jcrespo wrote: > I suppose some cron/script maintenance on #labs may be failing to execute? I'm not aware of anythin... [14:41:23] apergos: Sure [14:42:25] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Allow RelEng nova log access - https://phabricator.wikimedia.org/T133992#2297270 (10thcipriani) >>! In T133992#2296862, @Joe wrote: > I think the people that need to access nova logs should be the same... 
[14:42:38] RECOVERY - puppet last run on wtp2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:45:15] !log starting varnish package upgrade on cache_maps [14:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:45:29] !log change-propagation deployed ef5f6ff55 [14:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:47:15] (03PS1) 10Urbanecm: Add 'deletedtext' right to "eliminator" user group on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288956 (https://phabricator.wikimedia.org/T135370) [14:48:46] apergos: updated task. [14:48:52] thanks! [14:49:42] (03PS3) 10Volans: MariaDB: remove special SSL option multiple-ca [puppet] - 10https://gerrit.wikimedia.org/r/288420 (https://phabricator.wikimedia.org/T111654) [14:49:58] (03PS2) 10Yuvipanda: notebook: Teach yuvi to use regexes properly [puppet] - 10https://gerrit.wikimedia.org/r/288946 [14:50:31] 06Operations, 10Beta-Cluster-Infrastructure, 10Deployment-Systems, 13Patch-For-Review, 03Scap3: Automate the generation deployment keys (keyholder-managed ssh keys) - https://phabricator.wikimedia.org/T133211#2297305 (10mmodell) 05Open>03declined [14:52:08] 06Operations, 10ops-eqiad: Rack and Set up new application servers mw1261-1283 - https://phabricator.wikimedia.org/T133798#2297310 (10Cmjohnson) [14:52:32] (03CR) 10Jcrespo: [C: 032] Remove all mentions to db1027, db2008 and db2009 from mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288945 (https://phabricator.wikimedia.org/T135253) (owner: 10Jcrespo) [14:52:44] (03CR) 10Yuvipanda: [C: 032] notebook: Teach yuvi to use regexes properly [puppet] - 10https://gerrit.wikimedia.org/r/288946 (owner: 10Yuvipanda) [14:53:15] 06Operations, 10MediaWiki-General-or-Unknown: Special pages on cswiki have not received updates for 4 days - https://phabricator.wikimedia.org/T135326#2297313 (10Bawolff) The Job gets run every three days, but 
can take more than a day to get through the list. Although usually everything goes fast right up unti... [14:54:27] !log jynus@tin Synchronized wmf-config/db-codfw.php: Remove all mentions to db1027, db2008 and db2009 (duration: 00m 34s) [14:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:54:48] (03PS4) 10Volans: MariaDB: remove special SSL option multiple-ca [puppet] - 10https://gerrit.wikimedia.org/r/288420 (https://phabricator.wikimedia.org/T111654) [14:55:20] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp1060 is CRITICAL: Connection refused [14:55:26] jynus: tell me when you've puppet-merge so I can go with mine [14:55:29] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp3005 is CRITICAL: Connection refused [14:55:38] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp2009 is CRITICAL: Connection refused [14:55:46] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Remove all mentions to db1027, db2008 and db2009 (duration: 00m 33s) [14:55:49] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp1059 is CRITICAL: Connection refused [14:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:55:53] (03CR) 10Yuvipanda: [C: 04-1] "Tyo, otherwise lgtm" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/288937 (owner: 10Andrew Bogott) [14:56:09] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp1047 is CRITICAL: Connection refused [14:56:09] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp1046 is CRITICAL: Connection refused [14:56:10] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp2021 is CRITICAL: Connection refused [14:56:28] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp3003 is CRITICAL: Connection refused [14:56:28] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp3006 is CRITICAL: Connection refused [14:56:29] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp4012 is CRITICAL: Connection refused [14:56:37] bblack: related ^ ? 
[14:56:39] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp3004 is CRITICAL: Connection refused [14:56:39] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp4011 is CRITICAL: Connection refused [14:56:52] probably [14:57:00] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp4020 is CRITICAL: Connection refused [14:57:19] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp2003 is CRITICAL: Connection refused [14:57:29] bblack: fwiw I've seen some more errors on commonswiki API from ApiQueryPageImages::execute in the last few minutes [14:58:16] volans: ? [14:58:32] the maps problem is a configuration issue, it will fix itself shortly and it's beta anyways heh [14:58:55] ok, just checking if could be related [14:58:59] PROBLEM - PyBal backends health check on lvs1002 is CRITICAL: PYBAL CRITICAL - mapslb6_443 - Could not depool server cp1059.eqiad.wmnet because of too many down! [14:59:06] the maps problem is maps-specific [14:59:12] (03CR) 10Hoo man: [C: 032] "trivial" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288404 (owner: 10Matěj Suchánek) [14:59:17] (03PS3) 10Hoo man: Update Wikidata property blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288404 (owner: 10Matěj Suchánek) [14:59:18] PROBLEM - puppet last run on mw2020 is CRITICAL: CRITICAL: puppet fail [14:59:19] bblack: beta_beta_Beta? 
:P [14:59:20] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp2003 is OK: HTTP OK: HTTP/1.1 200 OK - 152 bytes in 0.074 second response time [14:59:29] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp1060 is OK: HTTP OK: HTTP/1.1 200 OK - 152 bytes in 0.001 second response time [14:59:33] (03CR) 10Hoo man: [C: 032] Update Wikidata property blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288404 (owner: 10Matěj Suchánek) [14:59:38] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp3005 is OK: HTTP OK: HTTP/1.1 200 OK - 153 bytes in 0.175 second response time [14:59:44] I don't know what we'll do for a safe cluster to do unsafe things on once maps isn't beta :) [14:59:53] (I guess, be more careful always) [14:59:58] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp1059 is OK: HTTP OK: HTTP/1.1 200 OK - 151 bytes in 0.003 second response time [15:00:00] (03PS5) 10Andrew Bogott: Refactor the $hiera_config variable to come from hiera. [puppet] - 10https://gerrit.wikimedia.org/r/288937 [15:00:18] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp1047 is OK: HTTP OK: HTTP/1.1 200 OK - 152 bytes in 0.003 second response time [15:00:18] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 152 bytes in 0.001 second response time [15:00:19] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp2021 is OK: HTTP OK: HTTP/1.1 200 OK - 152 bytes in 0.085 second response time [15:00:29] (03Merged) 10jenkins-bot: Update Wikidata property blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288404 (owner: 10Matěj Suchánek) [15:00:39] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp3006 is OK: HTTP OK: HTTP/1.1 200 OK - 152 bytes in 0.179 second response time [15:00:39] PROBLEM - Maps HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [15:00:39] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp4012 is OK: HTTP OK: HTTP/1.1 
200 OK - 152 bytes in 0.152 second response time [15:00:50] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp3004 is OK: HTTP OK: HTTP/1.1 200 OK - 152 bytes in 0.166 second response time [15:00:58] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp4011 is OK: HTTP OK: HTTP/1.1 200 OK - 152 bytes in 0.150 second response time [15:01:09] RECOVERY - PyBal backends health check on lvs1002 is OK: PYBAL OK - All pools are healthy [15:01:16] !log hoo@tin Synchronized wmf-config/Wikibase.php: Update Wikidata property blacklist (duration: 00m 25s) [15:01:19] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp4020 is OK: HTTP OK: HTTP/1.1 200 OK - 152 bytes in 0.153 second response time [15:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:02:20] (03PS2) 10BBlack: ssl_ciphersuite: drop CAMELLIA [puppet] - 10https://gerrit.wikimedia.org/r/288952 [15:02:51] (03CR) 10BBlack: [C: 032 V: 032] ssl_ciphersuite: drop CAMELLIA [puppet] - 10https://gerrit.wikimedia.org/r/288952 (owner: 10BBlack) [15:03:33] anomie ostriches thcipriani marktraceur Are we SWATing or not? And where is jouncebot? 
[15:03:42] (03PS2) 10Giuseppe Lavagetto: admin: remove unused services-related roots [puppet] - 10https://gerrit.wikimedia.org/r/288950 (https://phabricator.wikimedia.org/T135386) [15:03:44] jouncebot: next [15:03:44] In 1 hour(s) and 26 minute(s): Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160516T1630) [15:03:54] (03PS1) 10Giuseppe Lavagetto: admin: add all-users group [puppet] - 10https://gerrit.wikimedia.org/r/288957 [15:04:27] there should be a "jouncebot: current" command [15:04:44] I don't really SWAT, I should probably take my name off that list [15:04:50] I can SWAT: akosiaris Krenair ping for SWAT [15:04:56] (03PS2) 10BBlack: cache_upload: raise FE mem from 1/12 to 1/4 total [puppet] - 10https://gerrit.wikimedia.org/r/288935 (https://phabricator.wikimedia.org/T135384) [15:05:17] (03CR) 10Jcrespo: [C: 031] MariaDB: remove special SSL option multiple-ca [puppet] - 10https://gerrit.wikimedia.org/r/288420 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [15:05:18] thcipriani: I am around [15:05:33] (03CR) 10BBlack: [C: 032 V: 032] cache_upload: raise FE mem from 1/12 to 1/4 total [puppet] - 10https://gerrit.wikimedia.org/r/288935 (https://phabricator.wikimedia.org/T135384) (owner: 10BBlack) [15:05:35] (03PS2) 10Thcipriani: wgCopyUploadProxy: Vary per datacenter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287095 (owner: 10Alexandros Kosiaris) [15:05:46] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287095 (owner: 10Alexandros Kosiaris) [15:05:51] (03PS5) 10Volans: MariaDB: remove special SSL option multiple-ca [puppet] - 10https://gerrit.wikimedia.org/r/288420 (https://phabricator.wikimedia.org/T111654) [15:05:53] hi thcipriani [15:05:57] (03CR) 10Alexandros Kosiaris: [C: 032] admin: remove unused services-related roots [puppet] - 10https://gerrit.wikimedia.org/r/288950 (https://phabricator.wikimedia.org/T135386) (owner: 
10Giuseppe Lavagetto) [15:05:58] Krenair: hello [15:06:02] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp2009 is OK: HTTP OK: HTTP/1.1 200 OK - 152 bytes in 0.074 second response time [15:06:07] (03PS3) 10Alexandros Kosiaris: admin: remove unused services-related roots [puppet] - 10https://gerrit.wikimedia.org/r/288950 (https://phabricator.wikimedia.org/T135386) (owner: 10Giuseppe Lavagetto) [15:06:12] (03CR) 10Alexandros Kosiaris: [V: 032] admin: remove unused services-related roots [puppet] - 10https://gerrit.wikimedia.org/r/288950 (https://phabricator.wikimedia.org/T135386) (owner: 10Giuseppe Lavagetto) [15:07:10] (03Merged) 10jenkins-bot: wgCopyUploadProxy: Vary per datacenter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287095 (owner: 10Alexandros Kosiaris) [15:08:43] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: wgCopyUploadProxy: Vary per datacenter [[gerrit:287095]] (duration: 00m 26s) [15:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:08:55] ^ akosiaris check please [15:09:42] !log varnish package upgrade on cache_maps done [15:09:46] (03CR) 10Alex Monk: "like I8f984e51?" 
[puppet] - 10https://gerrit.wikimedia.org/r/288957 (owner: 10Giuseppe Lavagetto) [15:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:10:13] (03PS2) 10Thcipriani: Set meta namespace names for jamwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288628 (https://phabricator.wikimedia.org/T134017) (owner: 10Alex Monk) [15:10:22] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp3003 is OK: HTTP OK: HTTP/1.1 200 OK - 152 bytes in 0.169 second response time [15:10:31] RECOVERY - Maps HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:10:44] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288628 (https://phabricator.wikimedia.org/T134017) (owner: 10Alex Monk) [15:10:51] (03CR) 10Alex Monk: "Hm, no. This does something different. A group to automatically include all human users with shell access to the host?" [puppet] - 10https://gerrit.wikimedia.org/r/288957 (owner: 10Giuseppe Lavagetto) [15:11:00] (03PS2) 10Giuseppe Lavagetto: admin: add all-users group [puppet] - 10https://gerrit.wikimedia.org/r/288957 [15:11:35] (03Merged) 10jenkins-bot: Set meta namespace names for jamwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288628 (https://phabricator.wikimedia.org/T134017) (owner: 10Alex Monk) [15:11:50] (03CR) 10Andrew Bogott: [C: 032] Refactor the $hiera_config variable to come from hiera. [puppet] - 10https://gerrit.wikimedia.org/r/288937 (owner: 10Andrew Bogott) [15:11:56] (03PS6) 10Andrew Bogott: Refactor the $hiera_config variable to come from hiera. [puppet] - 10https://gerrit.wikimedia.org/r/288937 [15:13:05] thcipriani: looks fine to me [15:13:11] akosiaris: kk, thanks [15:14:09] (03CR) 1020after4: "the functions add up to consistency in the way that keyholder is managed in puppet. 
The way it is is a mess and this patch was attempting " [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 1020after4) [15:14:20] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Set meta namespace names for jamwiki [[gerrit:288628]] (duration: 00m 27s) [15:14:27] ^ Krenair check please [15:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:34] does that need namespaceDupes.php run? [15:14:50] didn't see any pages that would need it offhand. [15:15:03] let me see [15:15:42] thcipriani, ran the script, it fixed loads of pagelinks entries [15:15:46] well, 18 [15:15:53] not a huge amount compared to other wikis [15:16:35] Krenair: thanks [15:16:55] all looks good now [15:17:23] okie doke. [15:17:30] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 4 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2297431 (10Krenair) @Nikki, done [15:17:48] 06Operations, 10Beta-Cluster-Infrastructure, 10Deployment-Systems, 13Patch-For-Review, 03Scap3: Automate the generation deployment keys (keyholder-managed ssh keys) - https://phabricator.wikimedia.org/T133211#2297432 (10mmodell) I'm not fighting for this any more. It's not worth it if nobody else sees th... [15:17:53] (03CR) 10Alexandros Kosiaris: [C: 031] "Premise looks fine to me. An inline comment about GID choise. 
It might also make sense to note the automagicness of the group in the commi" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/288957 (owner: 10Giuseppe Lavagetto) [15:18:16] (03PS2) 10Thcipriani: Enable blocking feature of AbuseFilter on ptwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288810 (https://phabricator.wikimedia.org/T134779) (owner: 10Urbanecm) [15:18:49] (03PS1) 10Urbanecm: Add an alias 'ΒΠ' for Project namespace in elwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288965 (https://phabricator.wikimedia.org/T135383) [15:19:14] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288810 (https://phabricator.wikimedia.org/T134779) (owner: 10Urbanecm) [15:20:51] A question, could you, thcipriani, deploy 288965 with my other changes? I made it for T135383 which I found few seconds ago. Thanks! [15:20:51] T135383: Add an alias for Project namespace in elwiki - https://phabricator.wikimedia.org/T135383 [15:22:15] Urbanecm: add it to the deployment page, I'll circle back to it at the end of SWAT. [15:22:31] 06Operations, 06Services, 10cassandra, 13Patch-For-Review, 07RESTBase-architecture: Separate /var on restbase - https://phabricator.wikimedia.org/T113714#2297446 (10Eevans) [15:22:46] if zuul picks up this last CR :\ [15:23:09] (03Merged) 10jenkins-bot: Enable blocking feature of AbuseFilter on ptwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288810 (https://phabricator.wikimedia.org/T134779) (owner: 10Urbanecm) [15:23:16] thcipriani: i've also got a patch up for swat [15:23:35] Thanks thcipriani! Added. [15:23:59] mobrovac: yup. I'll get through Urbanecm two original changes, then yours, then the new patch Urbanecm added. 
[15:24:08] kk [15:25:23] !log thcipriani@tin Synchronized wmf-config/abusefilter.php: SWAT: Enable blocking feature of AbuseFilter on ptwiktionary [[gerrit:288810]] (duration: 00m 24s) [15:25:25] ^ Urbanecm check please [15:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:25:55] (03PS2) 10Thcipriani: Add 'deletedtext' right to "eliminator" user group on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288956 (https://phabricator.wikimedia.org/T135370) (owner: 10Urbanecm) [15:26:11] RECOVERY - puppet last run on mw2020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:26:13] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288956 (https://phabricator.wikimedia.org/T135370) (owner: 10Urbanecm) [15:26:56] thcipriani: Working, thanks. [15:27:04] Urbanecm: thanks for checking [15:27:48] (03Merged) 10jenkins-bot: Add 'deletedtext' right to "eliminator" user group on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288956 (https://phabricator.wikimedia.org/T135370) (owner: 10Urbanecm) [15:28:32] PROBLEM - Apache HTTP on mw1255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:28:38] 06Operations, 06Services, 10cassandra, 13Patch-For-Review, 07RESTBase-architecture: Separate /var on restbase - https://phabricator.wikimedia.org/T113714#1674039 (10Eevans) >>! In T113714#2296627, @fgiunchedi wrote: > supposedly just moving cassandra's data directory to a different path !!and use `cassa... 
[15:29:22] PROBLEM - HHVM rendering on mw1255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:29:29] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Add "deletedtext" right to "eliminator" user group on fawiki [[gerrit:288956]] (duration: 00m 26s) [15:29:31] ^ Urbanecm check please [15:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:29:52] (03PS4) 10Thcipriani: Enable MathML rendering by default on test, wikidata and dewikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286180 (https://phabricator.wikimedia.org/T131177) (owner: 10Physikerwelt) [15:30:00] (03PS3) 10Giuseppe Lavagetto: admin: add all-users group [puppet] - 10https://gerrit.wikimedia.org/r/288957 [15:30:02] (03PS1) 10Giuseppe Lavagetto: peopleweb: switch to all-users [puppet] - 10https://gerrit.wikimedia.org/r/288968 [15:30:27] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286180 (https://phabricator.wikimedia.org/T131177) (owner: 10Physikerwelt) [15:31:01] thcipriani: Working, thanks. [15:31:02] !log starting slow cache_upload frontend restarts (wipes) for cache size upgrades (~10H process) [15:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:31:21] Urbanecm: great! thanks for checking. [15:31:54] (03Merged) 10jenkins-bot: Enable MathML rendering by default on test, wikidata and dewikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286180 (https://phabricator.wikimedia.org/T131177) (owner: 10Physikerwelt) [15:34:16] morebots: I'm going to sync out InitiliseSettings.php then CommonSettings.php one at at time to prevent log blow up (hopefully :)) [15:34:16] I am a logbot running on tools-exec-1220. [15:34:16] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [15:34:16] To log a message, type !log . 
[15:34:32] blerg, tab complete error: mobrovac ^ [15:35:21] kk Th [15:35:25] kk thcipriani [15:35:32] tab fail here too [15:35:32] :D [15:35:33] haha [15:35:47] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable MathML rendering by default on test, wikidata and dewikibooks PART I [[gerrit:286180]] (duration: 00m 26s) [15:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:36:48] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: Enable MathML rendering by default on test, wikidata and dewikibooks PART II [[gerrit:286180]] (duration: 00m 24s) [15:36:52] ^ mobrovac check please [15:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:37:00] yup, checking [15:37:43] thcipriani: works! thnx [15:37:55] mobrovac: great—thanks for checking! [15:38:20] (03PS2) 10Thcipriani: Add an alias 'ΒΠ' for Project namespace in elwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288965 (https://phabricator.wikimedia.org/T135383) (owner: 10Urbanecm) [15:38:23] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Allow mobrovac to run puppet on SC(A|B) - https://phabricator.wikimedia.org/T134251#2297511 (10Joe) 05Open>03Resolved [15:38:46] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288965 (https://phabricator.wikimedia.org/T135383) (owner: 10Urbanecm) [15:39:35] (03Merged) 10jenkins-bot: Add an alias 'ΒΠ' for Project namespace in elwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288965 (https://phabricator.wikimedia.org/T135383) (owner: 10Urbanecm) [15:41:50] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2297535 (10BBlack) cache_maps cluster switched to the new varnish package today [15:42:08] !log thcipriani@tin Synchronized 
wmf-config/InitialiseSettings.php: SWAT: Add an alias "ΒΠ" for Project namespace in elwiki [[gerrit:288965]] (duration: 00m 26s) [15:42:13] ^ Urbanecm check please [15:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:43:11] thcipriani, working, thanks. [15:43:17] Urbanecm: thank you [15:44:22] (03CR) 10Alexandros Kosiaris: "I am not sure what is meant by fighting over this one. At least I have tried to give some constructive feedback on this one and Faidon has" [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 1020after4) [15:45:34] (03PS4) 10Alexandros Kosiaris: Introduce service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/288613 [15:47:52] (03CR) 1020after4: "@alexandros: I appreciate the feedback, however, it doesn't seem like people like the approach (parser functions) and this patchset no lon" [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 1020after4) [15:49:39] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review: Spark yarn in client mode is never moved from ACCEPTED to RUNNING - https://phabricator.wikimedia.org/T134422#2297601 (10Nuria) 05Open>03Resolved [15:52:03] <_joe_> :q [15:52:06] <_joe_> yeah [15:55:32] <_joe_> :q/win 25 [15:55:36] <_joe_> oh man [15:59:55] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Grant root access for user madhuvishy for servers notebook1001 and 1002 - https://phabricator.wikimedia.org/T134716#2297639 (10madhuvishy) [16:08:18] (03PS1) 10Andrew Bogott: Pick up private testings from labtest-private.yaml [puppet] - 10https://gerrit.wikimedia.org/r/288970 [16:09:25] (03PS1) 10Ottomata: Set inter.broker.protocol = 0.9.0.x for kafka1013 [puppet] - 10https://gerrit.wikimedia.org/r/288971 (https://phabricator.wikimedia.org/T121562) [16:09:27] (03PS1) 10Ottomata: Set inter.broker.protocol = 0.9.0.x for kafka1014 [puppet] - 
10https://gerrit.wikimedia.org/r/288972 (https://phabricator.wikimedia.org/T121562) [16:09:29] (03PS1) 10Ottomata: Set inter.broker.protocol = 0.9.0.x for kafka1018 [puppet] - 10https://gerrit.wikimedia.org/r/288973 (https://phabricator.wikimedia.org/T121562) [16:09:31] (03PS1) 10Ottomata: Set inter.broker.protocol = 0.9.0.x for kafka1020 [puppet] - 10https://gerrit.wikimedia.org/r/288974 (https://phabricator.wikimedia.org/T121562) [16:09:33] (03PS1) 10Ottomata: Set inter.broker.protocol = 0.9.0.x for kafka1022 [puppet] - 10https://gerrit.wikimedia.org/r/288975 (https://phabricator.wikimedia.org/T121562) [16:09:58] 06Operations, 10Traffic, 13Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2297698 (10BBlack) [16:10:39] 06Operations, 10Traffic, 13Patch-For-Review: Support websockets in cache_misc - https://phabricator.wikimedia.org/T134870#2280319 (10BBlack) Added HTTP/1.1 keepalive task as a blocker. We don't really need keepalives for this, but the work ongoing in that ticket is about resolving the issue with turning on... 
[16:12:15] (03CR) 10Ottomata: [C: 032] Set inter.broker.protocol = 0.9.0.x for kafka1013 [puppet] - 10https://gerrit.wikimedia.org/r/288971 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [16:15:01] !log restarting broker on kafka1013 [16:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:15:12] (03CR) 10Yuvipanda: [C: 031] Pick up private testings from labtest-private.yaml [puppet] - 10https://gerrit.wikimedia.org/r/288970 (owner: 10Andrew Bogott) [16:16:24] (03PS2) 10Andrew Bogott: Pick up private testings from labtest-private.yaml [puppet] - 10https://gerrit.wikimedia.org/r/288970 [16:17:33] 06Operations, 10MediaWiki-General-or-Unknown: Special pages on cswiki not updated for longer than usual (since May 12th 6am) - https://phabricator.wikimedia.org/T135326#2297724 (10Danny_B) [16:18:43] (03CR) 10Ottomata: [C: 032] Set inter.broker.protocol = 0.9.0.x for kafka1014 [puppet] - 10https://gerrit.wikimedia.org/r/288972 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [16:19:06] (03CR) 10Andrew Bogott: [C: 032] Pick up private testings from labtest-private.yaml [puppet] - 10https://gerrit.wikimedia.org/r/288970 (owner: 10Andrew Bogott) [16:19:33] (03PS3) 10Andrew Bogott: Pick up private testings from labtest-private.yaml [puppet] - 10https://gerrit.wikimedia.org/r/288970 [16:19:56] 07Blocked-on-Operations, 06Operations, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2297733 (10Eevans) It would appear the bootstrap of restbase2008-b.codfw.wmnet has encountered an error while... [16:20:18] 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2286195 (10chasemp) Can someone update this task description? As-is I have no idea what's going on here. 
I know this is in large part because of my own lack of... [16:23:17] !log restarting kafka broker on kafka1014 [16:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:25:08] 06Operations, 10Traffic, 07HTTPS, 05MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1451751 (10chasemp) >>! In T105794#2294355, @BBlack wrote: > Announcement email (finally) sent! The cutoff dates/process are: > .... > > The Community team wi... [16:27:24] (03PS2) 10Ottomata: Set inter.broker.protocol = 0.9.0.x for kafka1018 [puppet] - 10https://gerrit.wikimedia.org/r/288973 (https://phabricator.wikimedia.org/T121562) [16:27:49] (03CR) 10Ottomata: [C: 032] Set inter.broker.protocol = 0.9.0.x for kafka1018 [puppet] - 10https://gerrit.wikimedia.org/r/288973 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [16:27:57] (03CR) 10Ottomata: [V: 032] Set inter.broker.protocol = 0.9.0.x for kafka1018 [puppet] - 10https://gerrit.wikimedia.org/r/288973 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [16:28:02] PROBLEM - statsv process on hafnium is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args statsv [16:29:39] (03CR) 10Yuvipanda: [C: 032 V: 032] Add 'hostautomounter' admission controller [software/kubernetes] - 10https://gerrit.wikimedia.org/r/288770 (owner: 10Yuvipanda) [16:29:50] (03CR) 10Yuvipanda: [C: 032 V: 032] uidenforcer: Set GID as well as UID [software/kubernetes] - 10https://gerrit.wikimedia.org/r/288771 (owner: 10Yuvipanda) [16:30:02] (03CR) 10Yuvipanda: [C: 032 V: 032] Add hostpathenforcer admission controller [software/kubernetes] - 10https://gerrit.wikimedia.org/r/288775 (owner: 10Yuvipanda) [16:30:04] gehel smalyshev: Dear anthropoid, the time has come. Please deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160516T1630). 
[16:30:11] RECOVERY - statsv process on hafnium is OK: PROCS OK: 13 processes with command name python, args statsv [16:30:50] (03PS15) 10Eevans: Cassandra 2.2.6 config [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) [16:34:06] (03PS1) 10Yuvipanda: tools: Bump k8s version [puppet] - 10https://gerrit.wikimedia.org/r/288983 [16:34:25] (03PS2) 10Yuvipanda: tools: Bump k8s version [puppet] - 10https://gerrit.wikimedia.org/r/288983 [16:36:06] !log restarting kafka broker on kafka1018 [16:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:37:06] (03CR) 10Yuvipanda: [C: 032] tools: Bump k8s version [puppet] - 10https://gerrit.wikimedia.org/r/288983 (owner: 10Yuvipanda) [16:40:25] joal: +2 merge away [16:40:27] oops [16:40:28] wrong room [16:41:00] (03PS2) 10Ottomata: Set inter.broker.protocol = 0.9.0.x for kafka1020 [puppet] - 10https://gerrit.wikimedia.org/r/288974 (https://phabricator.wikimedia.org/T121562) [16:41:16] (03CR) 10Ottomata: [C: 032 V: 032] Set inter.broker.protocol = 0.9.0.x for kafka1020 [puppet] - 10https://gerrit.wikimedia.org/r/288974 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [16:42:01] (03PS1) 10Cmjohnson: Adding dns entries for mw1284-1306 [dns] - 10https://gerrit.wikimedia.org/r/288984 [16:48:00] (03CR) 10Mobrovac: "The puppet compiler is pretty happy with the change - https://puppet-compiler.wmflabs.org/2814/ . Yet have to do a round of review." 
[puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [16:54:44] !log restarted kafka broker on kafka1020 (about 5 mins ago) [16:54:46] (03CR) 10Dzahn: "i'll take a look, can create that in the private repo, rebasing" [puppet] - 10https://gerrit.wikimedia.org/r/287590 (https://phabricator.wikimedia.org/T134726) (owner: 10Ladsgroup) [16:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:07] (03PS4) 10Dzahn: ores: Send icigna report to IRC [puppet] - 10https://gerrit.wikimedia.org/r/287590 (https://phabricator.wikimedia.org/T134726) (owner: 10Ladsgroup) [16:55:14] (03PS2) 10Ottomata: Set inter.broker.protocol = 0.9.0.x for kafka1022 [puppet] - 10https://gerrit.wikimedia.org/r/288975 (https://phabricator.wikimedia.org/T121562) [16:55:17] (03PS5) 10Dzahn: ores: Send icinga report to IRC [puppet] - 10https://gerrit.wikimedia.org/r/287590 (https://phabricator.wikimedia.org/T134726) (owner: 10Ladsgroup) [16:55:30] (03CR) 10Ottomata: [C: 032] Set inter.broker.protocol = 0.9.0.x for kafka1022 [puppet] - 10https://gerrit.wikimedia.org/r/288975 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [16:55:37] (03CR) 10Ottomata: [V: 032] Set inter.broker.protocol = 0.9.0.x for kafka1022 [puppet] - 10https://gerrit.wikimedia.org/r/288975 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [16:58:11] (03CR) 1020after4: "I will propose a new patch as soon as I can figure out an approach that will be acceptable to the operations team." 
[puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 1020after4) [16:59:39] !log restarting kafka broker on kafka1022 [16:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:00:29] !log bounce cassandra-b on restbase2008-b to restart bootstrap [17:00:33] (03PS6) 10Dzahn: ores: Send icinga report to IRC [puppet] - 10https://gerrit.wikimedia.org/r/287590 (https://phabricator.wikimedia.org/T134726) (owner: 10Ladsgroup) [17:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:01:04] (03CR) 10Dzahn: [C: 032] "created icinga contact in private repo" [puppet] - 10https://gerrit.wikimedia.org/r/287590 (https://phabricator.wikimedia.org/T134726) (owner: 10Ladsgroup) [17:03:52] (03PS4) 10Giuseppe Lavagetto: admin: add all-users group [puppet] - 10https://gerrit.wikimedia.org/r/288957 [17:03:54] (03PS2) 10Giuseppe Lavagetto: peopleweb: switch to all-users [puppet] - 10https://gerrit.wikimedia.org/r/288968 [17:04:08] <_joe_> akosiaris: rewritten, now it supposedly works :P [17:04:34] <_joe_> or maybe not, but it does the right thing [17:05:24] (03CR) 10jenkins-bot: [V: 04-1] admin: add all-users group [puppet] - 10https://gerrit.wikimedia.org/r/288957 (owner: 10Giuseppe Lavagetto) [17:05:31] (03CR) 10jenkins-bot: [V: 04-1] peopleweb: switch to all-users [puppet] - 10https://gerrit.wikimedia.org/r/288968 (owner: 10Giuseppe Lavagetto) [17:05:54] (03CR) 10Dzahn: [C: 031] "it's not like i have tested the parser function, but +1 for having this group for all users without privileges" [puppet] - 10https://gerrit.wikimedia.org/r/288957 (owner: 10Giuseppe Lavagetto) [17:06:14] (03CR) 10Dzahn: [C: 031] peopleweb: switch to all-users [puppet] - 10https://gerrit.wikimedia.org/r/288968 (owner: 10Giuseppe Lavagetto) [17:07:21] <_joe_> mutante: leave it for now [17:07:32] <_joe_> I have to do a few tweaks [17:07:44] ok, yes [17:09:03] 06Operations, 
10ops-eqiad, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2297943 (10elukey) Really weird, after rebooting a couple of times: ``` Loading Linux 4.4.0-1-amd64 ... Loading initial ramdisk ... [ 0.113896] [Firmware Bug]: th... [17:09:13] (03PS5) 10Giuseppe Lavagetto: admin: add all-users group [puppet] - 10https://gerrit.wikimedia.org/r/288957 [17:09:15] (03PS3) 10Giuseppe Lavagetto: peopleweb: switch to all-users [puppet] - 10https://gerrit.wikimedia.org/r/288968 [17:09:18] <_joe_> actually I need a pause, bbiab [17:09:31] icinga might report config issue, already on it. will just need one more puppet run [17:10:30] (03CR) 10jenkins-bot: [V: 04-1] admin: add all-users group [puppet] - 10https://gerrit.wikimedia.org/r/288957 (owner: 10Giuseppe Lavagetto) [17:10:35] (03CR) 10jenkins-bot: [V: 04-1] peopleweb: switch to all-users [puppet] - 10https://gerrit.wikimedia.org/r/288968 (owner: 10Giuseppe Lavagetto) [17:14:42] 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2297959 (10debt) Hi there - we need to update the stats on the wikipedia.org portal page on a regular basis. We've recently found that the script we run to do th... 
[17:22:27] (03PS6) 10Giuseppe Lavagetto: admin: add all-users group [puppet] - 10https://gerrit.wikimedia.org/r/288957 [17:23:22] (03CR) 10jenkins-bot: [V: 04-1] admin: add all-users group [puppet] - 10https://gerrit.wikimedia.org/r/288957 (owner: 10Giuseppe Lavagetto) [17:25:07] (03CR) 10Dzahn: "tested, output seen in #wikimedia-ai after a direct echo into the logfile" [puppet] - 10https://gerrit.wikimedia.org/r/287590 (https://phabricator.wikimedia.org/T134726) (owner: 10Ladsgroup) [17:27:36] (03PS3) 10Dzahn: add gehel to wdqs icinga contact group [puppet] - 10https://gerrit.wikimedia.org/r/288735 [17:28:32] (03PS4) 10Dzahn: add gehel to wdqs icinga contact group [puppet] - 10https://gerrit.wikimedia.org/r/288735 [17:28:41] (03CR) 10Dzahn: [C: 032] add gehel to wdqs icinga contact group [puppet] - 10https://gerrit.wikimedia.org/r/288735 (owner: 10Dzahn) [17:31:33] (03PS7) 10Giuseppe Lavagetto: admin: add all-users group [puppet] - 10https://gerrit.wikimedia.org/r/288957 [17:32:51] (03PS3) 10Ottomata: Remove confluent conditional in role::kafka::analytics::broker [puppet] - 10https://gerrit.wikimedia.org/r/286708 (https://phabricator.wikimedia.org/T121562) [17:33:35] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Grant root access for user madhuvishy for servers notebook1001 and 1002 - https://phabricator.wikimedia.org/T134716#2298022 (10Dzahn) Would you mind if we put base::firewall in the stub role? If we need to open any holes in it for madhuvishy i'm happy... [17:37:17] (03PS8) 10Giuseppe Lavagetto: admin: add all-users group [puppet] - 10https://gerrit.wikimedia.org/r/288957 [17:37:31] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Grant root access for user madhuvishy for servers notebook1001 and 1002 - https://phabricator.wikimedia.org/T134716#2274165 (10Ottomata) Dunno if this is how notebooks works, but I wouldn't be surprised if each notebook instance listened on a random no... 
[17:37:35] (03PS1) 10Dzahn: notebook: add base::firewall to stub role [puppet] - 10https://gerrit.wikimedia.org/r/288989 (https://phabricator.wikimedia.org/T134716) [17:37:51] (03CR) 10Ottomata: [C: 032] Remove confluent conditional in role::kafka::analytics::broker [puppet] - 10https://gerrit.wikimedia.org/r/286708 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [17:39:54] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#1881753 (10Ottomata) AAAnnnd we are done! [17:42:33] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2298054 (10Ottomata) HMm, @elukey let's keep an eye on Broker Log Size: https://grafana.wikimedia.org/dashboard/db/kafka?pa... [17:43:19] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Grant root access for user madhuvishy for servers notebook1001 and 1002 - https://phabricator.wikimedia.org/T134716#2298056 (10chasemp) per ops meeting: access approved, post setup and puppetization the boxes are to be reinstalled and long term access... [17:43:32] (03PS1) 10Dzahn: admin: create group notebook-roots [puppet] - 10https://gerrit.wikimedia.org/r/288990 (https://phabricator.wikimedia.org/T134716) [17:43:37] ottomata: got it --^ [17:43:48] aye :) [17:43:48] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Grant root access for user madhuvishy for servers notebook1001 and 1002 - https://phabricator.wikimedia.org/T134716#2298059 (10madhuvishy) @Ottomata It doesn't do that :) There is a Proxy service that listens on all public interfaces on one port - it t... [17:44:20] 06Operations, 06Services, 10cassandra, 13Patch-For-Review, 07RESTBase-architecture: Separate /var on restbase - https://phabricator.wikimedia.org/T113714#2298063 (10fgiunchedi) >>! 
In T113714#2297469, @Eevans wrote: >>>! In T113714#2296627, @fgiunchedi wrote: >> supposedly just moving cassandra's data di... [17:45:19] 06Operations: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2298064 (10CCogdill_WMF) [17:45:21] (03PS1) 10Andrew Bogott: Set the proper .tld in labtest resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/288991 [17:45:45] (03CR) 10Andrew Bogott: "this needs testing on a labs instance -- could be dangerous." [puppet] - 10https://gerrit.wikimedia.org/r/288991 (owner: 10Andrew Bogott) [17:46:29] (03CR) 10Rush: [C: 031] "I have previously put up https://gerrit.wikimedia.org/r/#/c/244471/ and done nothing with it. Even money on which approach is more sane. " [puppet] - 10https://gerrit.wikimedia.org/r/288957 (owner: 10Giuseppe Lavagetto) [17:47:01] (03CR) 10Rush: [C: 031] admin: create group notebook-roots [puppet] - 10https://gerrit.wikimedia.org/r/288990 (https://phabricator.wikimedia.org/T134716) (owner: 10Dzahn) [17:47:17] (03PS2) 10Dzahn: admin: create group notebook-roots [puppet] - 10https://gerrit.wikimedia.org/r/288990 (https://phabricator.wikimedia.org/T134716) [17:51:34] (03PS1) 10Dzahn: admin: give madhuvishy root on notebook servers [puppet] - 10https://gerrit.wikimedia.org/r/288993 (https://phabricator.wikimedia.org/T134716) [17:52:33] (03PS9) 10Giuseppe Lavagetto: admin: add all-users group [puppet] - 10https://gerrit.wikimedia.org/r/288957 [17:52:44] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] admin: add all-users group [puppet] - 10https://gerrit.wikimedia.org/r/288957 (owner: 10Giuseppe Lavagetto) [17:53:43] <_joe_> wait before merging anything else please [17:54:31] (03PS4) 10Giuseppe Lavagetto: peopleweb: switch to all-users [puppet] - 10https://gerrit.wikimedia.org/r/288968 [17:59:44] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Grant root access for user madhuvishy for servers notebook1001 and 1002 - 
https://phabricator.wikimedia.org/T134716#2298137 (10madhuvishy) @Dzahn I think base::firewall is fine - We'd need to open up a single port for the proxy service to listen to for... [18:04:01] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:05:31] 06Operations, 06Services, 10cassandra, 13Patch-For-Review, 07RESTBase-architecture: Separate /var on restbase - https://phabricator.wikimedia.org/T113714#2298161 (10Eevans) >>! In T113714#2298063, @fgiunchedi wrote: >>>! In T113714#2297469, @Eevans wrote: >>>>! In T113714#2296627, @fgiunchedi wrote: >>>... [18:05:45] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/2811/rutherfordium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/288968 (owner: 10Giuseppe Lavagetto) [18:06:51] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:08:04] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Grant root access for user madhuvishy for servers notebook1001 and 1002 - https://phabricator.wikimedia.org/T134716#2298170 (10Dzahn) @Madhuvishy Cool! Once we get to that point, let me know which port and from which sources we need to allow it and i'm... 
[18:08:50] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:09:56] (03PS2) 10Dzahn: notebook: add base::firewall to stub role [puppet] - 10https://gerrit.wikimedia.org/r/288989 (https://phabricator.wikimedia.org/T134716) [18:10:43] (03PS3) 10Dzahn: notebook: add base::firewall to stub role [puppet] - 10https://gerrit.wikimedia.org/r/288989 (https://phabricator.wikimedia.org/T134716) [18:10:50] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [18:11:00] (03CR) 10Dzahn: [C: 032] notebook: add base::firewall to stub role [puppet] - 10https://gerrit.wikimedia.org/r/288989 (https://phabricator.wikimedia.org/T134716) (owner: 10Dzahn) [18:12:00] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [18:15:47] 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2298231 (10chasemp) [18:15:49] (03CR) 10Dzahn: "yep, added iptables rules on puppet run on notebook1001" [puppet] - 10https://gerrit.wikimedia.org/r/288989 (https://phabricator.wikimedia.org/T134716) (owner: 10Dzahn) [18:17:37] 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2298268 (10MaxSem) [18:19:03] (03CR) 10Thcipriani: [C: 04-1] "Can't speak to most of it, obviously, there is one incorrect scap::target parameter. Comment inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [18:20:54] 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2298285 (10MaxSem) I've edited the description back. The issue in this ticket is unavailability of adywiki replicas. The updates are discussed in a blocked ticke... 
[18:23:21] 06Operations, 10Traffic, 13Patch-For-Review: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384#2298299 (10BBlack) Note: the first upload cluster converted to 25% during a slow restart of all the frontends was cp3048, so it's had the longest runtime so far on the new... [18:24:50] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:25:09] 06Operations, 10Fundraising-Backlog: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2298303 (10DStrine) [18:32:25] 06Operations, 07Graphite: investigate carbon-c-relay stalls/drops towards graphite2002 - https://phabricator.wikimedia.org/T135385#2298366 (10Eevans) [18:32:47] (03PS1) 10BBlack: cache_upload: remove redundant backend ttl_cap [puppet] - 10https://gerrit.wikimedia.org/r/288999 [18:33:40] (03PS3) 10Dzahn: admin: create group notebook-roots [puppet] - 10https://gerrit.wikimedia.org/r/288990 (https://phabricator.wikimedia.org/T134716) [18:34:16] (03PS2) 10BBlack: cache_upload: remove redundant backend ttl_cap [puppet] - 10https://gerrit.wikimedia.org/r/288999 [18:35:24] (03CR) 10Dzahn: [C: 032] admin: create group notebook-roots [puppet] - 10https://gerrit.wikimedia.org/r/288990 (https://phabricator.wikimedia.org/T134716) (owner: 10Dzahn) [18:35:34] mutante: thanks for dealing with the notebook stuff :) [18:35:44] YuviPanda: yw:) [18:36:07] (03PS1) 10BBlack: varnish::instance: remove unused arg xff_sources [puppet] - 10https://gerrit.wikimedia.org/r/289000 [18:36:30] (03CR) 10BBlack: [C: 032] "compiler no-op in VCL outputs" [puppet] - 10https://gerrit.wikimedia.org/r/288999 (owner: 10BBlack) [18:36:43] (03PS3) 10BBlack: cache_upload: remove redundant backend ttl_cap [puppet] - 10https://gerrit.wikimedia.org/r/288999 [18:36:53] (03CR) 10BBlack: [V: 032] cache_upload: remove redundant backend ttl_cap [puppet] - 10https://gerrit.wikimedia.org/r/288999 (owner: 
10BBlack) [18:37:04] (03PS2) 10BBlack: varnish::instance: remove unused arg xff_sources [puppet] - 10https://gerrit.wikimedia.org/r/289000 [18:37:11] (03CR) 10BBlack: [C: 032 V: 032] varnish::instance: remove unused arg xff_sources [puppet] - 10https://gerrit.wikimedia.org/r/289000 (owner: 10BBlack) [18:44:14] 06Operations, 10hardware-requests: reclaim restbase1001-1006 to spares - https://phabricator.wikimedia.org/T130752#2298394 (10RobH) [18:51:08] (03PS1) 10Dzahn: notebook: add debdeploy grains [puppet] - 10https://gerrit.wikimedia.org/r/289001 [18:51:40] (03PS2) 10Bartosz Dziewoński: Prepare Commons configuration for $wgUploadDialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287959 (https://phabricator.wikimedia.org/T134775) [18:52:26] (03PS2) 10Dzahn: notebook: add debdeploy grains [puppet] - 10https://gerrit.wikimedia.org/r/289001 [18:53:26] 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2298442 (10chasemp) >>! In T135029#2297238, @jcrespo wrote: > This is not a #DBA ticket, I made sure that replica was working and safe (filtered) literally hours... [18:54:10] 06Operations, 06Performance-Team, 07Availability: Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809#2298451 (10aaron) a:05aaron>03None [18:55:28] (03PS2) 10Dzahn: admin: give madhuvishy root on notebook servers [puppet] - 10https://gerrit.wikimedia.org/r/288993 (https://phabricator.wikimedia.org/T134716) [18:56:28] (03PS3) 10Dzahn: notebook: add debdeploy grains [puppet] - 10https://gerrit.wikimedia.org/r/289001 (https://phabricator.wikimedia.org/T134716) [18:57:27] (03CR) 10Dzahn: [C: 032] "has manager approval, has ops approval, has waited long enough, existing shell access to no new docs to be signed.." 
[puppet] - 10https://gerrit.wikimedia.org/r/288993 (https://phabricator.wikimedia.org/T134716) (owner: 10Dzahn) [18:57:33] (03Abandoned) 10Krinkle: Remove duplicate lru_crawler option from mc[12]009 [puppet] - 10https://gerrit.wikimedia.org/r/287178 (owner: 10Ori.livneh) [18:58:18] 06Operations, 10hardware-requests: reclaim restbase1001-1006 to spares - https://phabricator.wikimedia.org/T130752#2298467 (10RobH) Ok, chatting with Chris we are good to add these to spares. However, restbase1001 isn't called that anymore, so it has been used for something else since. I'm adding the others... [19:04:32] 06Operations, 10hardware-requests: additional graphite machines request, 1x per DC - https://phabricator.wikimedia.org/T126253#2298487 (10RobH) [19:04:34] 06Operations, 10hardware-requests: reclaim restbase1001-1006 to spares - https://phabricator.wikimedia.org/T130752#2298484 (10RobH) 05Open>03Resolved So graphite1003 was once restbase1001, figured it out, resolving this task. spares have been added to the spares tracking sheet. [19:04:59] madhuvishy: your user has been created on notebook1001/1002 ..now. i see you already have shell on other servers, so you can just jump to these now via the bastion too [19:05:37] mutante: thanks a ton! I'm able to log in :D [19:06:44] madhuvishy: :) sudo cat /etc/sudoers.d/notebook-roots [19:09:14] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Grant root access for user madhuvishy for servers notebook1001 and 1002 - https://phabricator.wikimedia.org/T134716#2298502 (10Dzahn) 05Open>03Resolved a:03Dzahn 12:10 < mutante> madhuvishy: your user has been created on notebook1001/1002 ..now.... [19:29:36] (03CR) 10Andrew Bogott: "I've cherry-picked this onto a labs instance and confirmed that it's a no-op." 
[puppet] - 10https://gerrit.wikimedia.org/r/288991 (owner: 10Andrew Bogott) [19:37:59] (03CR) 10Yuvipanda: [C: 031] Set the proper .tld in labtest resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/288991 (owner: 10Andrew Bogott) [19:38:41] (03PS2) 10Andrew Bogott: Set the proper .tld in labtest resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/288991 [19:40:24] (03CR) 10Rush: [C: 04-1] Set the proper .tld in labtest resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/288991 (owner: 10Andrew Bogott) [19:41:35] (03PS1) 10Milimetric: Support additional reportupdater directories [puppet] - 10https://gerrit.wikimedia.org/r/289007 (https://phabricator.wikimedia.org/T126549) [19:44:10] (03PS3) 10Andrew Bogott: Set the proper .tld in labtest resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/288991 [19:44:32] (03PS4) 10Rush: Set the proper .tld in labtest resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/288991 (owner: 10Andrew Bogott) [19:45:04] (03CR) 10Rush: [C: 031] Set the proper .tld in labtest resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/288991 (owner: 10Andrew Bogott) [19:46:32] (03CR) 10Andrew Bogott: [C: 032] Set the proper .tld in labtest resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/288991 (owner: 10Andrew Bogott) [19:55:38] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Allow RelEng nova log access - https://phabricator.wikimedia.org/T133992#2298923 (10Dzahn) The contint-admins group already has this: 159 'ALL = NOPASSWD: /bin/journalctl*', But tha... [19:59:57] 06Operations, 06Labs, 10Tool-Labs, 13Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2298927 (10Dzahn) [20:00:04] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. 
Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160516T2000). [20:00:07] 06Operations, 06Labs, 10Tool-Labs, 13Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2277377 (10Dzahn) p:05Triage>03Normal [20:00:35] no mobileapps deployment today [20:04:52] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [20:06:23] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [20:07:01] no parsoid deploy today [20:07:48] andrewbogott: ^ unless you meant to wait with merging [20:08:18] mutante: ok, please merge, sorry [20:08:34] ok, done [20:08:44] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [20:09:17] (03PS1) 10BBlack: VCL: X-Cache simplification [puppet] - 10https://gerrit.wikimedia.org/r/289015 [20:10:22] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [20:12:11] thcipriani: on labnodepool1001: sudo journalctl -u openstack-nova-api [20:12:35] thcipriani: i think you might already have that log access without a change [20:12:46] * thcipriani checks [20:13:21] the contint-admins group is on that host and has journalctl with a wildcard [20:13:31] it should work on any unit/service [20:13:53] oh. yeah, didn't prompt me on that box. [20:13:56] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Allow RelEng nova log access - https://phabricator.wikimedia.org/T133992#2298968 (10Dzahn) [labnodepool1001:~] $ id thcipriani uid=11634(thcipriani) gid=500(wikidev) groups=500(wikidev),719(contint-adm... 
[20:14:52] thcipriani: :) i think the ticket was pre-resolved [20:16:08] if the image service part is also a unit [20:17:54] hmm, I'm trying to remember the context of that problem. I think there is another machine where contint-admins aren't added that we wanted access to. /me is digging through IRC log [20:18:12] or maybe the group was added to labnodepool at some point [20:18:26] historically that was just on gallium [20:18:40] and i expected for a moment then the ticket is to add the group to the host [20:18:49] (03PS1) 10BBlack: VCL: No X-Cache for PURGE in Varnish3 [puppet] - 10https://gerrit.wikimedia.org/r/289018 [20:19:32] (03PS1) 10Andrew Bogott: Set labs_tld: "labtest" for labtest instances. [puppet] - 10https://gerrit.wikimedia.org/r/289019 [20:19:37] hmm. not that recent [20:19:42] it was added on Jul 31 2015 [20:20:54] ah, so looking at my irc logs it was the machine labnet1002 that we were stuck looking at as a black box. [20:21:26] aha! this makes sense, since all the groups are added by role or hosts in hiera [20:21:50] on labnodepool1001 we could see via nova list that it was trying to make connections to labnet1002, but couldn't see anything beyond that. [20:21:51] (03CR) 10Andrew Bogott: [C: 032] Set labs_tld: "labtest" for labtest instances. [puppet] - 10https://gerrit.wikimedia.org/r/289019 (owner: 10Andrew Bogott) [20:21:54] could you comment on that ticket which hosts it is for [20:22:00] sure thing. 
[20:22:02] i'll upload a change to add it to that [20:23:32] "allow releng nova log access" -> "add contint-admins on labtest" [20:23:41] looks [20:24:21] heh, "labtest" [20:24:23] labtestcontrol2001.yaml labtestneutron2001.yaml labtestvirt2001.yaml [20:24:26] labtestnet2001.yaml labtestservices2001.yaml labtestweb2001.yaml [20:24:45] :p we'll need to figure out which of these for real [20:26:30] labnet != labtest i'm just confused [20:26:40] oh boy :) [20:30:42] can't add by role, "labnet" doesnt have one, need to add by host name [20:31:49] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Allow RelEng nova log access - https://phabricator.wikimedia.org/T133992#2298978 (10thcipriani) When we were troubleshooting nodepool problems, we could tail logs from `labnodepool1001`. The permission... [20:33:15] (03PS1) 10Dzahn: admin: add contint-admins on labnet100[12] [puppet] - 10https://gerrit.wikimedia.org/r/289020 (https://phabricator.wikimedia.org/T133992) [20:33:44] thcipriani: ^ so if "give contint-admins shell on labnet1001/1002 and let them read logs there" sounds right.. then that ^ [20:34:02] or even another group [20:34:20] will add some reviewers [20:36:08] (03PS2) 10Dzahn: admin: add contint-admins on labnet100[12] [puppet] - 10https://gerrit.wikimedia.org/r/289020 (https://phabricator.wikimedia.org/T133992) [20:36:08] mutante: hmmm we may need another group, that may give us a lot of permissions we don't necessarily want. I'd like hashar to take a look at this too: make sure I'm not missing anything. [20:37:15] thcipriani: yes, same here. 
i already dont like it myself, heh [20:37:21] :( [20:37:25] er :) [20:37:37] (wrong paren) [20:37:40] also the part that it's not by role [20:37:45] because labtest doesnt have one yet [20:37:58] then if we add labtest1003 it's gonna be an issue again [20:38:05] labnet damnit [20:39:01] (03PS48) 10Ladsgroup: ores: Scap3 deployment configurations [puppet] - 10https://gerrit.wikimedia.org/r/280403 [20:39:52] ok, well we'll reverify that we're all on the same page in the ticket. request was a little squishy, sorry about that. thanks for the patch! [20:39:59] (03Abandoned) 10Dzahn: admin: add contint-admins on labnet100[12] [puppet] - 10https://gerrit.wikimedia.org/r/289020 (https://phabricator.wikimedia.org/T133992) (owner: 10Dzahn) [20:40:36] thcipriani: ok, yep, on the ticket is great. np [20:42:21] (03PS1) 10Ladsgroup: ores: precaching goes down sometimes, making it more verbose [puppet] - 10https://gerrit.wikimedia.org/r/289036 (https://phabricator.wikimedia.org/T135444) [20:54:33] (03PS6) 10Andrew Bogott: Remove a couple of unused settings from ldap config [puppet] - 10https://gerrit.wikimedia.org/r/288536 [20:57:39] o/ I'd like to run a big, batch job to extract 2 Million original images from commons. [20:58:16] Running these requests in a single thread will take a very long time, so I'd like to see if it is possible to run on parallel requests for ~a week [20:58:48] Should I just set an easily identifiable user-agent and proceed or is there some other process that I should follow? [20:59:40] FWIW, I'm very familiar with setting up a process pool to limit the number of parallel requests. 
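(Editor's note: the approach halfak describes above, a bounded pool of workers that each send an easily identifiable User-Agent, could be sketched roughly as follows. This is only an illustration: the User-Agent string, the contact address, and the function names are hypothetical, and a thread pool is used here in place of the process pool mentioned in the chat, since downloading is I/O-bound.)

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical values: pick a tool name and a reachable contact address of your own.
USER_AGENT = "commons-originals-batch/0.1 (contact: someone@example.org)"
MAX_PARALLEL = 4  # upper bound on simultaneous requests


def build_request(url, user_agent=USER_AGENT):
    """Build a request that carries the identifying User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=60):
    """Download one URL and return the response body as bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()


def fetch_all(urls, fetch_one=fetch, max_parallel=MAX_PARALLEL):
    """Fetch every URL with at most max_parallel requests in flight.

    Returns a list of (url, result) pairs in input order; when a download
    fails, the result is the exception instead of the body, so one bad URL
    does not abort the whole batch.
    """
    def safe(url):
        try:
            return url, fetch_one(url)
        except Exception as exc:  # keep going, record the failure
            return url, exc

    # The pool size is what caps the parallelism.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(safe, urls))
```

(Passing the download function in as fetch_one keeps the pool logic testable without touching the network. For a list of millions of URLs it would probably make sense to feed the pool in chunks and write results to disk as they arrive, rather than holding everything in one list.)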
[21:01:51] (03CR) 10Andrew Bogott: [C: 032] Remove a couple of unused settings from ldap config [puppet] - 10https://gerrit.wikimedia.org/r/288536 (owner: 10Andrew Bogott) [21:04:12] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 06Labs, 10hardware-requests: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2299063 (10RobH) Please note that this allocation, via procurement task T131871, has been approved. Two of our spare systems wmf4657 & wmf465... [21:05:11] halfak: what is the latency of the end points you will hit? [21:05:32] gwicke, not sure. I haven't attempted this yet. Still looking into options. I can run some tests. [21:06:03] the latency should give you a good idea of the costs [21:06:28] as well as the needed parallelism to achieve total time X [21:07:00] gwicke, certainly. [21:07:14] definitely +1 on a unique UA, ideally a mail address [21:07:35] I assume you mean email. [21:08:15] yeah, "Aaron Halfaker, Minneapolis" might not be so useful ;) [21:10:00] (it would probably work, but if everybody did that...) [21:14:22] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 06Labs, 10hardware-requests: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2299085 (10EBernhardson) I'm thinking it will be simpler to give them a service cluster name, makes things easier to remember. If there are no... [21:15:23] My back of the napkin suggests that I can get about 1 image per second which puts the process at something like 24 days [21:15:55] We could tolerate running 4 requests in parallel and that would result in ~ 6 days of downloading. [21:17:21] 06Operations, 06Labs, 10Tool-Labs, 13Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2299093 (10Dzahn) >>! In T134798#2294608, @yuvipanda wrote: > It's in relic.toolserver-legacy.eqiad.wmflabs I can't logi... 
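(Editor's note: the back-of-the-napkin numbers above, and gwicke's point that latency determines the parallelism needed to finish in a given time, can be checked with a few lines of arithmetic. The function names are hypothetical, and the sketch assumes throughput scales linearly with the number of parallel requests.)

```python
import math

SECONDS_PER_DAY = 86_400


def estimated_days(n_items, seconds_per_item, parallelism=1):
    """Days needed to fetch n_items, assuming perfectly linear scaling."""
    return n_items * seconds_per_item / parallelism / SECONDS_PER_DAY


def parallelism_needed(n_items, seconds_per_item, target_days):
    """Smallest number of parallel requests that finishes within target_days."""
    return math.ceil(n_items * seconds_per_item / (target_days * SECONDS_PER_DAY))


# 2 million originals at roughly 1 second each:
single = estimated_days(2_000_000, 1.0)     # about 23.1 days single-threaded
four = estimated_days(2_000_000, 1.0, 4)    # about 5.8 days with 4 in parallel
week = parallelism_needed(2_000_000, 1.0, 7)  # 4 parallel requests for "~a week"
```

(These figures line up with the "something like 24 days" and "~ 6 days" quoted in the chat. The 1 second per image is halfak's own rough estimate, not a measured latency.)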
[21:20:45] 06Operations, 06Labs, 10Tool-Labs, 13Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2299101 (10Dzahn) please disregard. pebcak. using bastion.wmflabs and not bastion-restricted.wmflabs was the issue [21:25:36] (03PS4) 10Dzahn: add Letsencrypt cert/config for (www.)toolserver.org [puppet] - 10https://gerrit.wikimedia.org/r/288708 (https://phabricator.wikimedia.org/T134798) [21:26:42] halfak: afaik, original loading is primarily using IO bandwidth on the swift cluster, but a parallelism of 4 isn't very high [21:26:48] (03PS6) 10Ottomata: Druid module and analytics_cluster role class [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) [21:26:49] I'd be surprised if that caused any issues [21:26:53] (03CR) 10Ottomata: Druid module and analytics_cluster role class (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) (owner: 10Ottomata) [21:26:58] (03PS1) 10BBlack: varnishxcps: prevent junk injection [puppet] - 10https://gerrit.wikimedia.org/r/289067 (https://phabricator.wikimedia.org/T135227) [21:27:26] gwicke, advising I give it a try with a good User-Agent? [21:27:47] (03PS7) 10Ottomata: Druid module and analytics_cluster role class [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) [21:27:53] (03CR) 10jenkins-bot: [V: 04-1] Druid module and analytics_cluster role class [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) (owner: 10Ottomata) [21:28:01] halfak: yeah, I think that's totally fine [21:28:14] (03PS1) 10Andrew Bogott: /srv/glance should be owned by glance. 
[puppet] - 10https://gerrit.wikimedia.org/r/289069 [21:28:22] (03CR) 10Dzahn: [C: 032] add Letsencrypt cert/config for (www.)toolserver.org [puppet] - 10https://gerrit.wikimedia.org/r/288708 (https://phabricator.wikimedia.org/T134798) (owner: 10Dzahn) [21:28:34] great. Will give it a shot. [21:28:40] halfak: gwicke actually, looking at https://grafana.wikimedia.org/dashboard/db/varnish-http-requests it looks like we are doing a few million per minute on the upload cache [21:28:51] so 4 parallel seems ok [21:28:54] heh. drop in the bucket then [21:28:58] :) [21:28:59] (03PS2) 10Andrew Bogott: /srv/glance should be owned by glance. [puppet] - 10https://gerrit.wikimedia.org/r/289069 [21:29:11] that's apples / oranges, as that's almost all thumbs [21:29:17] (03CR) 10jenkins-bot: [V: 04-1] Druid module and analytics_cluster role class [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) (owner: 10Ottomata) [21:29:25] but still, four cache misses for originals isn't the end of the world [21:29:30] hmm that's true [21:29:33] I wonder if there are swift stats [21:29:53] https://grafana.wikimedia.org/dashboard/file/swift.json [21:29:55] doesn't hav requests [21:30:28] gwicke: halfak actually, I see only under 1 req/s there [21:30:46] (03CR) 10Andrew Bogott: [C: 032] /srv/glance should be owned by glance. 
[puppet] - 10https://gerrit.wikimedia.org/r/289069 (owner: 10Andrew Bogott) [21:31:00] maybe not what I think it is [21:31:06] ~500mbps network out [21:31:24] heh [21:31:26] yeah [21:31:30] (03PS2) 10BBlack: VCL: No X-Cache for PURGE in Varnish3 [puppet] - 10https://gerrit.wikimedia.org/r/289018 [21:31:32] (03PS2) 10BBlack: VCL: X-Cache simplification [puppet] - 10https://gerrit.wikimedia.org/r/289015 [21:31:34] (03PS1) 10BBlack: [WIP] varnishxcache [puppet] - 10https://gerrit.wikimedia.org/r/289071 [21:35:58] (03PS1) 10Dzahn: toolserver.org: adjust puppet dependencies for cert change [puppet] - 10https://gerrit.wikimedia.org/r/289072 (https://phabricator.wikimedia.org/T134798) [21:36:32] (03PS8) 10Ottomata: Druid module and analytics_cluster role class [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) [21:36:59] (03CR) 10jenkins-bot: [V: 04-1] toolserver.org: adjust puppet dependencies for cert change [puppet] - 10https://gerrit.wikimedia.org/r/289072 (https://phabricator.wikimedia.org/T134798) (owner: 10Dzahn) [21:37:47] (03CR) 10jenkins-bot: [V: 04-1] Druid module and analytics_cluster role class [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) (owner: 10Ottomata) [21:38:28] (03PS2) 10Dzahn: toolserver.org: adjust puppet dependencies for cert change [puppet] - 10https://gerrit.wikimedia.org/r/289072 (https://phabricator.wikimedia.org/T134798) [21:40:06] (03CR) 10Dzahn: [C: 032] toolserver.org: adjust puppet dependencies for cert change [puppet] - 10https://gerrit.wikimedia.org/r/289072 (https://phabricator.wikimedia.org/T134798) (owner: 10Dzahn) [21:40:50] (03PS9) 10Ottomata: Druid module and analytics_cluster role class [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) [21:43:07] (03PS2) 10BBlack: [WIP] varnishxcache [puppet] - 10https://gerrit.wikimedia.org/r/289071 [21:45:42] (03CR) 10Ottomata: "I made a few changes since 
your last review. Notably, I've made the main druid class smarter about defaults for things like MySQL metadata" [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) (owner: 10Ottomata) [21:46:34] (03PS1) 10Dzahn: toolserver.org: break puppet dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/289074 (https://phabricator.wikimedia.org/T134798) [21:46:38] meeeep "Error: Could not apply complete catalog: Found 1 dependency cycle:" :p [21:46:47] (03CR) 10Ottomata: "Oh, BTW, the Cloudera Trusty packages installed just fine in Jessie!" [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) (owner: 10Ottomata) [21:47:35] (03CR) 10Dzahn: [C: 032] toolserver.org: break puppet dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/289074 (https://phabricator.wikimedia.org/T134798) (owner: 10Dzahn) [21:49:52] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2016-08-14 21:48:59 +0000 (expires in 89 days) [21:53:46] (03PS1) 10Andrew Bogott: Exchange the addresses of labs-recursor0 and labs-recursor1 [dns] - 10https://gerrit.wikimedia.org/r/289080 [21:53:51] (03PS1) 10Andrew Bogott: Exchange labs-recursor0 and labs-recursor1 [puppet] - 10https://gerrit.wikimedia.org/r/289081 [21:56:31] 06Operations, 06Labs, 10Labs-Infrastructure: Get labs-ns0, labs-recursor0, and labservices1001 on the same system, and labs-ns1, labs-recursor1, and holmium on another - https://phabricator.wikimedia.org/T135447#2299210 (10Andrew) [21:56:47] (03PS2) 10Andrew Bogott: Exchange labs-recursor0 and labs-recursor1 [puppet] - 10https://gerrit.wikimedia.org/r/289081 (https://phabricator.wikimedia.org/T135447) [21:56:53] (03PS2) 10Andrew Bogott: Exchange the addresses of labs-recursor0 and labs-recursor1 [dns] - 10https://gerrit.wikimedia.org/r/289080 (https://phabricator.wikimedia.org/T135447) [21:58:56] fwiw, toolserver.org gets a ton of hits for osm 
tiles, i suppose that should be changed somewhere to wmflabs.org [21:59:54] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Get labs-ns0, labs-recursor0, and labservices1001 on the same system, and labs-ns1, labs-recursor1, and holmium on another - https://phabricator.wikimedia.org/T135447#2299236 (10Andrew) Deployment plan: 1) disable puppet on holmium and on la... [22:01:01] (03CR) 10Andrew Bogott: "Note that this needs to be deployed in a staged approach, as detailed in the associated bug." [puppet] - 10https://gerrit.wikimedia.org/r/289081 (https://phabricator.wikimedia.org/T135447) (owner: 10Andrew Bogott) [22:01:12] (03CR) 10Andrew Bogott: "Note that this needs to be deployed in a staged approach, as detailed in the associated bug." [dns] - 10https://gerrit.wikimedia.org/r/289080 (https://phabricator.wikimedia.org/T135447) (owner: 10Andrew Bogott) [22:07:12] PROBLEM - Disk space on elastic1022 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80345 MB (15% inode=99%) [22:09:12] (03PS1) 10GWicke: Forward x-client-ip & user-agent to AQS [puppet] - 10https://gerrit.wikimedia.org/r/289092 [22:14:47] (03PS1) 10Dzahn: toolserver: incl acme-challenge in :443 virtual host [puppet] - 10https://gerrit.wikimedia.org/r/289093 (https://phabricator.wikimedia.org/T134798) [22:15:48] (03PS2) 10Dzahn: toolserver: incl acme-challenge in :443 virtual host [puppet] - 10https://gerrit.wikimedia.org/r/289093 (https://phabricator.wikimedia.org/T134798) [22:16:17] (03CR) 10Dzahn: [C: 032] toolserver: incl acme-challenge in :443 virtual host [puppet] - 10https://gerrit.wikimedia.org/r/289093 (https://phabricator.wikimedia.org/T134798) (owner: 10Dzahn) [22:27:42] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [22:29:20] 06Operations, 06Labs, 10Tool-Labs, 13Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2299300 (10Dzahn) 
{F4021144} [22:33:24] 06Operations, 06Labs, 10Tool-Labs, 13Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2299314 (10Dzahn) works now after the fixes above :) all details were in P3110 [22:33:37] 06Operations, 06Labs, 10Tool-Labs, 13Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2299315 (10Dzahn) 05Open>03Resolved [22:36:10] fwiw, toolserver.org gets a ton of hits for osm tiles, i suppose that should be changed somewhere to wmflabs.org p858snake|L_: seems that would be a duplicate of https://phabricator.wikimedia.org/T103272 [22:44:12] and the bug is in a 3rd party app that wont read our bugs [22:46:07] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2299349 (10ssastry) Separately, we should check if puppet has hardcoded references to upstart + whether we need to make any updates to production puppet code wrt service restarts. @mo... [22:49:34] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:00:04] RoanKattouw ostriches Krenair Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160516T2300). [23:00:04] MatmaRex ebernhardson RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:10] \o [23:00:22] hi [23:00:24] Hey [23:00:43] I can do today's SWAT [23:01:00] for mine config goes first, then the two wikimedia events patches [23:03:09] OK [23:03:14] MatmaRex: Any order dependency for yours? 
[23:04:16] RoanKattouw: no, and they're both no-ops [23:04:44] (03CR) 10Catrope: [C: 032] Revert "Revert "A/B/C test of control vs textcat vs accept-lang + textcat"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288546 (owner: 10EBernhardson) [23:04:47] OK [23:05:31] so many reverts heh :D [23:05:43] that's a standard re-deploy :P [23:06:00] (03PS4) 10Catrope: Revert "Revert "A/B/C test of control vs textcat vs accept-lang + textcat"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288546 (owner: 10EBernhardson) [23:06:06] (03CR) 10Catrope: [C: 032] Revert "Revert "A/B/C test of control vs textcat vs accept-lang + textcat"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288546 (owner: 10EBernhardson) [23:06:09] AARGH ff-only [23:07:14] (03Merged) 10jenkins-bot: Revert "Revert "A/B/C test of control vs textcat vs accept-lang + textcat"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288546 (owner: 10EBernhardson) [23:08:59] !log catrope@tin Synchronized wmf-config/CirrusSearch-common.php: SWAT (duration: 00m 25s) [23:09:03] (03CR) 10Catrope: [C: 032] Prepare Commons configuration for $wgUploadDialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287959 (https://phabricator.wikimedia.org/T134775) (owner: 10Bartosz Dziewoński) [23:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:09:36] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: SWAT (duration: 00m 36s) [23:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:09:48] ebernhardson: That's your config change done [23:10:12] RoanKattouw: looks to be working [23:10:15] Cool [23:10:33] (03PS3) 10Catrope: Prepare Commons configuration for $wgUploadDialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287959 (https://phabricator.wikimedia.org/T134775) (owner: 10Bartosz Dziewoński) [23:10:38] (03CR) 10Catrope: [C: 032] Prepare Commons configuration for $wgUploadDialog 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/287959 (https://phabricator.wikimedia.org/T134775) (owner: 10Bartosz Dziewoński) [23:12:02] (03Merged) 10jenkins-bot: Prepare Commons configuration for $wgUploadDialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287959 (https://phabricator.wikimedia.org/T134775) (owner: 10Bartosz Dziewoński) [23:13:23] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: SWAT (duration: 00m 25s) [23:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:13:48] !log catrope@tin Synchronized wmf-config/CommonSettings.php: SWAT (duration: 00m 25s) [23:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:14:04] RoanKattouw: sigh, it looks like i need to prep another patch that will require message sync out though ... someone changed WikimediaMessages but didn't do everything correctly... [23:14:13] OK [23:14:22] I have to go soon so would you mind doing that yourself after I'm done? [23:14:30] MatmaRex: Your config change is out [23:14:34] deployed, I mean [23:14:36] RoanKattouw: yea i can [23:14:47] * ebernhardson glars at James_F [23:14:54] RoanKattouw: thanks. it is a no-op right now, the code using it will be in next wmf.N [23:15:01] OK [23:17:27] UW and WE changes going out now [23:17:44] !log catrope@tin Synchronized php-1.28.0-wmf.1/extensions/UploadWizard: SWAT (duration: 00m 28s) [23:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:18:11] !log catrope@tin Synchronized php-1.28.0-wmf.1/extensions/WikimediaEvents: SWAT (duration: 00m 26s) [23:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:18:31] RoanKattouw: thanks. that also should've been a no-op, because i worked around that problem on-wiki on Friday. 
but i'll clean up the hack now :) [23:18:36] Cool [23:18:46] ebernhardson: And that ---^^ is your WikimediaEvents stuff [23:18:54] Now waiting for my Echo change to clear Jenkins and then I'll be done [23:19:05] RoanKattouw: sweet, thanks. i'll watch the kafka channels to see if it's coming in right [23:26:27] !log catrope@tin Synchronized php-1.28.0-wmf.1/extensions/Echo/includes/EmailFormatter.php: Fix unsubstituted message in emails (duration: 00m 25s) [23:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:29:11] (03CR) 10GWicke: "Depends on https://github.com/wikimedia/hyperswitch/pull/41." [puppet] - 10https://gerrit.wikimedia.org/r/289092 (owner: 10GWicke) [23:30:14] ebernhardson: OK I'm all done, feel free to do your scap thing [23:30:18] RoanKattouw: ok thanks [23:35:35] (03PS1) 10Dzahn: planet: node regex to cover 2001 in codfw as well [puppet] - 10https://gerrit.wikimedia.org/r/289108 (https://phabricator.wikimedia.org/T134507) [23:35:58] (03PS2) 10Dzahn: planet: node regex to cover 2001 in codfw as well [puppet] - 10https://gerrit.wikimedia.org/r/289108 (https://phabricator.wikimedia.org/T134507) [23:37:30] !log ebernhardson@tin Synchronized php-1.28.0-wmf.1/extensions/WikimediaMessages/WikimediaMessages.php: re-add interwiki search result messages (duration: 00m 25s) [23:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:38:45] !log ebernhardson@tin scap sync-l10n completed (1.28.0-wmf.1) (duration: 00m 34s) [23:38:45] (03PS3) 10Dzahn: planet: node regex to cover 2001 in codfw as well [puppet] - 10https://gerrit.wikimedia.org/r/289108 (https://phabricator.wikimedia.org/T134507) [23:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:40:14] (03PS1) 10Bartosz Dziewoński: Final Commons configuration for $wgUploadDialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289109 (https://phabricator.wikimedia.org/T134775) 
[23:40:54] * ebernhardson was hopefull sync-l10n would do the trick...but no such luck :( [23:41:20] (03CR) 10Bartosz Dziewoński: [C: 04-1] "Do not merge yet." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289109 (https://phabricator.wikimedia.org/T134775) (owner: 10Bartosz Dziewoński) [23:42:04] PROBLEM - Disk space on elastic1022 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80700 MB (15% inode=99%) [23:42:39] (03PS1) 10Dzahn: add planet2001.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/289110 (https://phabricator.wikimedia.org/T134507) [23:42:44] !log ebernhardson@tin Started scap: Full scap to sync out WikimediaMessages update [23:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:46:02] PROBLEM - Disk space on elastic1022 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80486 MB (15% inode=99%) [23:49:54] (03PS2) 10Dzahn: add planet2001.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/289110 (https://phabricator.wikimedia.org/T134507) [23:50:02] (03CR) 10Dzahn: [C: 032] add planet2001.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/289110 (https://phabricator.wikimedia.org/T134507) (owner: 10Dzahn)