[00:53:24] (03PS1) 10Ori.livneh: Add explanatory comments to CORS-related VCL for upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/294016 [00:53:27] (03PS1) 10Ori.livneh: Drop vestige of SPDY support detection from VCL [puppet] - 10https://gerrit.wikimedia.org/r/294017 [00:53:28] (03PS1) 10Ori.livneh: Make upload.wikimedia.org cookie-free [puppet] - 10https://gerrit.wikimedia.org/r/294018 (https://phabricator.wikimedia.org/T137609) [01:00:09] (03CR) 10Ori.livneh: [C: 031] Allow float result for int/int division in gmond's memcached module. [puppet] - 10https://gerrit.wikimedia.org/r/290933 (owner: 10Elukey) [01:50:36] PROBLEM - puppet last run on graphite2002 is CRITICAL: CRITICAL: puppet fail [02:16:53] RECOVERY - puppet last run on graphite2002 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [02:17:32] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2375204 (10Danny_B) [02:28:15] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.5) (duration: 13m 02s) [02:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:01:39] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2375218 (10Danny_B) Gitblit `/raw/...` paths are not functional in repos containing slash (ie. https://git.wikimedia.... [03:12:12] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2375236 (10Danny_B) How to deal with `/zip/` action? AFAICS Diffusion links all zip and gz files from Github, bzip2... 
[03:16:05] mutante|away: please check my last two posts in T137224, thanks [03:16:06] T137224: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224 [03:55:58] (03CR) 10BBlack: [C: 031] Add explanatory comments to CORS-related VCL for upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/294016 (owner: 10Ori.livneh) [03:56:33] (03CR) 10BBlack: [C: 031] Drop vestige of SPDY support detection from VCL [puppet] - 10https://gerrit.wikimedia.org/r/294017 (owner: 10Ori.livneh) [03:57:19] (03CR) 10BBlack: [C: 031] Make upload.wikimedia.org cookie-free [puppet] - 10https://gerrit.wikimedia.org/r/294018 (https://phabricator.wikimedia.org/T137609) (owner: 10Ori.livneh) [04:46:12] (03PS1) 10KartikMistry: aperium-apy: New upstream release [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/294020 (https://phabricator.wikimedia.org/T107306) [05:08:35] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.821 second response time [05:10:44] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 6.592 second response time [05:19:35] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2375269 (10KartikMistry) [05:39:15] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 11.172 second response time [05:41:14] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 7.411 second response time [05:44:42] (03PS1) 10Giuseppe Lavagetto: mediawiki: correctly assign the new codfw appservers in puppet [puppet] 
- 10https://gerrit.wikimedia.org/r/294023 (https://phabricator.wikimedia.org/T135466) [05:50:04] PROBLEM - HHVM rendering on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:50:25] PROBLEM - Apache HTTP on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:50:25] PROBLEM - configured eth on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:50:45] PROBLEM - puppet last run on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:50:45] PROBLEM - Check size of conntrack table on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:50:55] PROBLEM - SSH on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:51:05] PROBLEM - salt-minion processes on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:51:06] PROBLEM - DPKG on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:51:15] PROBLEM - nutcracker process on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:54:15] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.693 second response time [05:54:24] RECOVERY - configured eth on mw1131 is OK: OK - interfaces up [05:54:35] RECOVERY - Check size of conntrack table on mw1131 is OK: OK: nf_conntrack is 0 % full [05:54:45] RECOVERY - SSH on mw1131 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [05:54:55] RECOVERY - salt-minion processes on mw1131 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [05:55:04] RECOVERY - DPKG on mw1131 is OK: All packages OK [05:55:05] RECOVERY - nutcracker process on mw1131 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:55:55] RECOVERY - HHVM rendering on mw1131 is OK: HTTP OK: HTTP/1.1 200 OK - 72950 bytes in 0.285 second response time [05:56:35] RECOVERY - puppet last run on mw1131 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:25:55] (03PS2) 10Muehlenhoff: Enable base::firewall for osmium [puppet] - 10https://gerrit.wikimedia.org/r/293717 [06:30:05] PROBLEM - puppet last run on mw1199 is CRITICAL: CRITICAL: puppet fail [06:30:41] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2375299 (10Paladox) @Danny_B we added support for downloading zip and gz files in diffusion. It downloads from GitHu... 
[06:30:55] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:45] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:46] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:05] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:24] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:47] !log oblivian@palladium conftool action : set/pooled=yes; selector: name=mw126.* [06:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:32:53] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:53] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:14] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable base::firewall for osmium [puppet] - 10https://gerrit.wikimedia.org/r/293717 (owner: 10Muehlenhoff) [06:36:22] PROBLEM - Check size of conntrack table on mw1291 is CRITICAL: Timeout while attempting connection [06:36:51] PROBLEM - DPKG on mw1291 is CRITICAL: Timeout while attempting connection [06:37:03] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Not Available - 530 bytes in 0.047 second response time [06:37:11] PROBLEM - Disk space on mw1291 is CRITICAL: Timeout while attempting connection [06:37:32] PROBLEM - MD RAID on mw1291 is CRITICAL: Timeout while attempting connection [06:37:52] <_joe_> this is me ^^ [06:38:00] <_joe_> I am reimaging the server [06:38:21] PROBLEM - Apache HTTP on mw1291 is CRITICAL: Connection timed out [06:38:21] PROBLEM - configured eth on mw1291 is CRITICAL: Timeout while attempting connection [06:38:41] PROBLEM - dhclient process on mw1291 is CRITICAL: Timeout while attempting connection [06:38:42] PROBLEM - mediawiki-installation DSH group on mw1291 is CRITICAL: Host mw1291 is not in 
mediawiki-installation dsh group [06:39:11] PROBLEM - nutcracker port on mw1291 is CRITICAL: Timeout while attempting connection [06:39:31] PROBLEM - nutcracker process on mw1291 is CRITICAL: Timeout while attempting connection [06:39:51] PROBLEM - puppet last run on mw1291 is CRITICAL: Timeout while attempting connection [06:40:01] PROBLEM - salt-minion processes on mw1291 is CRITICAL: Timeout while attempting connection [06:42:09] (03PS1) 10Ladsgroup: Add ORES to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294027 (https://phabricator.wikimedia.org/T120923) [06:44:41] 06Operations, 06Discovery, 06Maps, 03Discovery-Maps-Sprint: Enable specs on Katotherian service - https://phabricator.wikimedia.org/T137617#2375305 (10Gehel) a:03Gehel [06:47:17] (03PS1) 10Gehel: Enable 'has_spec' on Kartotherian service. [puppet] - 10https://gerrit.wikimedia.org/r/294028 (https://phabricator.wikimedia.org/T137617) [06:55:32] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:55:42] RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:56:22] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:52] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:57:01] RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:57:02] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3669 bytes in 0.036 second response time [06:57:32] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:42] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:43] RECOVERY - 
puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:28] (03PS4) 10WMDE-Fisch: Load the RevisionSlider extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287936 (https://phabricator.wikimedia.org/T134770) (owner: 10Addshore) [07:00:21] RECOVERY - Apache HTTP on mw1291 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.004 second response time [07:00:42] RECOVERY - dhclient process on mw1291 is OK: PROCS OK: 0 processes with command name dhclient [07:01:11] PROBLEM - puppet last run on mw1090 is CRITICAL: CRITICAL: Puppet has 1 failures [07:01:21] RECOVERY - Disk space on mw1291 is OK: DISK OK [07:01:22] RECOVERY - nutcracker port on mw1291 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [07:01:33] (03CR) 10WMDE-Fisch: "PS4 is a manual rebase" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287936 (https://phabricator.wikimedia.org/T134770) (owner: 10Addshore) [07:01:42] RECOVERY - nutcracker process on mw1291 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [07:01:43] RECOVERY - MD RAID on mw1291 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [07:02:31] RECOVERY - salt-minion processes on mw1291 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:04:02] RECOVERY - Check size of conntrack table on mw1291 is OK: OK: nf_conntrack is 0 % full [07:04:21] RECOVERY - DPKG on mw1291 is OK: All packages OK [07:05:02] RECOVERY - configured eth on mw1291 is OK: OK - interfaces up [07:05:42] RECOVERY - Disk space on lithium is OK: DISK OK [07:07:23] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 11.109 second response time [07:09:22] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 7.629 second response time 
[07:09:31] PROBLEM - puppet last run on mw1291 is CRITICAL: CRITICAL: puppet fail [07:14:32] PROBLEM - NTP on mw1291 is CRITICAL: NTP CRITICAL: Offset unknown [07:16:32] RECOVERY - NTP on mw1291 is OK: NTP OK: Offset 0.00417637825 secs [07:19:42] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 4 failures [07:25:22] RECOVERY - puppet last run on mw1090 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [07:25:52] PROBLEM - puppet last run on mw1138 is CRITICAL: CRITICAL: Puppet has 2 failures [07:29:41] PROBLEM - Apache HTTP on mw1291 is CRITICAL: Connection refused [07:34:36] RECOVERY - puppet last run on mw1291 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [07:34:56] RECOVERY - Apache HTTP on mw1291 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.816 second response time [07:46:38] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:48:55] <_joe_> mw1291 is the new jessie imagescaler I just created [08:00:27] !log oblivian@palladium conftool action : set/pooled=no; selector: name=mw1261.eqiad.wmnet [08:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:00:47] !log oblivian@palladium conftool action : set/pooled=no:weight=20; selector: name=mw1261.eqiad.wmnet [08:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:00:59] !log oblivian@palladium conftool action : set/pooled=no:weight=20; selector: name=mw1262.eqiad.wmnet [08:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:17:09] !log oblivian@palladium conftool action : set/pooled=yes; selector: name=mw1261.eqiad.wmnet [08:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:24:30] (03PS1) 10Jcrespo: Increase weight of new s5 database servers [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/294029 (https://phabricator.wikimedia.org/T133398) [08:25:28] !log oblivian@palladium conftool action : set/weight=30; selector: name=mw1261.eqiad.wmnet [08:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:25:35] (03CR) 10Bmansurov: [C: 031] "Also, we may want to be explicit about "MFDisplayWikibaseDescription" in the "-labs.php" file, rather than inheriting it from the base con" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293883 (https://phabricator.wikimedia.org/T127250) (owner: 10Jhobs) [08:26:50] (03CR) 10Jcrespo: [C: 032] Increase weight of new s5 database servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294029 (https://phabricator.wikimedia.org/T133398) (owner: 10Jcrespo) [08:31:01] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Increase weight of db1082, db1087, db1092 (duration: 02m 36s) [08:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:31:21] "ssh: connect to host mw1155.eqiad.wmnet port 22: Connection timed out" [08:32:06] I do not see it down on icinga [08:33:15] (03PS3) 10Muehlenhoff: Stop installing PHP on jessie app servers [puppet] - 10https://gerrit.wikimedia.org/r/291909 [08:33:30] <_joe_> jynus: OOM I guess? [08:33:40] <_joe_> I'll look when I am back [08:33:54] is that really an app? 
[08:34:24] oh, down for 2 days [08:34:48] the entire host [08:39:03] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [08:41:40] 06Operations, 10Wikimedia-Apache-configuration, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2375417 (10Joe) The bug I mentioned earlier was for the `AH01070` and has been... [08:42:35] <_joe_> moritzm: I added you to this bug ^^, it was the tracking bug for the issues at the time [08:42:46] k, thanks [08:42:52] that's me (for changeprop) ^ [08:43:03] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [08:43:55] !log powercycling mw1155.eqiad.wmnet , unresponsive on ssh, serial console [08:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:46:52] RECOVERY - Host mw1155 is UP: PING OK - Packet loss = 0%, RTA = 1.30 ms [08:47:13] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.129 second response time [08:49:53] RECOVERY - HHVM rendering on mw1155 is OK: HTTP OK: HTTP/1.1 200 OK - 72931 bytes in 1.895 second response time [08:50:23] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [08:51:44] !log removed /var/log/logstash/logstash.log.1 on logstash1001, depleted disk space on the root partition, fallout of T137400 [08:51:45] T137400: Logstash elasticsearch mapping does not allow err.code to be a string - https://phabricator.wikimedia.org/T137400 [08:51:47] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:52:24] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 4 failures [08:53:24] RECOVERY - DPKG on logstash1001 is OK: All packages OK [08:53:33] RECOVERY - Disk space on logstash1001 is OK: DISK OK [08:56:44] (03CR) 10Alexandros Kosiaris: [C: 031] "I was given a good answer inline, +1." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/282160 (owner: 10Muehlenhoff) [08:59:04] (03PS4) 10Jcrespo: Enable two-factor authentication in sshd [puppet] - 10https://gerrit.wikimedia.org/r/282160 (owner: 10Muehlenhoff) [09:00:48] (03CR) 10Alexandros Kosiaris: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/293694 (owner: 10Alexandros Kosiaris) [09:01:23] (03CR) 10Alexandros Kosiaris: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [09:02:47] Is it just me, or is anyone else getting a 503 when trying to put a patch up for review in Differential: [09:02:54] https://usercontent.irccloud-cdn.com/file/Ib3QjpvQ/ [09:05:15] Amir1, are you trying to send a CR to phabricator? [09:09:02] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [09:09:52] jynus: yup [09:10:00] "arc diff" [09:10:37] Amir1, according to https://wikitech.wikimedia.org/wiki/Help:Git we use gerrit (https://gerrit.wikimedia.org) for that [09:11:36] jynus: but for some projects it's phab, in this case scap3 uses phab [09:11:50] I did that three or four times before [09:12:50] RECOVERY - puppet last run on logstash1001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [09:14:16] https://github.com/wikimedia/scap [09:14:51] an example that got landed (=merged) https://phabricator.wikimedia.org/D212 [09:22:30] (03CR) 10JanZerebecki: [C: 04-1] "No that is not what I meant, it is the same package AFAIK. 
If you declare a resource twice on the same host puppet will throw an error. Ot" [puppet] - 10https://gerrit.wikimedia.org/r/293743 (owner: 10Jcrespo) [09:33:48] PROBLEM - puppet last run on mw1257 is CRITICAL: CRITICAL: Puppet has 1 failures [09:39:20] (03PS1) 10Jcrespo: Depool db1052 for cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294035 (https://phabricator.wikimedia.org/T133398) [09:41:57] (03CR) 10Jcrespo: [C: 032] Depool db1052 for cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294035 (https://phabricator.wikimedia.org/T133398) (owner: 10Jcrespo) [09:42:03] <_joe_> Amir1: I can take a look at the logs, but it's not exactly something we are supporting right now [09:42:10] <_joe_> (we == techops) [09:42:19] (03PS3) 10Alexandros Kosiaris: service::uwsgi: Ensure config directory exists [puppet] - 10https://gerrit.wikimedia.org/r/293694 [09:42:21] (03CR) 10Muehlenhoff: [C: 032 V: 032] Stop installing PHP on jessie app servers [puppet] - 10https://gerrit.wikimedia.org/r/291909 (owner: 10Muehlenhoff) [09:43:44] _joe_: thanks :) [09:43:48] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1052 for cloning (duration: 00m 26s) [09:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:44:28] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] service::uwsgi: Ensure config directory exists [puppet] - 10https://gerrit.wikimedia.org/r/293694 (owner: 10Alexandros Kosiaris) [09:44:33] <_joe_> Amir1: can you try again? [09:44:34] (03PS4) 10Alexandros Kosiaris: service::uwsgi: Ensure config directory exists [puppet] - 10https://gerrit.wikimedia.org/r/293694 [09:44:43] yeah, sure [09:45:24] <_joe_> I'm not even sure I am looking at the right logs :( [09:45:29] _joe_: https://phabricator.wikimedia.org/D262 [09:45:34] it worked now [09:45:35] thanks [09:45:49] clearly, _joe_ fixed it [09:46:20] of course, that's what I'm implying. 
Sorry if it's not clear enough :) [09:46:46] that is not what I was implying [09:47:50] <_joe_> Amir1: I did exactly nothing :P [09:47:59] <_joe_> I just stared at the logs [09:48:10] I think you may still want to report your problem to releng [09:48:26] <_joe_> it's like when you take the car to the mechanic and suddenly that strange noise goes away [09:48:34] as probably that is in "beta status" [09:48:36] jynus: yeah, let me do that later (most of them based in the u.s.) [09:48:36] <_joe_> and yes, what jynus said [09:49:22] https://www.reddit.com/r/Jokes/comments/2wf5ge/a_mechanical_engineer_electrical_engineer/ [09:49:53] _joe_: ^ [09:50:29] PROBLEM - puppet last run on mw1263 is CRITICAL: CRITICAL: puppet fail [09:51:39] PROBLEM - puppet last run on mw1063 is CRITICAL: CRITICAL: puppet fail [09:52:08] PROBLEM - puppet last run on mw1267 is CRITICAL: CRITICAL: puppet fail [09:52:18] PROBLEM - HHVM rendering on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:53:03] <_joe_> mw1063, uhm [09:53:08] !log stopping db1052 and cloning it to db1080, db1083 and db1089 [09:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:53:41] PROBLEM - Apache HTTP on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:54:00] PROBLEM - Check size of conntrack table on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:54:12] <_joe_> checking mw1138 [09:54:19] PROBLEM - DPKG on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:54:20] PROBLEM - Disk space on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:54:20] PROBLEM - salt-minion processes on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:54:48] PROBLEM - HHVM processes on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:55:29] PROBLEM - SSH on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:55:29] PROBLEM - configured eth on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:55:39] PROBLEM - nutcracker process on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:55:56] <_joe_> !log powercycling mw1138, oom, console non-responsive [09:55:58] PROBLEM - puppet last run on mw1270 is CRITICAL: CRITICAL: puppet fail [09:55:59] PROBLEM - dhclient process on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:57:36] 06Operations, 07HHVM: Issue rotating hhvm logs - https://phabricator.wikimedia.org/T137689#2375578 (10ema) [09:58:16] (03PS1) 10Muehlenhoff: Only install PHP configuration files on trusty [puppet] - 10https://gerrit.wikimedia.org/r/294037 [09:58:20] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: puppet fail [09:58:49] RECOVERY - HHVM processes on mw1138 is OK: PROCS OK: 12 processes with command name hhvm [09:58:58] 06Operations, 07HHVM: Issue rotating hhvm logs - https://phabricator.wikimedia.org/T137689#2375578 (10Joe) @ema yes this problem is present only in the new jessie appservers, I plan to fix it ASAP. 
[09:59:38] RECOVERY - SSH on mw1138 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [09:59:39] RECOVERY - configured eth on mw1138 is OK: OK - interfaces up [09:59:49] RECOVERY - nutcracker process on mw1138 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [09:59:49] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.235 second response time [10:00:08] RECOVERY - dhclient process on mw1138 is OK: PROCS OK: 0 processes with command name dhclient [10:00:18] RECOVERY - Check size of conntrack table on mw1138 is OK: OK: nf_conntrack is 6 % full [10:00:30] RECOVERY - DPKG on mw1138 is OK: All packages OK [10:00:30] RECOVERY - Disk space on mw1138 is OK: DISK OK [10:00:38] RECOVERY - salt-minion processes on mw1138 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:00:39] RECOVERY - HHVM rendering on mw1138 is OK: HTTP OK: HTTP/1.1 200 OK - 72957 bytes in 0.280 second response time [10:02:19] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [10:03:09] RECOVERY - puppet last run on mw1257 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:03:29] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.816 second response time [10:04:48] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/294037 (owner: 10Muehlenhoff) [10:04:58] PROBLEM - puppet last run on mw1261 is CRITICAL: CRITICAL: puppet fail [10:05:38] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 6.387 second response time [10:06:18] PROBLEM - puppet last run on mw1264 is CRITICAL: CRITICAL: puppet fail [10:06:32] (03CR) 10Muehlenhoff: [C: 032 V: 032] Only install PHP 
configuration files on trusty [puppet] - 10https://gerrit.wikimedia.org/r/294037 (owner: 10Muehlenhoff) [10:07:09] PROBLEM - puppet last run on mw1291 is CRITICAL: CRITICAL: puppet fail [10:07:12] (03PS1) 10Urbanecm: Add images.nypl.org to $wgCopyUploadsDomains for commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294039 (https://phabricator.wikimedia.org/T137687) [10:08:48] RECOVERY - puppet last run on mw1063 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [10:09:40] RECOVERY - puppet last run on mw1263 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:11:49] PROBLEM - puppet last run on mw1268 is CRITICAL: CRITICAL: puppet fail [10:14:39] PROBLEM - DPKG on mw1263 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:14:48] <_joe_> moritzm: ^^ [10:14:53] <_joe_> what's up there? [10:15:10] PROBLEM - puppet last run on db2048 is CRITICAL: CRITICAL: Puppet has 1 failures [10:15:59] PROBLEM - DPKG on mw1063 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:17:30] RECOVERY - puppet last run on mw1267 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:18:30] (03PS2) 10Giuseppe Lavagetto: mediawiki: correctly assign the new codfw appservers in puppet [puppet] - 10https://gerrit.wikimedia.org/r/294023 (https://phabricator.wikimedia.org/T135466) [10:19:10] (03PS3) 10Alexandros Kosiaris: [Planet Wikimedia] 5 additions to Italian and English planets [puppet] - 10https://gerrit.wikimedia.org/r/294015 (owner: 10Nemo bis) [10:19:17] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] [Planet Wikimedia] 5 additions to Italian and English planets [puppet] - 10https://gerrit.wikimedia.org/r/294015 (owner: 10Nemo bis) [10:20:00] <_joe_> akosiaris: grrr you merge-sniped me! 
[10:20:34] (03PS3) 10Giuseppe Lavagetto: mediawiki: correctly assign the new codfw appservers in puppet [puppet] - 10https://gerrit.wikimedia.org/r/294023 (https://phabricator.wikimedia.org/T135466) [10:20:51] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: correctly assign the new codfw appservers in puppet [puppet] - 10https://gerrit.wikimedia.org/r/294023 (https://phabricator.wikimedia.org/T135466) (owner: 10Giuseppe Lavagetto) [10:21:01] (03CR) 10Giuseppe Lavagetto: [V: 032] mediawiki: correctly assign the new codfw appservers in puppet [puppet] - 10https://gerrit.wikimedia.org/r/294023 (https://phabricator.wikimedia.org/T135466) (owner: 10Giuseppe Lavagetto) [10:21:15] having a look [10:22:18] <_joe_> mw1063 is an "alias" from mw1263 [10:22:37] manual puppet run went fine [10:22:37] <_joe_> that I am removing as we speak [10:22:54] <_joe_> uhm maybe a conflict with you uninstalling packages? [10:23:29] RECOVERY - puppet last run on mw1270 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:23:56] maybe, I manually pruned some of the php5 zend packages to detect if there's still packages pulling in php5- packages (there's a few) [10:23:58] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [10:27:18] RECOVERY - puppet last run on mw2228 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [10:27:55] (03Restored) 10Hashar: (DO NOT SUBMIT) contint: pin firefox to 46 on Trusty [puppet] - 10https://gerrit.wikimedia.org/r/293739 (https://phabricator.wikimedia.org/T137561) (owner: 10JanZerebecki) [10:28:09] RECOVERY - puppet last run on mw2221 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:29:59] (03PS1) 10Giuseppe Lavagetto: mediawiki: fix codfw appservers declarations [puppet] - 10https://gerrit.wikimedia.org/r/294041 [10:30:18] RECOVERY - puppet last run on mw1261 is OK: OK: Puppet is currently enabled, last 
run 36 seconds ago with 0 failures [10:31:29] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: fix codfw appservers declarations [puppet] - 10https://gerrit.wikimedia.org/r/294041 (owner: 10Giuseppe Lavagetto) [10:31:40] RECOVERY - puppet last run on mw1264 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:31:41] (03CR) 10Giuseppe Lavagetto: [V: 032] mediawiki: fix codfw appservers declarations [puppet] - 10https://gerrit.wikimedia.org/r/294041 (owner: 10Giuseppe Lavagetto) [10:32:23] (03PS2) 10Hashar: (DO NOT SUBMIT) contint: pin firefox to 46 on Trusty [puppet] - 10https://gerrit.wikimedia.org/r/293739 (https://phabricator.wikimedia.org/T137561) (owner: 10JanZerebecki) [10:32:34] (03PS4) 10Hashar: (DO NOT SUBMIT) contint: pin chromium to 49 on Trusty [puppet] - 10https://gerrit.wikimedia.org/r/291116 (https://phabricator.wikimedia.org/T136188) [10:33:22] (03PS3) 10Hashar: (DO NOT SUBMIT) contint: pin firefox to 46 on Trusty [puppet] - 10https://gerrit.wikimedia.org/r/293739 (https://phabricator.wikimedia.org/T137561) (owner: 10JanZerebecki) [10:33:29] RECOVERY - puppet last run on mw1291 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:33:49] RECOVERY - puppet last run on mw2226 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:33:54] (03CR) 10Hashar: "Slight amended and rebased on top of the chromium pin change https://gerrit.wikimedia.org/r/#/c/291116/" [puppet] - 10https://gerrit.wikimedia.org/r/293739 (https://phabricator.wikimedia.org/T137561) (owner: 10JanZerebecki) [10:34:15] hi, I've added a swat to fix a serious regression on the mobileview api that affects iOS apps on todays first swat window, but unfortunately it is patch number 9 [10:35:02] should I remove it? 
i've been asked to swat it asap given how big the regression is [10:35:11] https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160613T1500 [10:37:29] RECOVERY - puppet last run on mw1268 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [10:39:49] PROBLEM - puppet last run on mw2221 is CRITICAL: CRITICAL: Puppet has 1 failures [10:41:29] RECOVERY - puppet last run on db2048 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:46:41] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [10:46:57] 06Operations: Trim down further PHP Zend dependencies on app servers - https://phabricator.wikimedia.org/T137696#2375750 (10MoritzMuehlenhoff) [11:02:23] PROBLEM - puppet last run on mw2228 is CRITICAL: CRITICAL: Puppet has 1 failures [11:02:24] RECOVERY - DPKG on mw1263 is OK: All packages OK [11:04:23] PROBLEM - Apache HTTP on mw2228 is CRITICAL: Connection refused [11:06:04] PROBLEM - Apache HTTP on mw2221 is CRITICAL: Connection refused [11:07:03] PROBLEM - puppet last run on mw2226 is CRITICAL: CRITICAL: Puppet has 1 failures [11:07:53] PROBLEM - mediawiki-installation DSH group on mw2228 is CRITICAL: Host mw2228 is not in mediawiki-installation dsh group [11:09:52] PROBLEM - mediawiki-installation DSH group on mw2221 is CRITICAL: Host mw2221 is not in mediawiki-installation dsh group [11:15:35] <_joe_> uhm checking codfw [11:20:24] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [11:22:33] RECOVERY - Apache HTTP on mw2221 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.128 second response time [11:26:08] 06Operations: Trim down further PHP Zend dependencies on app servers - https://phabricator.wikimedia.org/T137696#2375849 (10MoritzMuehlenhoff) Actually with the current structure of the packages that's not really fixable: php-mail-mime uses some PEAR PHP classes, but in 
Debian these are shipped by the php-pear b... [11:27:27] (03PS1) 10Giuseppe Lavagetto: mediawiki: Add the jessie appservers to dsh [puppet] - 10https://gerrit.wikimedia.org/r/294046 [11:29:11] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: Add the jessie appservers to dsh [puppet] - 10https://gerrit.wikimedia.org/r/294046 (owner: 10Giuseppe Lavagetto) [11:30:59] <_joe_> !log rolling reboot of the new appservers in codfw + scap pull [11:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:33:55] (03PS5) 10Muehlenhoff: Enable two-factor authentication in sshd [puppet] - 10https://gerrit.wikimedia.org/r/282160 [11:34:19] 06Operations: Trim down further PHP Zend dependencies on app servers - https://phabricator.wikimedia.org/T137696#2375858 (10MoritzMuehlenhoff) 05Open>03declined [11:36:00] PROBLEM - Apache HTTP on mw2226 is CRITICAL: Connection refused [11:36:40] RECOVERY - mediawiki-installation DSH group on mw2222 is OK: OK [11:42:40] RECOVERY - mediawiki-installation DSH group on mw2227 is OK: OK [11:43:29] RECOVERY - Apache HTTP on mw2231 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 7.877 second response time [11:43:50] RECOVERY - mediawiki-installation DSH group on mw1291 is OK: OK [11:45:01] RECOVERY - mediawiki-installation DSH group on mw2230 is OK: OK [11:45:21] RECOVERY - mediawiki-installation DSH group on mw2224 is OK: OK [11:45:30] RECOVERY - mediawiki-installation DSH group on mw2232 is OK: OK [11:46:51] RECOVERY - mediawiki-installation DSH group on mw2225 is OK: OK [11:47:40] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:48:40] RECOVERY - mediawiki-installation DSH group on mw2215 is OK: OK [11:50:31] RECOVERY - mediawiki-installation DSH group on mw2216 is OK: OK [11:52:09] RECOVERY - mediawiki-installation DSH group on mw2231 is OK: OK [11:54:15] 06Operations, 10ops-codfw, 10media-storage: rack/setup/deploy 
ms-be202[2-7] - https://phabricator.wikimedia.org/T137439#2375893 (10fgiunchedi) it looks like this is a duplicate of T136630 ? I'm going to followup there [11:55:40] RECOVERY - mediawiki-installation DSH group on mw2218 is OK: OK [11:56:24] 06Operations, 06Discovery, 06Maps, 03Discovery-Maps-Sprint: Send logs to logstash for maps services (katotherian, tilerator, tileratorui) - https://phabricator.wikimedia.org/T137618#2375896 (10Gehel) [11:56:45] (03PS2) 10BBlack: Hygiene: Remove refs to ZeroRatedMobileAccess [puppet] - 10https://gerrit.wikimedia.org/r/293888 (owner: 10Mholloway) [11:59:06] 06Operations, 10ops-codfw, 10media-storage: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2375898 (10fgiunchedi) a:05fgiunchedi>03RobH thanks @robh, please rack two systems per row where ms-be already exist, namely row A/B/C if I'm not mistaken. Spreading machines across rac... [11:59:55] !log change-prop deployed 54f98b7 [11:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:00:00] (03PS3) 10BBlack: Hygiene: Remove refs to ZeroRatedMobileAccess [puppet] - 10https://gerrit.wikimedia.org/r/293888 (owner: 10Mholloway) [12:00:36] (03CR) 10BBlack: [C: 032 V: 032] Hygiene: Remove refs to ZeroRatedMobileAccess [puppet] - 10https://gerrit.wikimedia.org/r/293888 (owner: 10Mholloway) [12:00:51] (03PS6) 10Muehlenhoff: Enable two-factor authentication in sshd [puppet] - 10https://gerrit.wikimedia.org/r/282160 [12:04:22] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable two-factor authentication in sshd [puppet] - 10https://gerrit.wikimedia.org/r/282160 (owner: 10Muehlenhoff) [12:05:08] bblack: I'll merge your Hygiene: Remove refs to ZeroRatedMobileAccess change along [12:05:40] moritzm: please, sorry :) [12:05:55] already done :-) [12:05:56] I thought I was going to push a couple more with it, but the next one got complicated :) [12:08:56] RECOVERY - mediawiki-installation DSH group on mw2228 is OK: OK [12:10:56] 
RECOVERY - mediawiki-installation DSH group on mw2221 is OK: OK [12:10:58] (03PS1) 10BBlack: Zero VCL: remove ZeroTLS header/cookie [puppet] - 10https://gerrit.wikimedia.org/r/294052 [12:12:37] RECOVERY - mediawiki-installation DSH group on mw2229 is OK: OK [12:15:38] (03PS2) 10Muehlenhoff: WIP: Use Yubico OTPs as a second authentication factor for members of the yubiauth group [puppet] - 10https://gerrit.wikimedia.org/r/281630 [12:16:57] RECOVERY - mediawiki-installation DSH group on mw2219 is OK: OK [12:20:57] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [12:23:46] RECOVERY - mediawiki-installation DSH group on mw2223 is OK: OK [12:26:36] RECOVERY - mediawiki-installation DSH group on mw2217 is OK: OK [12:27:26] (03CR) 10BBlack: "Yeah this one gets complicated, we should probably discuss the present and future of X-CS and related stuff from tag_carrier in a ticket.." [puppet] - 10https://gerrit.wikimedia.org/r/293887 (owner: 10Mholloway) [12:28:01] (03PS2) 10BBlack: Add explanatory comments to CORS-related VCL for upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/294016 (owner: 10Ori.livneh) [12:28:28] (03CR) 10BBlack: [C: 032 V: 032] Add explanatory comments to CORS-related VCL for upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/294016 (owner: 10Ori.livneh) [12:28:55] (03PS2) 10BBlack: Drop vestige of SPDY support detection from VCL [puppet] - 10https://gerrit.wikimedia.org/r/294017 (owner: 10Ori.livneh) [12:29:01] (03CR) 10BBlack: [C: 032 V: 032] Drop vestige of SPDY support detection from VCL [puppet] - 10https://gerrit.wikimedia.org/r/294017 (owner: 10Ori.livneh) [12:30:06] (03PS2) 10BBlack: Make upload.wikimedia.org cookie-free [puppet] - 10https://gerrit.wikimedia.org/r/294018 (https://phabricator.wikimedia.org/T137609) (owner: 10Ori.livneh) [12:30:29] (03CR) 10BBlack: [C: 032 V: 032] Make upload.wikimedia.org cookie-free [puppet] - 10https://gerrit.wikimedia.org/r/294018 
(https://phabricator.wikimedia.org/T137609) (owner: 10Ori.livneh) [12:31:26] RECOVERY - mediawiki-installation DSH group on mw2220 is OK: OK [12:32:23] 06Operations, 10Analytics, 10Traffic, 13Patch-For-Review: Make upload.wikimedia.org cookieless - https://phabricator.wikimedia.org/T137609#2375979 (10BBlack) I merged the above, which just un-sets Set-Cookie, but we may want/need to look at this deeper and disabling the setting of the cookies in the first... [12:34:01] 06Operations, 10Analytics, 10Traffic, 13Patch-For-Review: Make upload.wikimedia.org cookieless - https://phabricator.wikimedia.org/T137609#2375991 (10BBlack) (also, all the same probably applies to maps.wm.o tile requests (which is almost all requests there, but not the leaflet/css/js fetches?), which coul... [12:41:59] (03PS2) 10Muehlenhoff: Install parallel gzip (pigz) and parallel xz (pxz) on all servers [puppet] - 10https://gerrit.wikimedia.org/r/293743 (owner: 10Jcrespo) [12:46:06] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [12:47:37] (03PS3) 10BBlack: VCL: block 10% insecure post on non-"secure_post" clusters [puppet] - 10https://gerrit.wikimedia.org/r/289205 (https://phabricator.wikimedia.org/T105794) [12:48:59] (03CR) 10BBlack: VCL: block 10% insecure post on non-"secure_post" clusters [puppet] - 10https://gerrit.wikimedia.org/r/289205 (https://phabricator.wikimedia.org/T105794) (owner: 10BBlack) [12:49:15] (03CR) 10BBlack: [C: 032 V: 032] VCL: block 10% insecure post on non-"secure_post" clusters [puppet] - 10https://gerrit.wikimedia.org/r/289205 (https://phabricator.wikimedia.org/T105794) (owner: 10BBlack) [12:54:32] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2376014 (10Krenair) [12:55:32] (03PS1) 10BBlack: VCL syntax fix for 0f6f5be6 [puppet] - 10https://gerrit.wikimedia.org/r/294057 [12:55:53] (03CR) 
10BBlack: [C: 032 V: 032] VCL syntax fix for 0f6f5be6 [puppet] - 10https://gerrit.wikimedia.org/r/294057 (owner: 10BBlack) [12:58:26] PROBLEM - puppet last run on cp1055 is CRITICAL: CRITICAL: Puppet has 1 failures [13:00:36] RECOVERY - puppet last run on cp1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:03:17] akosiaris: around? [13:06:14] 07Puppet, 10Beta-Cluster-Infrastructure: Puppet error on deploment-aqs01 because E: Version '2.2.6' for 'cassandra' was not found - https://phabricator.wikimedia.org/T137706#2376063 (10Krenair) [13:06:20] 07Puppet, 10Beta-Cluster-Infrastructure: Puppet errors on deploment-aqs01 because E: Version '2.2.6' for 'cassandra' was not found - https://phabricator.wikimedia.org/T137706#2376077 (10Krenair) [13:07:28] (03PS1) 10Muehlenhoff: Add ferm rules (and role) for pybal-test [puppet] - 10https://gerrit.wikimedia.org/r/294058 [13:08:04] 07Puppet, 10Beta-Cluster-Infrastructure: Puppet errors on deploment-aqs01 because E: Version '2.2.6' for 'cassandra' was not found - https://phabricator.wikimedia.org/T137706#2376063 (10Krenair) [13:12:33] PROBLEM - Host mw2215 is DOWN: PING CRITICAL - Packet loss = 100% [13:12:46] <_joe_> that's me ^^ [13:13:54] RECOVERY - Host mw2215 is UP: PING OK - Packet loss = 0%, RTA = 37.15 ms [13:14:34] PROBLEM - Apache HTTP on mw2217 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50406 bytes in 0.191 second response time [13:15:43] RECOVERY - Apache HTTP on mw2215 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 7.947 second response time [13:15:43] RECOVERY - Apache HTTP on mw2218 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 9.020 second response time [13:16:34] RECOVERY - Apache HTTP on mw2219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.463 second response time [13:16:44] RECOVERY - Apache HTTP on mw2217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 6.289 second response time [13:18:04] 
(03PS2) 10Muehlenhoff: Add ferm rules (and role) for pybal-test [puppet] - 10https://gerrit.wikimedia.org/r/294058 [13:19:23] (03CR) 10jenkins-bot: [V: 04-1] Add ferm rules (and role) for pybal-test [puppet] - 10https://gerrit.wikimedia.org/r/294058 (owner: 10Muehlenhoff) [13:20:14] RECOVERY - Apache HTTP on mw2224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 9.938 second response time [13:20:26] 06Operations, 10DBA, 06Labs, 10Tool-Labs, 10Traffic: Antigng-bot improper non-api http requests - https://phabricator.wikimedia.org/T137707#2376097 (10jcrespo) [13:21:35] PROBLEM - Host mw2226 is DOWN: PING CRITICAL - Packet loss = 100% [13:22:04] RECOVERY - Apache HTTP on mw2227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.432 second response time [13:22:33] RECOVERY - Host mw2226 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms [13:24:34] RECOVERY - Apache HTTP on mw2228 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.857 second response time [13:24:53] PROBLEM - Host mw2220 is DOWN: PING CRITICAL - Packet loss = 100% [13:25:23] RECOVERY - Apache HTTP on mw2226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.572 second response time [13:25:25] RECOVERY - Host mw2220 is UP: PING OK - Packet loss = 0%, RTA = 36.90 ms [13:25:53] RECOVERY - Apache HTTP on mw2225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.644 second response time [13:26:01] <_joe_> oh dear, I'll need to do another round of reboots it seems [13:27:04] RECOVERY - Apache HTTP on mw2220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.245 second response time [13:27:06] (03PS2) 10Giuseppe Lavagetto: Enable base::grub::enable_memory_cgroup on the new mw codfw servers. 
[puppet] - 10https://gerrit.wikimedia.org/r/293752 (owner: 10Elukey) [13:28:53] RECOVERY - Apache HTTP on mw2229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.205 second response time [13:29:14] RECOVERY - Apache HTTP on mw2223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 9.286 second response time [13:30:21] (03PS3) 10Muehlenhoff: Add ferm rules (and role) for pybal-test [puppet] - 10https://gerrit.wikimedia.org/r/294058 [13:30:49] (03PS3) 10Giuseppe Lavagetto: Enable base::grub::enable_memory_cgroup on the new mw codfw servers. [puppet] - 10https://gerrit.wikimedia.org/r/293752 (owner: 10Elukey) [13:31:03] RECOVERY - Apache HTTP on mw2222 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.468 second response time [13:33:44] (03CR) 10Giuseppe Lavagetto: [C: 032] Enable base::grub::enable_memory_cgroup on the new mw codfw servers. [puppet] - 10https://gerrit.wikimedia.org/r/293752 (owner: 10Elukey) [13:35:36] _joe_: why not in the role class, based on os_version? [13:35:47] (03CR) 10JanZerebecki: [C: 031] "Looks good to merge. One host that delcared pigz is stat1002.eqiad.wmnet ." [puppet] - 10https://gerrit.wikimedia.org/r/293743 (owner: 10Jcrespo) [13:37:12] <_joe_> paravoid: because I'm not sure we can override a class parameter [13:37:26] <_joe_> I think only resource parameters can be overridden? [13:37:30] <_joe_> I'll test it out [13:44:23] (03PS1) 10BBlack: nginx (1.11.1-1+wmf2) jessie; urgency=medium [software/nginx] (wmf-1.11.1) - 10https://gerrit.wikimedia.org/r/294061 [13:44:37] <_joe_> paravoid: FTR, it's impossible since puppet 2.6 (to override class parameters), see https://tickets.puppetlabs.com/browse/PUP-1367 [13:45:48] yeah, it is [13:46:00] but this augeas stanza could have been a define perhaps, dunno [13:49:01] <_joe_> yeah that's a possibility too [13:49:13] (03CR) 10MarkTraceur: "Obviously this looks fine, just waiting on one thing on the core change." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/293358 (owner: 10Bartosz Dziewoński) [13:49:19] <_joe_> I'll surely improve before we start reimaging older servers [13:53:04] PROBLEM - Apache HTTP on mw2221 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50406 bytes in 0.211 second response time [13:53:18] <_joe_> another round of reboots [13:55:14] !log restarting logstash on logstash1001 [13:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:55:23] RECOVERY - Apache HTTP on mw2221 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 5.902 second response time [13:55:24] RECOVERY - puppet last run on mw2221 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:55:53] PROBLEM - Apache HTTP on mw2227 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50406 bytes in 0.194 second response time [13:56:14] RECOVERY - puppet last run on mw2224 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [13:58:03] RECOVERY - Apache HTTP on mw2227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 6.132 second response time [13:59:35] PROBLEM - Apache HTTP on mw2225 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50406 bytes in 0.201 second response time [14:01:04] PROBLEM - Apache HTTP on mw2220 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50406 bytes in 0.209 second response time [14:01:14] 07Puppet, 10Beta-Cluster-Infrastructure, 10cassandra: Puppet errors on deploment-aqs01 because E: Version '2.2.6' for 'cassandra' was not found - https://phabricator.wikimedia.org/T137706#2376158 (10mobrovac) [14:02:02] PROBLEM - Apache HTTP on mw2223 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50406 bytes in 0.195 second response time [14:02:12] RECOVERY - Apache HTTP on mw2220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 6.483 second response time [14:02:13] PROBLEM - 
nutcracker port on mw2223 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [14:02:43] PROBLEM - nutcracker process on mw2223 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (nutcracker), command name nutcracker [14:02:53] RECOVERY - puppet last run on mw2220 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:03:22] RECOVERY - Apache HTTP on mw2223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 7.121 second response time [14:03:33] RECOVERY - nutcracker port on mw2223 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [14:04:14] RECOVERY - nutcracker process on mw2223 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [14:04:52] RECOVERY - Apache HTTP on mw2225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 6.023 second response time [14:05:13] RECOVERY - puppet last run on mw2223 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:06:32] RECOVERY - puppet last run on mw2229 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:02] PROBLEM - Apache HTTP on mw2215 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50406 bytes in 0.186 second response time [14:18:44] RECOVERY - puppet last run on mw2215 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [14:19:12] RECOVERY - Apache HTTP on mw2215 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 5.955 second response time [14:21:02] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [14:21:33] me ^ [14:22:35] 06Operations, 10ops-eqiad, 10media-storage: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2376227 (10fgiunchedi) 
a:05fgiunchedi>03RobH thanks @robh, let's allocate 2x in each of row A/B/D, space permitting especially in row A I think @Cmjohnson ? If there's 10G ports avail... [14:22:47] (03CR) 10BBlack: [C: 032 V: 032] nginx (1.11.1-1+wmf2) jessie; urgency=medium [software/nginx] (wmf-1.11.1) - 10https://gerrit.wikimedia.org/r/294061 (owner: 10BBlack) [14:23:03] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [14:23:04] !log uploaded nginx-1.11.1-1+wmf2 to carbon [14:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:24:53] RECOVERY - Apache HTTP on mw2232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.565 second response time [14:25:53] RECOVERY - puppet last run on mw2230 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [14:26:57] !log upgrading cp* nginx (and other oustanding minor package updates) [14:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:27:44] RECOVERY - Apache HTTP on mw2230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.133 second response time [14:29:50] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, I'll merge tomorrow" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/291173 (owner: 10BryanDavis) [14:31:07] (03PS2) 10Gehel: Enable 'has_spec' on Kartotherian service. 
[puppet] - 10https://gerrit.wikimedia.org/r/294028 (https://phabricator.wikimedia.org/T137617) [14:31:53] (03PS3) 10Muehlenhoff: Use Yubico OTPs as a second authentication factor for members of the yubiauth group on iron [puppet] - 10https://gerrit.wikimedia.org/r/281630 [14:36:02] RECOVERY - puppet last run on mw2226 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [14:39:54] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [14:39:54] (03PS4) 10Muehlenhoff: Use Yubico OTPs as a second authentication factor for members of the yubiauth group on iron [puppet] - 10https://gerrit.wikimedia.org/r/281630 [14:44:03] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [14:44:14] (03PS1) 10Ema: zerofetch.py: track successful executions [puppet] - 10https://gerrit.wikimedia.org/r/294063 (https://phabricator.wikimedia.org/T132835) [14:49:32] PROBLEM - puppet last run on mw1115 is CRITICAL: CRITICAL: Puppet has 30 failures [14:50:42] RECOVERY - Disk space on ms-be2012 is OK: DISK OK [14:51:38] !log truncate syslog.1 on ms-be2012 [14:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:55:33] 06Operations, 10DBA, 06Labs, 10Tool-Labs, 10Traffic: Antigng-bot improper non-api http requests - https://phabricator.wikimedia.org/T137707#2376258 (10Antigng_) My bot was using /w/index.php?action=raw to fetch the content of each page/redirect at zhwiki, then it will do some simple search/replace/templa... [14:57:21] * James_F waves. [15:00:04] anomie, ostriches, thcipriani, and marktraceur: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160613T1500). Please do the needful. 
[15:00:04] James_F, Urbanecm, Amir1, and joakino: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [15:00:27] Around [15:00:27] o/ [15:00:28] 👍 [15:01:27] I can SWAT today. [15:01:31] joakino: For 294036 you 'just' want that pushed for wmf.5 right? [15:01:49] yep, it's already in master [15:01:56] for .6 i mean [15:02:08] (03PS2) 10Thcipriani: Enable VisualEditor by default for logged-out users on four Wikipedias too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292275 (owner: 10Jforrester) [15:02:18] yes only for wmf.5 James_F [15:02:33] OK. [15:02:35] * James_F fiddles. [15:03:18] oh, mobilefrontend already merged. [15:03:32] OK, I guess I'll get that done since it's already on tin. [15:03:47] thcipriani: Yeah, I think it just needs pushing? BTW, joakino, you shouldn't merge things into deployment branches without immediately deploying. [15:04:33] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Elukey's comment is correct, that would mean analytics would be getting alarms for restbase, aqs and maps cassandra infrastructure problem" [puppet] - 10https://gerrit.wikimedia.org/r/293916 (https://phabricator.wikimedia.org/T137422) (owner: 10JanZerebecki) [15:04:40] ok sorry James_F, i asked for help and brion did merge it bc i asked [15:04:52] sorry [15:04:52] i forgot about that [15:04:55] * James_F nods. [15:05:05] we were uncertain so we went BOLD [15:05:12] For something this serious I'd have got RelEng's permission to do a weekend production push. [15:05:37] !log reboot ms-be2012 to fix disk ordering T136395 [15:05:37] Speaking selfishly as an iOS app user. 
;-) [15:05:38] T136395: ms-be2012.codfw.wmnet: slot=12 dev=sdm failed - https://phabricator.wikimedia.org/T136395 [15:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:06:24] (03CR) 10Alexandros Kosiaris: [C: 04-1] ores: fix workers and config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293904 (owner: 10Ladsgroup) [15:07:01] (03PS2) 10Ema: zerofetch.py: track successful executions [puppet] - 10https://gerrit.wikimedia.org/r/294063 (https://phabricator.wikimedia.org/T132835) [15:07:09] i didn't know that existed :) [15:07:47] joakino: Officially it doesn't, but g.reg-g is very understanding if you ask nicely (and more seriously, if there's significant user impact). [15:08:36] makes sense [15:09:27] !log thcipriani@tin Synchronized php-1.28.0-wmf.5/extensions/MobileFrontend/includes/MobileContext.php: [[gerrit:294036|Do Not strip srcset on API mobileview action]] PART I (duration: 00m 49s) [15:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:10:15] !log thcipriani@tin Synchronized php-1.28.0-wmf.5/extensions/MobileFrontend: [[gerrit:294036|Do Not strip srcset on API mobileview action]] PART II (duration: 00m 38s) [15:10:18] ^ joakino check please [15:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:10:28] on it [15:11:12] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292275 (owner: 10Jforrester) [15:11:21] (03PS1) 10Gehel: Add new maps servers to LVS [puppet] - 10https://gerrit.wikimedia.org/r/294068 (https://phabricator.wikimedia.org/T137620) [15:12:01] (03PS1) 10Eevans: Assign cassandra::target_version to '2.1' [puppet] - 10https://gerrit.wikimedia.org/r/294069 (https://phabricator.wikimedia.org/T137706) [15:12:09] (03CR) 10Gehel: [C: 04-1] "-1 to ensure we validate the servers are running fine before merging this." 
[puppet] - 10https://gerrit.wikimedia.org/r/294068 (https://phabricator.wikimedia.org/T137620) (owner: 10Gehel) [15:12:42] thcipriani: ok change works as expected, i'm not seeing anything wrong, going to keep poking at it [15:13:01] joakino: kk, thanks for checking :) [15:13:55] hmm, zuul asleep on the job... [15:14:07] \o/ [15:14:55] a "recheck" or something? [15:15:23] 06Operations, 06Discovery, 06Maps, 10Traffic, 13Patch-For-Review: Send traffic to new maps200? servers - https://phabricator.wikimedia.org/T137620#2376339 (10Gehel) Actions required to enable traffic to new servers: # validate new servers are running fine # merge and deploy https://gerrit.wikimedia.org/... [15:15:51] (03Merged) 10jenkins-bot: Enable VisualEditor by default for logged-out users on four Wikipedias too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292275 (owner: 10Jforrester) [15:16:44] just a little sluggish, I guess :\ [15:17:11] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2376360 (10Danny_B) `/feed/` links don't seem to have any equivalent in Diffusion (pity!) so for the time being they... [15:17:55] 06Operations, 10DBA, 06Labs, 10Tool-Labs, 10Traffic: Antigng-bot improper non-api http requests - https://phabricator.wikimedia.org/T137707#2376361 (10Joe) @Antigng_ you might not have seen anything go wrong, but your bot was accounting for 50% of the uncached requests to our backends or more. It's a cle... 
[15:18:10] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:292275|Enable VisualEditor by default for logged-out users on four Wikipedias too]] (duration: 00m 24s) [15:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:18:14] ^ James_F check please [15:19:15] (03PS4) 10Thcipriani: Permission changes in zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293737 (https://phabricator.wikimedia.org/T137532) (owner: 10Urbanecm) [15:19:26] thcipriani: Yup, LGTM. Thanks! [15:19:42] James_F: cool, thanks for checking! [15:19:57] verified with apps folk that regression is fixed, thanks James_F thcipriani ! [15:20:02] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293737 (https://phabricator.wikimedia.org/T137532) (owner: 10Urbanecm) [15:20:19] joakino: Yay. [15:20:27] joakino: awesome :) [15:20:39] (03Merged) 10jenkins-bot: Permission changes in zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293737 (https://phabricator.wikimedia.org/T137532) (owner: 10Urbanecm) [15:22:14] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:293737|Permission changes in zhwiki]] (duration: 00m 26s) [15:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:22:20] ^ Urbanecm check please [15:22:42] (03PS2) 10Thcipriani: Enable transwiki import for la.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293738 (https://phabricator.wikimedia.org/T137547) (owner: 10Urbanecm) [15:23:43] Seems ok. 
[15:24:17] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293738 (https://phabricator.wikimedia.org/T137547) (owner: 10Urbanecm) [15:24:42] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2376373 (10Danny_B) `/docs/` links don't seem to have any equivalent in Diffusion so for the time being they may be r... [15:24:42] Urbanecm: kk, thanks for checking [15:25:00] (03Merged) 10jenkins-bot: Enable transwiki import for la.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293738 (https://phabricator.wikimedia.org/T137547) (owner: 10Urbanecm) [15:25:14] (03PS1) 10Ema: zerofetch icinga check [puppet] - 10https://gerrit.wikimedia.org/r/294072 (https://phabricator.wikimedia.org/T132835) [15:26:10] (03PS1) 10KartikMistry: apertium-af-nl: New upstream version [debs/contenttranslation/apertium-af-nl] - 10https://gerrit.wikimedia.org/r/294073 (https://phabricator.wikimedia.org/T107306) [15:26:38] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:293738|Enable transwiki import for la.wiktionary]] (duration: 00m 26s) [15:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:26:47] ^ Urbanecm check please [15:26:54] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2376399 (10Danny_B) [15:27:48] I can't because I haven't import permission. I'll ask the author of the request for checking. [15:28:19] Urbanecm: ack. Thank you. [15:30:07] James_F: would assuming all the messages for https://gerrit.wikimedia.org/r/#/c/294054/ exist and that this just needs a sync-dir be a correct assumption? [15:30:25] * James_F checks. 
[15:30:41] thcipriani: Yeah, it's just a logic bug, the messages exist. [15:30:45] (03PS1) 10BBlack: tlsproxy: use ssl dynamic record sizing [puppet] - 10https://gerrit.wikimedia.org/r/294075 [15:30:49] (No need for a scap.) [15:31:01] James_F: ack, alrighty, syncing :) [15:31:12] (03CR) 10BBlack: "beta caches need nginx package updates first so their config doesn't break..." [puppet] - 10https://gerrit.wikimedia.org/r/294075 (owner: 10BBlack) [15:31:13] Thanks. [15:32:15] thcipriani, can you deploy in this window 10th change? It was originally scheduled in this SWAT but because critical change I rescheduled it for June 15 Morning SWAT. [15:32:31] !log thcipriani@tin Synchronized php-1.28.0-wmf.5/extensions/Echo: SWAT: [[gerrit:294054|Use localized weekdays on Special:Notifications]] (duration: 00m 32s) [15:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:32:41] ^ James_F check please [15:33:06] * James_F does so. [15:33:47] (03CR) 10BBlack: [C: 031] zerofetch.py: track successful executions [puppet] - 10https://gerrit.wikimedia.org/r/294063 (https://phabricator.wikimedia.org/T132835) (owner: 10Ema) [15:33:53] Urbanecm: I'm sorry, I don't follow: 10th change? 
[15:33:58] (03PS1) 10Ema: zerofetch: write output to logfile [puppet] - 10https://gerrit.wikimedia.org/r/294077 (https://phabricator.wikimedia.org/T132835) [15:34:15] (03PS4) 10Thcipriani: Enable VE in NS_PROJECT in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291957 (https://phabricator.wikimedia.org/T136628) (owner: 10Urbanecm) [15:34:45] (03CR) 10Ema: [C: 032 V: 032] zerofetch.py: track successful executions [puppet] - 10https://gerrit.wikimedia.org/r/294063 (https://phabricator.wikimedia.org/T132835) (owner: 10Ema) [15:35:00] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291957 (https://phabricator.wikimedia.org/T136628) (owner: 10Urbanecm) [15:35:02] (03CR) 10BBlack: [C: 031] zerofetch icinga check [puppet] - 10https://gerrit.wikimedia.org/r/294072 (https://phabricator.wikimedia.org/T132835) (owner: 10Ema) [15:35:12] Yes. If I'm counting right, in this SWAT there are 9 changes, so my last one will be 10th if you'll deploy it :). [15:35:32] thcipriani: Yup, LGTM. [15:35:36] (03Merged) 10jenkins-bot: Enable VE in NS_PROJECT in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291957 (https://phabricator.wikimedia.org/T136628) (owner: 10Urbanecm) [15:35:39] James_F: thanks for checking :) [15:35:47] (03CR) 10Ema: [C: 032 V: 032] zerofetch icinga check [puppet] - 10https://gerrit.wikimedia.org/r/294072 (https://phabricator.wikimedia.org/T132835) (owner: 10Ema) [15:35:49] (03CR) 10BBlack: [C: 031] zerofetch: write output to logfile [puppet] - 10https://gerrit.wikimedia.org/r/294077 (https://phabricator.wikimedia.org/T132835) (owner: 10Ema) [15:36:10] Urbanecm: oh! right. yeah, we're making pretty good time so far, I think all should go in this window. [15:36:10] thcipriani ^ [15:36:12] (03CR) 10BBlack: [C: 031] Add new maps servers to LVS [puppet] - 10https://gerrit.wikimedia.org/r/294068 (https://phabricator.wikimedia.org/T137620) (owner: 10Gehel) [15:36:22] Thanks thcipriani . 
[15:36:29] 06Operations, 10ops-codfw: ms-be2012.codfw.wmnet: slot=12 dev=sdm failed - https://phabricator.wikimedia.org/T136395#2376453 (10fgiunchedi) 05Open>03Resolved replaced sdm, raid rebuilt [15:36:40] Should I move it to this window or only remove it from next one? [15:36:59] 06Operations, 10ops-codfw: ms-be2012.codfw.wmnet: slot=10 dev=sdk failed - https://phabricator.wikimedia.org/T135975#2376471 (10fgiunchedi) @papaul news on this replacement? thanks! [15:37:54] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:291957|Enable VE in NS_PROJECT in cswiki]] (duration: 00m 25s) [15:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:38:02] ^ Urbanecm check please [15:38:33] (03CR) 10Mholloway: "I added @Jhobs as a reviewer since he has deeper background on this stuff than I do. There are still a couple of checks for ZeroOpts in t" [puppet] - 10https://gerrit.wikimedia.org/r/294052 (owner: 10BBlack) [15:38:55] Works. [15:38:58] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2376481 (10Danny_B) `/patch/` links have pretty similar equivalent in Diffusion https://git.wikimedia.org/patch/media... 
[15:39:00] (03PS2) 10Thcipriani: Add images.nypl.org to $wgCopyUploadsDomains for commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294039 (https://phabricator.wikimedia.org/T137687) (owner: 10Urbanecm) [15:39:03] (03PS2) 10Ema: zerofetch: write output to logfile [puppet] - 10https://gerrit.wikimedia.org/r/294077 (https://phabricator.wikimedia.org/T132835) [15:39:48] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294039 (https://phabricator.wikimedia.org/T137687) (owner: 10Urbanecm) [15:40:02] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2376483 (10Danny_B) Links like https://phabricator.wikimedia.org/rMW471ab05ea26bd1c844237bc752043536d9d2c284 are not... [15:40:38] (03Merged) 10jenkins-bot: Add images.nypl.org to $wgCopyUploadsDomains for commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294039 (https://phabricator.wikimedia.org/T137687) (owner: 10Urbanecm) [15:42:47] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:294039|Add images.nypl.org to $wgCopyUploadsDomains for commons]] (duration: 00m 24s) [15:42:50] ^ Urbanecm check please [15:43:10] thcipriani: for my patches, I think they should be done in one go [15:43:12] (03PS2) 10Thcipriani: Add ORES to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294027 (https://phabricator.wikimedia.org/T120923) (owner: 10Ladsgroup) [15:43:28] and we need to run two maintenance scripts afterwards [15:43:36] extensions/ORES/maintenenace/CheckModelVersions.php [15:43:36] extensions/ORES/maintenenace/PopulateDatabase.php [15:43:41] thcipriani, I'll ask for it, I haven't permission to use tools for importing. 
[15:43:50] Urbanecm: ack, thanks [15:44:13] RECOVERY - puppet last run on ms-be2012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:44:17] the first one takes about a second, the second one should take about two-three minutes [15:44:32] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294027 (https://phabricator.wikimedia.org/T120923) (owner: 10Ladsgroup) [15:44:59] 06Operations: ms-be2012 ran out of disk space - https://phabricator.wikimedia.org/T137397#2376501 (10fgiunchedi) a:05faidon>03fgiunchedi I'm taking this since swift logging has caused problems in the past and I'm addressing it also as part of swift on jessie [15:45:06] (03Merged) 10jenkins-bot: Add ORES to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294027 (https://phabricator.wikimedia.org/T120923) (owner: 10Ladsgroup) [15:45:08] (03CR) 10Ema: [C: 032 V: 032] zerofetch: write output to logfile [puppet] - 10https://gerrit.wikimedia.org/r/294077 (https://phabricator.wikimedia.org/T132835) (owner: 10Ema) [15:46:10] Amir1: hmm, so WRT to syncing these changes, it seems like it should go extension-list, InitialiseSettings.php, then CommonSettings.php to prevent any logging problems. Is there something I'm missing? [15:46:32] (03PS2) 10Alex Monk: Assign cassandra::target_version to '2.1' [puppet] - 10https://gerrit.wikimedia.org/r/294069 (https://phabricator.wikimedia.org/T137706) (owner: 10Eevans) [15:46:37] Amir1: that is, why should they sync all in one go? [15:46:59] no, you're right [15:47:01] (03PS3) 10Thcipriani: Enable ORES on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269478 (https://phabricator.wikimedia.org/T120923) (owner: 10Reedy) [15:47:10] sorry :) [15:47:27] * Amir1 is super-excited :D [15:47:37] Amir1: np. 
Just wanted to double-check my thinking: it's early monday for me :) [15:47:57] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269478 (https://phabricator.wikimedia.org/T120923) (owner: 10Reedy) [15:48:33] (03Merged) 10jenkins-bot: Enable ORES on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269478 (https://phabricator.wikimedia.org/T120923) (owner: 10Reedy) [15:49:33] Amir1: those maintenance scripts, you said they should run post-sync? [15:49:41] thcipriani: yup [15:53:13] Amir1: mwscript extensions/ORES/maintenance/CheckModelVersions.php fawiki ← look right to you (/me gets everything setup to run quickly) [15:53:24] yup [15:53:29] thcipriani: ^ [15:53:44] Amir1: do those scripts need to be run in any particular order? [15:54:03] thcipriani: yup, the first one has to be checkmodelversion [15:54:21] ack. OK, I'll sync then run the scripts, here goes :) [15:54:47] * Amir1 desperately want to pray, but he can't [15:54:50] :D [15:56:45] 06Operations, 06Discovery, 06Maps, 03Discovery-Maps-Sprint: Send logs to logstash for maps services (katotherian, tilerator, tileratorui) - https://phabricator.wikimedia.org/T137618#2376552 (10Gehel) Starting kartotherian as my own user with modified configuration (enable debug level logging, change log fi... 
[15:57:58] !log thcipriani@tin Synchronized wmf-config/extension-list: SWAT: [[gerrit:294027|Add ORES to extension-list]] (duration: 00m 25s) [15:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:58:30] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:269478|Enable ORES on fawiki]] PART I (duration: 00m 25s) [15:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:59:06] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:269478|Enable ORES on fawiki]] PART II (duration: 00m 24s) [15:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:00:12] PROBLEM - check_puppetrun on betelgeuse is CRITICAL: CRITICAL: Puppet has 1 failures [16:00:30] Amir1: was there something else to run? https://phabricator.wikimedia.org/P3236 [16:01:16] thcipriani: it seems db is not updated [16:01:25] schema changes hasn't been applied [16:01:42] otherwise we would have fawiki.ores_model [16:02:26] (so it seems you should run mwscript maintenance/update.php) [16:02:29] thcipriani: ^ [16:02:47] I'll create the tables. (update.php shouldn't be run in production :)) [16:03:00] I thought it would be ran automatically while deploying with scap3 [16:03:23] thcipriani: we do have DBA review, if you want it [16:03:44] https://phabricator.wikimedia.org/T137567 [16:04:17] thcipriani: should we wait for you or start without? [16:04:36] greg-g: whoops, coming [16:05:10] PROBLEM - check_puppetrun on betelgeuse is CRITICAL: CRITICAL: Puppet has 1 failures [16:05:12] thcipriani: thank you, it took lots of time :) [16:05:19] Amir1: just model? or ores_classification as well? [16:05:25] thcipriani: both [16:05:28] kk [16:07:38] Amir1: running populatedatabase now [16:07:49] first script already ran [16:07:54] thcipriani, is my 10th change in your plan? 
[16:07:59] nice [16:08:43] maybe I'm hitting cache but it's not in the beta features [16:09:21] Amir1: https://phabricator.wikimedia.org/P3237 [16:09:42] Urbanecm: Looks like we ran a bit over, can I push that to a different deployment window? [16:09:59] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2376617 (10Danny_B) [16:10:03] thcipriani: okay [16:10:06] thanks [16:10:10] RECOVERY - check_puppetrun on betelgeuse is OK: OK: Puppet is currently enabled, last run 235 seconds ago with 0 failures [16:10:26] Amir1: can I just leave it as-is? Or do I need to revert anything? [16:10:42] as is would be good [16:10:50] until the evening swat [16:10:53] Amir1: ack. [16:11:02] I'll revert my rescheduling for today. It can be done on June 15. [16:12:29] Urbanecm: ack. Thank you. Sorry my timing estimate was off :\ [16:12:56] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#1721704 (10akosiaris) https://gerrit.wikimedia.org/r/291819 and friends should be settings the grounds for mixing this mess finally [16:16:52] (03PS1) 10KartikMistry: apertium-ca-it: Rebuild for Jessie [debs/contenttranslation/apertium-ca-it] - 10https://gerrit.wikimedia.org/r/294080 [16:16:54] blerg. I need to run a full scap since ores wasn't in extension-list until just now. /me does the needful. 
[16:17:51] !log thcipriani@tin Started scap: Update l10n cache for ores [16:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:19:34] (03PS4) 10Jforrester: Enable VisualEditor by default on eleven Wikivoyages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292746 (https://phabricator.wikimedia.org/T136990) [16:19:40] (03PS2) 10KartikMistry: apertium-ca-it: Rebuild for Jessie [debs/contenttranslation/apertium-ca-it] - 10https://gerrit.wikimedia.org/r/294080 [16:20:15] (03CR) 10Jforrester: [C: 031] "Due for SWAT tomorrow morning (22.5 hours' time), as posted on the VPs." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292746 (https://phabricator.wikimedia.org/T136990) (owner: 10Jforrester) [16:20:51] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [16:28:58] (03PS1) 10BBlack: VCL: do not include labs instances in wikimedia_nets [puppet] - 10https://gerrit.wikimedia.org/r/294083 [16:29:11] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2376658 (10mmodell) Really nice progress! Thank you everyone for pitching in to help. [16:36:31] PROBLEM - MegaRAID on ms-be2003 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) [16:36:40] PROBLEM - Disk space on ms-be2003 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sde1 is not accessible: Input/output error [16:40:09] PROBLEM - check_puppetrun on beryllium is CRITICAL: CRITICAL: Puppet has 12 failures [16:40:09] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: Puppet has 36 failures [16:42:31] frack? [16:42:39] think so [16:42:58] yep: tellurium.frack.eqiad.wmnet. 
3600 IN A 10.64.40.34 [16:43:01] <_joe_> yes both frack hosts [16:43:36] 06Operations, 06Labs, 10netops: Intermittent bandwidth issue to labs proxy (eqiad) from Comcast in Portland OR - https://phabricator.wikimedia.org/T136671#2376676 (10faidon) @brion, any news? [16:44:22] (03CR) 10Alex Monk: "Is varnish in labs still going to work?" [puppet] - 10https://gerrit.wikimedia.org/r/294083 (owner: 10BBlack) [16:45:09] PROBLEM - check_puppetrun on beryllium is CRITICAL: CRITICAL: Puppet has 12 failures [16:45:09] RECOVERY - check_puppetrun on tellurium is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [16:45:35] 06Operations, 06Labs, 10netops: Intermittent bandwidth issue to labs proxy (eqiad) from Comcast in Portland OR - https://phabricator.wikimedia.org/T136671#2376681 (10brion) Seems ok lately, haven't noticed any problems last week. [16:45:50] PROBLEM - Apache HTTP on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:46:00] ^^^ beryllium, tellurium I'm aware of -- updating the puppetmaster [16:46:29] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [16:46:39] PROBLEM - HHVM rendering on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:46:45] 06Operations, 06Labs, 10netops: Intermittent bandwidth issue to labs proxy (eqiad) from Comcast in Portland OR - https://phabricator.wikimedia.org/T136671#2343755 (10yuvipanda) Through the proxy or the public IP? [16:47:01] PROBLEM - SSH on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:47:10] PROBLEM - nutcracker port on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:47:19] PROBLEM - DPKG on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:47:39] PROBLEM - configured eth on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:47:50] PROBLEM - dhclient process on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:47:53] all my grafana graphs went empty, graphite issues? [16:48:09] PROBLEM - nutcracker process on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:48:09] PROBLEM - salt-minion processes on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:48:20] PROBLEM - Check size of conntrack table on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:48:25] hmmmm, not all of them.... [16:48:51] ok data came back, I guess it was temporary [16:49:29] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: Puppet has 1 failures [16:49:56] !log thcipriani@tin Finished scap: Update l10n cache for ores (duration: 32m 04s) [16:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:50:09] PROBLEM - check_puppetrun on beryllium is CRITICAL: CRITICAL: Puppet has 12 failures [16:50:19] ^ Amir1 check l10n for ORES please [16:50:21] PROBLEM - Disk space on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:50:21] PROBLEM - HHVM processes on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:50:22] (03PS1) 10Andrew Bogott: Don't specify kernel version for Jessie image anymore. [puppet] - 10https://gerrit.wikimedia.org/r/294087 [16:50:34] thcipriani: sure [16:51:39] thcipriani: okay, the legend is okay, the CSS module is not being invoked (?), and it is not shown up in the beta features [16:51:44] <_joe_> !log powercycling mw1115 [16:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:52:21] (03CR) 10Andrew Bogott: [C: 032] Don't specify kernel version for Jessie image anymore. 
[puppet] - 10https://gerrit.wikimedia.org/r/294087 (owner: 10Andrew Bogott) [16:54:10] RECOVERY - nutcracker process on mw1115 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:54:11] RECOVERY - salt-minion processes on mw1115 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:54:20] RECOVERY - HHVM processes on mw1115 is OK: PROCS OK: 6 processes with command name hhvm [16:54:21] RECOVERY - Disk space on mw1115 is OK: DISK OK [16:54:29] RECOVERY - Check size of conntrack table on mw1115 is OK: OK: nf_conntrack is 0 % full [16:55:09] RECOVERY - check_puppetrun on beryllium is OK: OK: Puppet is currently enabled, last run 122 seconds ago with 0 failures [16:55:09] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 12 failures [16:55:19] RECOVERY - SSH on mw1115 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [16:55:20] RECOVERY - nutcracker port on mw1115 is OK: TCP OK - 0.000 second response time on port 11212 [16:55:30] RECOVERY - DPKG on mw1115 is OK: All packages OK [16:55:36] thcipriani: so it can wait until we see what's going on. 
:) [16:55:50] RECOVERY - configured eth on mw1115 is OK: OK - interfaces up [16:56:00] RECOVERY - dhclient process on mw1115 is OK: PROCS OK: 0 processes with command name dhclient [16:56:00] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.058 second response time [16:56:20] Amir1: :(( FWIW, all patches should be sync'd, l10n should be up-to-date since ORES is in the extension-list and scap was run [16:56:57] yeah, I checked the replica but ores tables wasn't there [16:56:59] RECOVERY - HHVM rendering on mw1115 is OK: HTTP OK: HTTP/1.1 200 OK - 72425 bytes in 1.335 second response time [16:58:50] RECOVERY - puppet last run on mw1115 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:00:04] gehel: Respected human, time to deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160613T1700). Please do the needful. [17:00:10] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 12 failures [17:01:41] RECOVERY - Disk space on ms-be2003 is OK: DISK OK [17:01:48] nothing planned for WDQS deployment. SMalyshev ping me if that's not the case... [17:01:48] 06Operations, 06Labs, 10netops: Intermittent bandwidth issue to labs proxy (eqiad) from Comcast in Portland OR - https://phabricator.wikimedia.org/T136671#2376702 (10brion) Either. You can have the IP back, I guess, doesn't seem to make any difference. [17:05:10] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 12 failures [17:06:35] (03PS1) 10Yuvipanda: graphite: Use mod_proxy for proxying [puppet] - 10https://gerrit.wikimedia.org/r/294091 [17:06:48] godog ^ what do you think? [17:08:25] yuvipanda: in a meeting but I've added myself to the review for later! [17:09:00] godog thanks! 
I think it should be a no-op (labmon1001 ran with that config for a few days) [17:10:09] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 12 failures [17:11:11] (03CR) 10Kaldari: [C: 032] Load the RevisionSlider extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287936 (https://phabricator.wikimedia.org/T134770) (owner: 10Addshore) [17:12:17] PROBLEM - MariaDB disk space on db1089 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=97%) [17:13:16] jynus volans ^? [17:13:31] PROBLEM - Disk space on db1089 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=97%) [17:13:45] 06Operations, 10Ops-Access-Requests, 06Services, 13Patch-For-Review: sc-admins should be able to join firejail containers - https://phabricator.wikimedia.org/T137412#2367544 (10RobH) Allowing sc-admins the ability for: firejail --join was approved in the operations meeting today. If the patch https://gerr... [17:14:02] yuvipanda, just jynus [17:14:04] (03PS2) 10BBlack: VCL: do not trust labs like we do prod [puppet] - 10https://gerrit.wikimedia.org/r/294083 [17:14:12] I think I decompressed on the wrong partition [17:14:20] ah [17:14:22] ok :) [17:14:29] yeah I see /srv almost empty [17:14:47] my apologies [17:15:06] (03CR) 10Alex Monk: VCL: do not trust labs like we do prod (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294083 (owner: 10BBlack) [17:15:09] RECOVERY - check_puppetrun on bismuth is OK: OK: Puppet is currently enabled, last run 207 seconds ago with 0 failures [17:15:40] RECOVERY - Disk space on db1089 is OK: DISK OK [17:16:11] 06Operations, 10hardware-requests: Site: 1) hardware access request for labs graphite - https://phabricator.wikimedia.org/T137724#2376734 (10yuvipanda) [17:16:22] 06Operations, 10hardware-requests: Site: 1 hardware access request for labs graphite - https://phabricator.wikimedia.org/T137724#2376746 (10yuvipanda) [17:16:37] RECOVERY - MariaDB disk space on db1089 is OK: DISK OK [17:17:01] 
(03CR) 10Mobrovac: [C: 031] Allow firejail --join=* for sc-admin [puppet] - 10https://gerrit.wikimedia.org/r/293510 (https://phabricator.wikimedia.org/T137412) (owner: 10JanZerebecki) [17:17:04] (03CR) 10BBlack: VCL: do not trust labs like we do prod (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294083 (owner: 10BBlack) [17:17:24] 06Operations, 10hardware-requests: Site: 1 hardware access request for labs graphite - https://phabricator.wikimedia.org/T137724#2376734 (10yuvipanda) [17:21:25] (03CR) 10Alex Monk: VCL: do not trust labs like we do prod (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294083 (owner: 10BBlack) [17:22:23] 06Operations, 10hardware-requests: Site: 1 hardware access request for labs graphite - https://phabricator.wikimedia.org/T137724#2376775 (10RobH) a:03RobH [17:24:10] (03PS1) 10RobH: adding user joewalsh to cluster access [puppet] - 10https://gerrit.wikimedia.org/r/294093 (https://phabricator.wikimedia.org/T137110) [17:24:41] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [17:25:08] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for joewalsh - https://phabricator.wikimedia.org/T137110#2376817 (10RobH) 05Open>03stalled a:03RobH I'm on clinic duty this week, and this has a 3 day wait for any objections to be noted. I've... [17:26:09] (03PS4) 10RobH: Allow firejail --join=* for sc-admin [puppet] - 10https://gerrit.wikimedia.org/r/293510 (https://phabricator.wikimedia.org/T137412) (owner: 10JanZerebecki) [17:26:29] (03PS3) 10BBlack: VCL: do not trust labs like we do prod [puppet] - 10https://gerrit.wikimedia.org/r/294083 [17:26:33] its really nice when the access requests filed have obviously read the process and already did all the required steps. 
[17:26:49] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5303837 keys - replication_delay is 0 [17:27:10] (03CR) 10Filippo Giunchedi: [C: 031] "two nits, LGTM otherwise" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/294091 (owner: 10Yuvipanda) [17:27:19] yuvipanda: ^ [17:27:42] (03CR) 10RobH: [C: 032] Allow firejail --join=* for sc-admin [puppet] - 10https://gerrit.wikimedia.org/r/293510 (https://phabricator.wikimedia.org/T137412) (owner: 10JanZerebecki) [17:28:09] godog haha I keep fucking up recommended :D [17:28:21] I'm getting better because I'm writing a lot of dockerfiles with --no-install-recommends :D [17:28:38] godog I'm going to fix those nits and merge it now - do you tink you can stay around for ~10mins to make sure it doesn't blow up? [17:28:50] yuvipanda: lol I learned to spell it the same way with Recommends: in debian/control [17:29:00] 06Operations, 10Ops-Access-Requests, 06Services, 13Patch-For-Review: sc-admins should be able to join firejail containers - https://phabricator.wikimedia.org/T137412#2376830 (10RobH) 05Open>03Resolved a:03RobH @mobrovac +1'd the patchset. As its been approved in the ops meeting, it is now merged liv... [17:29:02] :D [17:29:13] yuvipanda: yup I'm around for another 20m or so [17:29:19] (03PS4) 10BBlack: VCL: do not trust labs like we do prod [puppet] - 10https://gerrit.wikimedia.org/r/294083 [17:29:29] (03CR) 10BBlack: [C: 032 V: 032] VCL: do not trust labs like we do prod [puppet] - 10https://gerrit.wikimedia.org/r/294083 (owner: 10BBlack) [17:29:45] (03PS2) 10Yuvipanda: graphite: Use mod_proxy for proxying [puppet] - 10https://gerrit.wikimedia.org/r/294091 [17:29:48] godog ^ ? 
[17:30:42] (03CR) 10Filippo Giunchedi: [C: 031] graphite: Use mod_proxy for proxying [puppet] - 10https://gerrit.wikimedia.org/r/294091 (owner: 10Yuvipanda) [17:30:44] aye, looks good [17:31:26] bleh, VCL syntax fail again, twice in one day [17:31:34] (03PS3) 10Yuvipanda: graphite: Use mod_proxy for proxying [puppet] - 10https://gerrit.wikimedia.org/r/294091 [17:32:06] (03CR) 10Yuvipanda: [C: 032 V: 032] graphite: Use mod_proxy for proxying [puppet] - 10https://gerrit.wikimedia.org/r/294091 (owner: 10Yuvipanda) [17:32:47] (03PS1) 10BBlack: VCL syntax fix for 000f93e3 [puppet] - 10https://gerrit.wikimedia.org/r/294095 [17:32:52] godog forcing on graphite1001 [17:32:53] 06Operations, 10Ops-Access-Requests, 06WMF-NDA-Requests: NDA request for @WMDE-leszek - https://phabricator.wikimedia.org/T133145#2376836 (10RobH) 05Open>03Resolved [17:33:17] (03CR) 10BBlack: [C: 032 V: 032] VCL syntax fix for 000f93e3 [puppet] - 10https://gerrit.wikimedia.org/r/294095 (owner: 10BBlack) [17:33:25] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA request for @thiemowmde - https://phabricator.wikimedia.org/T135994#2376854 (10RobH) 05Open>03Resolved This seems resolved, and is only lacking the user confirmation they can now login. Pending that somehow not workin... [17:33:35] 06Operations, 10Ops-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jan Dittrich - https://phabricator.wikimedia.org/T136560#2376859 (10RobH) 05Open>03Resolved This seems resolved, and is only lacking the user confirmation they can now login. Pending that somehow not working, I'm resolving this task.... 
[17:34:49] godog hmm, header in https://graphite.wikimedia.org/ is weird [17:35:15] mhh indeed, I'm checking [17:36:17] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Puppet has 2 failures [17:37:11] !log Upgrading restbase1007.eqiad.wmnet w/ https://people.wikimedia.org/~eevans/debian/cassandra_2.2.6-wmf1_all.deb : T137474 [17:37:12] T137474: Investigate lack of recency bias in Cassandra histogram metrics - https://phabricator.wikimedia.org/T137474 [17:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:37:57] yuvipanda: running puppet on graphite1003 too but shouldn't change anything [17:37:58] PROBLEM - puppet last run on cp2004 is CRITICAL: CRITICAL: Puppet has 2 failures [17:38:00] !log Restarting restbase1007-a.eqiad.wmnet : T137474 [17:38:01] T137474: Investigate lack of recency bias in Cassandra histogram metrics - https://phabricator.wikimedia.org/T137474 [17:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:38:48] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Puppet has 2 failures [17:41:08] PROBLEM - graphite.wikimedia.org on graphite2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.075 second response time [17:43:15] !log enable proxy_http apache module on graphite1003 / graphite2002 and restart apache [17:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:44:45] godog hmm those didn't need enabling in other places [17:44:55] should I explicitly enable it anyway [17:44:57] graphite's still borked [17:45:40] bblack: graphs not loading in grafana? 
[17:46:18] RECOVERY - puppet last run on cp2004 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:46:24] godog: yeah [17:46:29] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:46:37] Included in [17:46:42] heh [17:46:47] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes [17:46:58] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:47:05] odd, that dashboard loads for me [17:47:20] (03PS1) 10Yuvipanda: graphite: Enable mod_proxy / mod_proxy_http [puppet] - 10https://gerrit.wikimedia.org/r/294097 [17:47:24] godog ^? [17:47:29] ah ok, if I reload the whole page it loads now [17:47:39] but refreshing data in a page that had been up, was failing with empty graphs [17:48:18] (03CR) 10Filippo Giunchedi: [C: 031] graphite: Enable mod_proxy / mod_proxy_http [puppet] - 10https://gerrit.wikimedia.org/r/294097 (owner: 10Yuvipanda) [17:48:51] bblack: ah, possibly grafana throwing its toys out of the pram if graphite returns funny things [17:48:52] (03PS2) 10Yuvipanda: graphite: Enable mod_proxy / mod_proxy_http [puppet] - 10https://gerrit.wikimedia.org/r/294097 [17:49:04] (03CR) 10Yuvipanda: [C: 032 V: 032] graphite: Enable mod_proxy / mod_proxy_http [puppet] - 10https://gerrit.wikimedia.org/r/294097 (owner: 10Yuvipanda) [17:49:54] yuvipanda: aye, thanks! [17:51:06] godog (IRC): np! 
this adds a small amount of overhead (re-parsing the HTTP headers) but that should be negligible [17:51:19] and the proxy_uwsgi module is buggy when I checked it out so switched to this instead [17:52:03] yeah getting rid of mod_uwsgi and http-proxying seems sane [17:52:42] !log Restarting restbase1007-b.eqiad.wmnet : T137474 [17:52:43] T137474: Investigate lack of recency bias in Cassandra histogram metrics - https://phabricator.wikimedia.org/T137474 [17:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:55:20] !log Restarting restbase1007-c.eqiad.wmnet : T137474 [17:55:21] T137474: Investigate lack of recency bias in Cassandra histogram metrics - https://phabricator.wikimedia.org/T137474 [17:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:55:45] yuvipanda: mhh there's the occasional 502 from apache and uwsgi has this in the logs, Jun 13 17:54:02 graphite1003 uwsgi-graphite-web[12707]: invalid request block size: 5172 (max 4096)...skip [17:56:11] uh hmm [17:57:08] godog the internet tells me that happens when people use --socket or --http-socket instead of --htp [17:57:11] *http [17:57:13] but we are using --http [17:57:55] Upgrade of restbase1007.eqiad.wmnet (https://people.wikimedia.org/~eevans/debian/cassandra_2.2.6-wmf1_all.deb) complete : T137474 [17:57:55] T137474: Investigate lack of recency bias in Cassandra histogram metrics - https://phabricator.wikimedia.org/T137474 [17:57:58] gah [17:58:01] !log Upgrade of restbase1007.eqiad.wmnet (https://people.wikimedia.org/~eevans/debian/cassandra_2.2.6-wmf1_all.deb) complete : T137474 [17:58:01] T137474: Investigate lack of recency bias in Cassandra histogram metrics - https://phabricator.wikimedia.org/T137474 [17:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:58:14] yuvipanda: aye, buffer-size option was the other option I found that might need to be bumped [17:58:19] yeah 
[17:58:31] shall I bump it or should we wait for a bit? [17:58:52] 06Operations, 10hardware-requests: eqiad: 1 hardware access request for labs graphite - https://phabricator.wikimedia.org/T137724#2376978 (10yuvipanda) [17:59:41] 06Operations, 10Deployment-Systems, 03Scap3: Warning: rename(): Permission denied in /srv/mediawiki/wmf-config/CommonSettings.php on line 189 - https://phabricator.wikimedia.org/T136258#2376983 (10RobH) [17:59:41] yuvipanda: yeah I'd say bump it to 16k or so perhaps [17:59:48] RECOVERY - graphite.wikimedia.org on graphite2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1842 bytes in 0.325 second response time [17:59:57] ok [18:01:53] (03PS1) 10Yuvipanda: graphite: bump up buffer-size [puppet] - 10https://gerrit.wikimedia.org/r/294098 [18:01:57] godog (IRC): ^ [18:02:45] (03CR) 10jenkins-bot: [V: 04-1] graphite: bump up buffer-size [puppet] - 10https://gerrit.wikimedia.org/r/294098 (owner: 10Yuvipanda) [18:03:29] (03PS2) 10Yuvipanda: graphite: bump up buffer-size [puppet] - 10https://gerrit.wikimedia.org/r/294098 [18:03:33] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM despite jenkins' failure" [puppet] - 10https://gerrit.wikimedia.org/r/294098 (owner: 10Yuvipanda) [18:04:21] godog jenkins was just rebase failure [18:04:39] (03CR) 10Yuvipanda: [C: 032 V: 032] graphite: bump up buffer-size [puppet] - 10https://gerrit.wikimedia.org/r/294098 (owner: 10Yuvipanda) [18:07:20] yuvipanda: LGTM, the error on the graphite web interface is due to REMOTE_USER not found, the exception trace is in the page but other than that seems to be working [18:07:36] a bit late, but WDQS deployment is starting. No dependency / interaction with anything else expected. [18:07:58] godog \o/. I'll look into the REMOTE_USER thing :D [18:08:19] yuvipanda: nice, thanks! 
I have to run [18:08:36] godog thanks :D [18:11:01] !log deploying latest GUI on WDQS, [18:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:19:29] 06Operations, 10hardware-requests: eqiad: 1 hardware access request for labs graphite - https://phabricator.wikimedia.org/T137724#2377094 (10RobH) 05Open>03stalled Turns out this chassis is identical to graphite1003, other than the use of SSDs. Task T137738 has been created for the ordering of SSDs. [18:19:35] 06Operations, 10hardware-requests: eqiad: 1 hardware access request for labs graphite - https://phabricator.wikimedia.org/T137724#2377099 (10RobH) [18:20:47] (03PS1) 10Ladsgroup: Add ORES to whitelisted beta features. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294103 (https://phabricator.wikimedia.org/T130211) [18:20:53] !log upgrading nginx (etc) on deployment-prep caches [18:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:37:30] 06Operations, 10ops-codfw, 10media-storage: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2377124 (10RobH) Is there a way we can balance things out to make use of all 4 rows? We have row D underutilized at this point. [18:38:58] (03CR) 10MaxSem: [C: 031] Enable 'has_spec' on Kartotherian service. [puppet] - 10https://gerrit.wikimedia.org/r/294028 (https://phabricator.wikimedia.org/T137617) (owner: 10Gehel) [18:40:49] (03CR) 10Jforrester: [C: 04-1] "Not yet; needs checklist https://www.mediawiki.org/wiki/Beta_Features/Package#Release_Requirements confirmation first." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/294103 (https://phabricator.wikimedia.org/T130211) (owner: 10Ladsgroup) [18:41:59] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 200, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR [18:42:10] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [18:43:14] 06Operations, 06Discovery, 06Maps, 03Discovery-Maps-Sprint: Send logs to logstash for maps services (katotherian, tilerator, tileratorui) - https://phabricator.wikimedia.org/T137618#2377134 (10Gehel) service-runner.js is run in the same way in my test and for actual service. Difference might be firejail...... [18:46:26] 06Operations, 06Discovery, 06Maps, 03Discovery-Maps-Sprint: Send logs to logstash for maps services (katotherian, tilerator, tileratorui) - https://phabricator.wikimedia.org/T137618#2373448 (10GWicke) FWIW, pretty much all node services are running in firejail, and as far as I am aware there have not been... [18:49:28] (03CR) 10Ladsgroup: "1. 
https://gerrit.wikimedia.org/r/#/q/status:merged+project:mediawiki/extensions/ORES,n,z" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294103 (https://phabricator.wikimedia.org/T130211) (owner: 10Ladsgroup) [18:50:30] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [18:56:06] (03CR) 10Ladsgroup: "I put it again:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294103 (https://phabricator.wikimedia.org/T130211) (owner: 10Ladsgroup) [18:57:04] 06Operations: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991#2318329 (10ArielGlenn) I'm fine with a daily salt-minion restart but let's make sure that it doesn't leave a duplicate (old) salt-minion running; I've seen this sometimes from puppet thinking... [18:59:53] 06Operations, 10ops-eqiad: Rack/Setup 4 map servers in eqiad - https://phabricator.wikimedia.org/T135018#2377184 (10Cmjohnson) [19:00:19] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 202, down: 0, dormant: 0, excluded: 0, unused: 0 [19:00:39] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [19:00:57] 06Operations: Rack/Setup 4 map servers in eqiad - https://phabricator.wikimedia.org/T135018#2285911 (10Cmjohnson) a:05Cmjohnson>03Gehel @gehel all 4 maps servers are installed and yours for service implementation. I removed the ops-eqiad tag and assigned to you. [19:01:13] (03CR) 10Jforrester: "> 5. https://www.mediawiki.org/wiki/Extension:ORES or we should make another one too?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294103 (https://phabricator.wikimedia.org/T130211) (owner: 10Ladsgroup) [19:03:03] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [19:05:46] (03CR) 10Yurik: "Actually it was Adam Baso who was working on this the most, adding." 
[puppet] - 10https://gerrit.wikimedia.org/r/294052 (owner: 10BBlack) [19:06:42] PROBLEM - MD RAID on maps1002 is CRITICAL: Connection refused by host [19:07:43] PROBLEM - dhclient process on maps1002 is CRITICAL: Connection refused by host [19:07:53] PROBLEM - DPKG on maps1002 is CRITICAL: Connection refused by host [19:07:53] PROBLEM - puppet last run on maps1002 is CRITICAL: Connection refused by host [19:08:13] PROBLEM - salt-minion processes on maps1002 is CRITICAL: Connection refused by host [19:09:43] RECOVERY - dhclient process on maps1002 is OK: PROCS OK: 0 processes with command name dhclient [19:10:22] PROBLEM - Disk space on maps1002 is CRITICAL: Connection refused by host [19:13:22] (03PS1) 10Jdrewniak: T134010 T136874 removing AB test & deploying banner survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294107 [19:14:22] RECOVERY - Disk space on maps1002 is OK: DISK OK [19:15:42] !log aaron@tin Synchronized php-1.28.0-wmf.5/resources: ee2da9c2ae6fac93bf65d17b5ea48e5c47c87d47 (duration: 00m 35s) [19:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:16:42] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:16:44] RECOVERY - MD RAID on maps1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [19:20:03] RECOVERY - DPKG on maps1002 is OK: All packages OK [19:20:03] RECOVERY - puppet last run on maps1002 is OK: OK: Puppet is currently enabled, last run 24 minutes ago with 0 failures [19:26:32] RECOVERY - salt-minion processes on maps1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:27:03] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:29:34] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:29] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: 
HTTP/1.1 200 OK - 1701 bytes in 0.024 second response time [19:39:09] PROBLEM - DPKG on maps1002 is CRITICAL: Connection refused by host [19:42:09] PROBLEM - dhclient process on maps1002 is CRITICAL: Connection refused by host [19:44:28] PROBLEM - puppet last run on maps1002 is CRITICAL: Connection refused by host [19:44:48] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:46:28] RECOVERY - puppet last run on maps1002 is OK: OK: Puppet is currently enabled, last run 21 minutes ago with 0 failures [19:47:18] RECOVERY - DPKG on maps1002 is OK: All packages OK [19:50:39] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.004 second response time [19:51:09] (03CR) 10Ladsgroup: "It's already a flow page now: https://www.mediawiki.org/wiki/Talk:ORES_extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294103 (https://phabricator.wikimedia.org/T130211) (owner: 10Ladsgroup) [19:52:29] PROBLEM - puppet last run on maps1002 is CRITICAL: Connection refused by host [19:54:18] RECOVERY - dhclient process on maps1002 is OK: PROCS OK: 0 processes with command name dhclient [19:54:29] RECOVERY - puppet last run on maps1002 is OK: OK: Puppet is currently enabled, last run 29 minutes ago with 0 failures [20:00:04] gwicke, cscott, arlolra, subbu, bearND, and mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / 
 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160613T2000). [20:00:13] no parsoid deploy today. [20:00:18] no mobileapps deploy today. [20:02:33] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:45] stop complaining labmon we're throwing money at youuuu [20:04:49] (new disks soon) [20:05:07] yuvipanda: labmon1001 hates life. [20:05:39] (03PS2) 10BBlack: tlsproxy: use ssl dynamic record sizing [puppet] - 10https://gerrit.wikimedia.org/r/294075 [20:08:02] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.013 second response time [20:09:41] (03PS3) 10BBlack: tlsproxy: use ssl dynamic record sizing [puppet] - 10https://gerrit.wikimedia.org/r/294075 [20:12:48] (03CR) 10BBlack: [C: 032 V: 032] tlsproxy: use ssl dynamic record sizing [puppet] - 10https://gerrit.wikimedia.org/r/294075 (owner: 10BBlack) [20:13:41] (03CR) 10Jforrester: [C: 031] "This is good to go once Id2d9a410e1b is cherry-picked to wmf.5 and deployed everywhere. Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294103 (https://phabricator.wikimedia.org/T130211) (owner: 10Ladsgroup) [20:24:43] PROBLEM - Apache HTTP on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:25:03] PROBLEM - HHVM rendering on mw1116 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50408 bytes in 2.384 second response time [20:26:34] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.870 second response time [20:26:57] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2377421 (10mmodell) @Danny_B: I don't think we should worry about redirecting links to zip files. I think the current... 
[20:27:03] RECOVERY - HHVM rendering on mw1116 is OK: HTTP OK: HTTP/1.1 200 OK - 72427 bytes in 0.472 second response time [20:32:36] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2377424 (10Paladox) >>! In T137224#2377421, @mmodell wrote: > @Danny_B: > I don't think we should worry about redirec... [20:32:59] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [20:38:52] robh I'm gonna silence it again [20:38:55] i had it silenced for 3d [20:39:08] Just got out of meeting, gonna also kill a lot of statsd metrics to it [20:45:19] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:47:18] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 8.440 second response time [20:54:54] (03PS3) 10Elukey: Assign cassandra::target_version to '2.1' [puppet] - 10https://gerrit.wikimedia.org/r/294069 (https://phabricator.wikimedia.org/T137706) (owner: 10Eevans) [20:57:18] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:58:11] (03CR) 10Elukey: [C: 032] Assign cassandra::target_version to '2.1' [puppet] - 10https://gerrit.wikimedia.org/r/294069 (https://phabricator.wikimedia.org/T137706) (owner: 10Eevans) [20:59:16] urandom: --^ merged! [20:59:47] elukey: sweet! [21:00:08] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2377509 (10mmodell) >>! In T137224#2376483, @Danny_B wrote: > Links like https://phabricator.wikimedia.org/rMW471ab05... 
[21:02:39] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2377518 (10Krenair) [21:02:44] 07Puppet, 10Beta-Cluster-Infrastructure, 10cassandra, 13Patch-For-Review: Puppet errors on deploment-aqs01 because E: Version '2.2.6' for 'cassandra' was not found - https://phabricator.wikimedia.org/T137706#2377515 (10Krenair) 05Open>03Resolved a:03Eevans That seems to have fixed puppet, thanks. [21:03:00] 07Puppet, 10Beta-Cluster-Infrastructure, 10cassandra: Puppet errors on deploment-aqs01 because E: Version '2.2.6' for 'cassandra' was not found - https://phabricator.wikimedia.org/T137706#2377521 (10Krenair) [21:05:14] (03PS2) 10Bartosz Dziewoński: Update cross-wiki upload configuration for I2489004271078a [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293358 [21:07:34] (03PS3) 10Bartosz Dziewoński: Update cross-wiki upload configuration for I2489004271078a [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293358 [21:08:57] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2377529 (10mmodell) [21:10:09] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [21:11:42] 06Operations, 10Traffic: Parametrization of VCL is inconsistent - https://phabricator.wikimedia.org/T137747#2377539 (10ori) [21:13:58] (03PS1) 10Ori.livneh: Parametrize supplementary response headers in vcl_config [puppet] - 10https://gerrit.wikimedia.org/r/294171 [21:14:16] 06Operations, 10Traffic: Parametrization of VCL is inconsistent - https://phabricator.wikimedia.org/T137747#2377557 (10ori) [21:15:08] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 291 seconds ago with 0 failures [21:16:42] thcipriani, hi, any thoughts on tilerator deployments?
[21:16:48] via scap3 [21:17:35] yurik: yeah, I have a patch for puppet for it. [21:17:44] ah, awesome, what was wrong? [21:17:54] we could try to pull an opsy in :) [21:18:10] ready whenever: https://gerrit.wikimedia.org/r/#/c/293518/ the problem was that there were two services running from the same directory [21:18:28] so two service::node deinitions. Originally, I only changed one of them. [21:18:45] since they were in conflict, nothing was changing :) [21:20:18] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 3 failures [21:22:02] thcipriani, well done :) sorry for causing a problem with the two service thing.... eventually i may try to solve it, but its messy :( [21:22:30] eh, I should have been a better code-grepper :P [21:23:07] thcipriani, do you think we can find an ops to do our bidding? [21:23:25] i can do it now with you, but no opsy super powers [21:23:45] I have no super-opsen powers either. [21:23:49] * thcipriani looks a deployment cal [21:24:40] looks like we've got some dead time before the next deploy window. If there is a willing opsen around, I've got time to sit in. [21:26:02] akosiaris, ^ [21:26:22] i will do all the depl, just need someone to +2 a puppet [21:27:16] well, and someone to run puppet on all the targets. [21:28:58] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 610 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5313618 keys - replication_delay is 610 [21:32:56] robh, could you help with it? ^^ [21:36:59] PROBLEM - puppet last run on install2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:37:20] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5286753 keys - replication_delay is 0 [21:38:25] yurik: well, I suppose we could try to schedule something on Wednesday again if gehel is up for it. [21:39:19] gehel is up for it... [21:39:20] yuvipanda, do you have +2 opsen? 
[21:40:18] gehel up for it now? [21:40:25] i thought you wanted to take some time off :) [21:40:41] yurik: not right now... tomorrow? [21:40:44] I DIDN'T DO IT! [21:40:47] a dog ate my +2? [21:40:51] lol [21:41:06] that sounds legitimate. [21:41:21] yuvipanda, basically we need a puppet +2 and run something magical on all mapsy servers :) [21:41:38] to enable scap3 for maps [21:41:44] we already did it part way [21:41:45] yurik: sorry, i ws afk [21:41:46] whats up? [21:41:51] \o/ [21:42:29] basically looking for a medium-brave (nothing should be too dangerous) opsen to +2 a puppet and run some magic to enable scap3 for thcipriani [21:42:36] robh, ^ [21:43:01] medium brave is an official opsen grade ;) [21:43:17] hrmm, im willing to take a look at the patch and then decide how brave i am =] [21:43:21] but what can I do? [21:43:22] Link to patch? [21:43:26] robh: https://gerrit.wikimedia.org/r/#/c/293518/ [21:43:32] link to patch? [21:43:41] yuvipanda, i thought your dog ate it :D [21:44:11] why would I say such a thing?! [21:44:27] i will take any help from any brave soul :) thcipriani runs the show :) [21:45:13] heh, a.k.a watching for an problems. [21:45:18] *any [21:45:29] why is the change to the service node needed? [21:45:33] nah, you also say what should be done and in what order :) [21:45:36] yurik: so what i can see is this does touch the primary deployment config (the additions to node.pp) but hten also a bunch of stuff that even if it was wrong wouldnt break things [21:45:45] but the tie to main node.pp if wrong breaks stuff right? [21:46:04] im not sure how well i understand our deployment (lie, i know i dont understand it well enough) [21:46:15] * yurik redirects to thcipriani :) [21:46:26] so if this breaks things, it only breaks the ability to deploy. How easily would it roll back? 
[21:46:40] I think it would only break puppet actually [21:46:56] i don't know anything about it at all, so won't pretend i know things [21:47:09] and ability to deploy (that is, to deploy things that aren't tilerator) shouldn't be affected [21:47:16] ^ [21:47:20] I think is mostly it. [21:47:37] thcipriani if you can amend the commit message with rationale for the change to the service::node, that'll be good I think [21:47:53] * thcipriani does [21:49:09] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [21:51:32] (03PS1) 10Yuvipanda: diamond: Allow disabling via hiera [puppet] - 10https://gerrit.wikimedia.org/r/294179 (https://phabricator.wikimedia.org/T137753) [21:51:42] thcipriani, please ping me when ready. I'm doing some other minor cleanups [21:52:04] (03PS8) 10Thcipriani: Deploy Tilerator with Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/293518 [21:52:29] ^ yuvipanda rationale/commit message updated [21:53:20] (03CR) 10Yuvipanda: "I think long term we shuld probably allow the service:: definitions to be more composable and assume less things, but this will do now for" [puppet] - 10https://gerrit.wikimedia.org/r/293518 (owner: 10Thcipriani) [21:53:35] (03PS9) 10Yuvipanda: Deploy Tilerator with Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/293518 (owner: 10Thcipriani) [21:53:44] (03CR) 10Yuvipanda: [C: 032 V: 032] Deploy Tilerator with Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/293518 (owner: 10Thcipriani) [21:53:52] yurik: ^ [21:54:25] yepii! [21:54:29] thcipriani, now what? [21:54:37] so im paranod and running in compliler [21:54:47] is this going to affect all scap hosts or is there a specific one to test against? 
[21:54:53] oh, yuvi just merged nm [21:54:54] ;] [21:55:06] :D [21:55:09] robh, feel free to test against maps200[1-4] [21:55:27] but not maps-test200[1-4] (test is prod, prod is test :) [21:55:29] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: puppet fail [21:55:32] sorry robh ! [21:55:45] * yuvipanda eyes yurik wearily :P [21:55:58] * yurik glares back [21:56:16] * yurik winks at akosiaris [21:56:17] yuvipanda: no worries [21:56:24] yurik: so it looks like scap deploy --init was already run for tilerator, so after puppet runs on the maps machines you should be able to run a deploy [21:56:47] (you may want to double-check the maps boxen to make sure that deploy-service owns /srv/deployment/tilerator now) [21:56:53] hrmm, now to see if the test build will be done before actual deploy, race! [21:56:55] heh [21:56:58] thcipriani the config being in the hiera module for the scap role rather than the deployment role is iffy too - since that ties all scap masters to that set of deploys, but that also is unreltaed to the patch [21:57:17] thcipriani, building the latest versions, sec [21:57:21] thcipriani are you content to wait for the cron or want me to force a run? [21:57:41] 06Operations, 06Labs, 10Labs-Infrastructure: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2377737 (10Krenair) ```krenair@tools-bastion-03:~$ host 10.68.17.58 ;; Truncated, retrying in TCP mode. 58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikim... [21:57:50] (03PS2) 10Yuvipanda: diamond: Allow disabling via hiera [puppet] - 10https://gerrit.wikimedia.org/r/294179 (https://phabricator.wikimedia.org/T137753) [21:57:59] ^ it's 50, before you check [21:58:31] yuvipanda: if you could force a run on the maps200[1-4] and maps-test200[1-4] that would be great. 
[21:58:51] ok, doing so linearly moment [21:58:56] oh, 20x, codf [21:58:57] kk [22:00:09] PROBLEM - check_mysql on fdb2001 is CRITICAL: Cant connect to local MySQL server through socket /var/run/mysqld/mysqld.sock (2) [22:00:26] doing so now, thcipriani (IRC) [22:00:51] thank you! [22:00:52] starting on test 01 to 04 then on maps 01 to 04 [22:02:07] yurik: so after puppet is run, you've got tilerator setup to deploy all targets in 1 stage, restart the service, and check the service port. [22:02:23] awesome! [22:02:31] yuvipanda, robh you rock! [22:02:35] so yeah compiler said no issues [22:02:42] so hopefully real puppet run is equally fine [22:03:28] thcipriani, still building it, npm + gyp are not having a good day :( [22:05:08] PROBLEM - check_mysql on fdb2001 is CRITICAL: Cant connect to local MySQL server through socket /var/run/mysqld/mysqld.sock (2) [22:06:23] thcipriani, nah, not worth rebuilding it now, i have a very recent version of tilerator in gerrit - merging it now [22:06:24] https://gerrit.wikimedia.org/r/#/c/294170/ [22:06:36] (03CR) 10jenkins-bot: [V: 04-1] diamond: Allow disabling via hiera [puppet] - 10https://gerrit.wikimedia.org/r/294179 (https://phabricator.wikimedia.org/T137753) (owner: 10Yuvipanda) [22:06:55] yurik: okie doke. [22:08:02] thcipriani, i also merged https://gerrit.wikimedia.org/r/#/c/293362/ -- can we test it afterwards? [22:08:30] yurik: yup, lgtm [22:09:25] thcipriani, ok, sorry, remind me - i should do scap deploy init + scap deploy -v ? [22:09:37] from tin:/srv/deployment/tilerator/deploy [22:09:40] right? 
[22:10:17] yurik: you only need to run: scap deploy -v (scap deploy --init was just to make sure puppet was happy on the targets if needed) [22:10:18] PROBLEM - check_mysql on fdb2001 is CRITICAL: Cant connect to local MySQL server through socket /var/run/mysqld/mysqld.sock (2) [22:10:32] but we probably need to wait to make sure thep uppet ru n completes [22:10:45] er, we *do* need to wait :) [22:11:13] thcipriani, i just ran the --init, not -v [22:11:19] yurik: scap deploy --init just writes .git/DEPLOY_HEAD [22:11:21] juts to make all the puppets happy :) [22:11:23] ah [22:11:24] :) [22:11:36] won't hurt, right? :) [22:11:41] fdb would be frack - Jeff_Green ^ [22:11:52] yurik: yeah, it's fine. [22:12:04] thcipriani, ok, poke me when to run it :) [22:12:12] kk [22:12:17] thcipriani, can we try kartotherian in the mean time? [22:12:21] since its already enabeld? [22:12:24] ^^ fixing [22:12:46] yurik: I suppose so, sure. [22:13:25] btw [22:13:30] on the non test ones it failed [22:13:40] https://dpaste.de/Xhqh [22:14:10] eww [22:14:27] that's not good. ^ yurik tileratorui service not starting? [22:15:08] RECOVERY - check_mysql on fdb2001 is OK: Uptime: 3490789 Threads: 1 Questions: 43767499 Slow queries: 20728 Opens: 1613 Flush tables: 2 Open tables: 581 Queries per second avg: 12.537 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [22:17:06] I think it probably just needs first deploy? 
[22:17:18] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:17:26] thcipriani, it might be masked [22:17:36] i do that when i run it from user shell [22:17:50] that is indeed what it says [22:18:04] don't worry about tilerator not running - i often start and stop it as i try new things [22:18:19] sometimes i let it run "as it is suppose to" [22:18:41] once we enable auto-updating from OSM db, i shouldn't babysit it as much [22:19:25] yurik: evidently caused a puppet failure on maps200[1-4] https://dpaste.de/Xhqh [22:20:16] thcipriani, i can re-enable it now [22:20:52] ack. Just want to make sure there are clean puppet runs with the new patch :) [22:22:55] thcipriani, done, both tilerator and ui are running on all maps [22:22:59] haven't deployed yet [22:23:16] let me know when [22:23:19] yuvipanda: ^ could I get you to re-run puppet on the nodes that failed? [22:23:21] i will depl kartotherian in the mean time [22:23:36] btw, i should probably check if maps-test has tilerator running ok [22:23:52] yeah doing [22:24:02] thx! [22:24:06] thank you [22:24:51] yep, tilerator+ui run on all maps-tests too [22:25:20] thcipriani, do i need to do --init after changing scap configuration? [22:25:40] yurik: it'll do that as part of: scap deploy -v [22:25:47] ok [22:25:56] thcipriani, i'm deploying kartotherian. hold on to your hats [22:26:03] * thcipriani watches [22:26:35] testing maps-test2001 [22:27:26] it finished fine [22:27:33] yep, alls good, continuing [22:27:37] yuvipanda: thank you for your help! [22:27:49] yw thciprian [22:28:04] thcipriani, want to do graphoid as well? :) [22:28:17] now we know whom to poke :D [22:28:44] thcipriani, ready for tilerator? [22:28:51] yurik: yup. watching. [22:29:33] its quiet ... too quiet ... 
[22:30:02] sudo /usr/sbin/service tilerator restart' returned non-zero exit status 1 [22:30:09] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 23 failures [22:30:19] lovelly [22:30:50] (03PS3) 10Yuvipanda: diamond: Allow disabling via hiera [puppet] - 10https://gerrit.wikimedia.org/r/294179 (https://phabricator.wikimedia.org/T137753) [22:31:19] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:31:47] thcipriani, tilerator seems to be happily running :) [22:31:56] on maps-test -for the past week :) [22:32:10] and for about two minutes on maps [22:32:31] * yurik feels proud for creating such a monster... [22:32:41] * yurik changes his nick to Frankenstein [22:33:02] yurik: hmm, for some reason deploy-service only has permission: (root) NOPASSWD: /usr/sbin/service tileratorui * but not for tilerator [22:33:04] * yurik remembers that it didn't end well the last time [22:33:04] (03CR) 10Ori.livneh: [C: 04-1] "Can you just set $handler to https://github.com/BrightcoveOS/Diamond/wiki/handler-NullHandler ?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294179 (https://phabricator.wikimedia.org/T137753) (owner: 10Yuvipanda) [22:33:09] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.352 second response time [22:33:37] thcipriani, i can always restart it by hand of course [22:34:33] yurik: it's likely due to the double scap::node thing I would guess :\ [22:35:01] (03CR) 10jenkins-bot: [V: 04-1] diamond: Allow disabling via hiera [puppet] - 10https://gerrit.wikimedia.org/r/294179 (https://phabricator.wikimedia.org/T137753) (owner: 10Yuvipanda) [22:35:09] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 23 failures [22:36:01] yurik: yup. that's exactly what's happening. [22:37:08] yurik: I'll probably need a little bit to poke at puppet stuffs. 
Until that change you may have to restart by hand :( [22:37:17] (03PS4) 10Yuvipanda: diamond: Allow disabling via hiera [puppet] - 10https://gerrit.wikimedia.org/r/294179 (https://phabricator.wikimedia.org/T137753) [22:37:33] thcipriani, ok, what should i change to make it non-autorestartable? [22:38:03] yurik: if you comment out service_name and service_port in scap/scap.cfg [22:38:06] that should do it. [22:38:12] ok [22:39:22] (03PS5) 10Yuvipanda: diamond: Allow disabling via hiera [puppet] - 10https://gerrit.wikimedia.org/r/294179 (https://phabricator.wikimedia.org/T137753) [22:40:08] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 23 failures [22:40:23] thcipriani, https://gerrit.wikimedia.org/r/#/c/294186/ [22:40:32] (03PS1) 10Chad: Update extension distributor branches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294187 [22:40:50] legoktm: Hehe ^ :) [22:40:58] yurik: +1'd [22:41:00] (03PS19) 10Andrew Bogott: MySQL backend for storing roles / hiera data for labs [puppet] - 10https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412) (owner: 10Yuvipanda) [22:41:03] (03PS1) 10Andrew Bogott: Define PUPPETMASTER_API for Horizon [puppet] - 10https://gerrit.wikimedia.org/r/294188 [22:41:07] thcipriani, deploying... 
[22:41:09] (03CR) 10Yuvipanda: [C: 032] diamond: Allow disabling via hiera [puppet] - 10https://gerrit.wikimedia.org/r/294179 (https://phabricator.wikimedia.org/T137753) (owner: 10Yuvipanda) [22:41:29] (03CR) 10Legoktm: [C: 031] Update extension distributor branches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294187 (owner: 10Chad) [22:41:41] yuvipanda: I made a suggestion in my review comment [22:41:44] thcipriani, done, seems all good [22:41:45] !log Deployed patches for T129738 to wmf5 [22:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:41:57] ori I fixed it I think [22:41:58] yurik: wheee :) [22:42:19] PROBLEM - configured eth on install2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:42:21] yuvipanda: > Can you just set $handler to https://github.com/BrightcoveOS/Diamond/wiki/handler-NullHandler ? [22:42:21] !log switched to scap3 and deployed tilerator. Deployed kartotherian. Restarted. [22:42:24] doesn't matter i guess [22:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:42:30] if you plan to fix it [22:42:35] thcipriani, want to tackle graphoid while we are doing it? [22:42:48] (03CR) 10Chad: [C: 032] Update extension distributor branches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294187 (owner: 10Chad) [22:43:01] yurik: also some puppet work needed there. next round of updates? [22:43:03] ori bah, completely missed the non-inline comment, sorry. [22:43:11] 'sokay, lots of jenkins noise [22:43:11] ostriches: oh wait, set $wgExtDistCandidateSnapshot == 'REL1_27'; [22:43:27] (03CR) 10Legoktm: [C: 04-1] "Needs $wgExtDistCandidateSnapshot = 'REL1_27';" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294187 (owner: 10Chad) [22:43:37] thcipriani, sure, poke me when you want to continue on it. How's graphoid different from others ? [22:44:00] ori I'm going to revert it (haven't puppet merged) and take a deeper look. 
I know the current one is no-op (tested it). [22:44:19] RECOVERY - configured eth on install2001 is OK: OK - interfaces up [22:44:38] yurik: I'm not sure if it is different just yet, still need to make and test patch for it. Thank you for migrating your services: I appreciate it :) [22:44:48] legoktm: Ah forgot about that [22:44:54] We should leave it commented out when not in use :) [22:45:09] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 23 failures [22:45:12] yeah, it's a new thing [22:45:20] because we got rid of the WikimediaMessages i18n stuff [22:45:25] (03PS1) 10Yuvipanda: Revert "diamond: Allow disabling via hiera" [puppet] - 10https://gerrit.wikimedia.org/r/294189 [22:45:32] (03PS2) 10Chad: Update extension distributor branches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294187 [22:45:34] thcipriani, thanks for working on it! scap depl seems soooooo much nicer than the git depl :) [22:45:59] (03CR) 10Yuvipanda: [C: 032 V: 032] Revert "diamond: Allow disabling via hiera" [puppet] - 10https://gerrit.wikimedia.org/r/294189 (owner: 10Yuvipanda) [22:46:32] (03CR) 10Legoktm: [C: 031] Update extension distributor branches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294187 (owner: 10Chad) [22:49:26] (03CR) 10Chad: [C: 032] Update extension distributor branches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294187 (owner: 10Chad) [22:50:04] (03Merged) 10jenkins-bot: Update extension distributor branches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294187 (owner: 10Chad) [22:50:09] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 23 failures [22:51:15] !log demon@tin Synchronized wmf-config/CommonSettings.php: Update extension distributor settings (duration: 00m 24s) [22:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:51:34] (03PS1) 10Yuvipanda: diamond: Allow setting handlers via hiera [puppet] - 
10https://gerrit.wikimedia.org/r/294191 (https://phabricator.wikimedia.org/T137753) [22:51:41] ori ^ is more elegant I think [22:51:51] yuvipanda: hey, do you know why some tables don't exist in replica, and where is the list and how can add tables? [22:51:54] :) [22:52:17] Amir1 I can't pretend I didn't see it can I? [22:52:19] PROBLEM - puppet last run on mw2130 is CRITICAL: CRITICAL: puppet fail [22:52:53] :D, tell me when you have some time [22:53:01] no rush, send me an email [22:53:01] Amir1: long story short is we need to run a perl script, but I think we'd prefer to do that post wikimania. Is that acceptable? [22:53:31] yuvipanda: of course, I need it for ORES extension stuff [22:53:45] running some analytics [22:54:43] nothing urgent [22:54:47] Hi. [22:54:59] Amir1: https://www.mediawiki.org/wiki/Talk:ORES_review_tool doesn't have a description [22:55:02] ostriches: it'll start populating the 1.27 tarballs in an hour or two [22:55:08] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 23 failures [22:55:32] Dereckson: I'll add on in a sec [22:58:19] PROBLEM - puppet last run on maps1002 is CRITICAL: Connection refused by host [22:58:29] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [22:58:35] Done now [22:58:51] not a great thing [22:58:56] but works for now [22:59:37] Nice. It helps to give context. [23:00:04] RoanKattouw, ostriches, Krenair, MaxSem, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160613T2300). [23:00:04] jan_drewniak, Amir1, and MatmaRex: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:09] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 23 failures [23:00:14] hi. 
[23:00:23] o/ [23:00:25] o/ [23:01:08] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [23:01:09] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [23:02:19] PROBLEM - Disk space on elastic1002 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80076 MB (15% inode=99%) [23:03:13] Okay, I can SWAT tonight. [23:03:31] (03CR) 10Yuvipanda: [C: 032] diamond: Allow setting handlers via hiera [puppet] - 10https://gerrit.wikimedia.org/r/294191 (https://phabricator.wikimedia.org/T137753) (owner: 10Yuvipanda) [23:04:03] (03PS2) 10Dereckson: T134010 T136874 removing AB test & deploying banner survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294107 (owner: 10Jdrewniak) [23:04:13] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294107 (owner: 10Jdrewniak) [23:04:49] (03Merged) 10jenkins-bot: T134010 T136874 removing AB test & deploying banner survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294107 (owner: 10Jdrewniak) [23:04:58] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [23:05:09] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 23 failures [23:05:09] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [23:05:37] (03PS1) 10Yurik: Make all map sources public for admins of tileratorui [puppet] - 10https://gerrit.wikimedia.org/r/294192 [23:05:39] PROBLEM - configured eth on maps1002 is CRITICAL: Connection refused by host [23:05:58] yuvipanda: for https://gerrit.wikimedia.org/r/#/c/294191/1/modules/standard/manifests/diamond.pp , didn't you mean to make labs's handler the null handler? 
[23:06:38] ori nope, I intended that to be a no-op change [23:06:47] and am preparing a follow up change that does the appropriate null / non-null handling [23:06:52] ah cool [23:06:53] yeah lgtm [23:07:14] cool, thanks! It's a nicer solution than diamond_enabled [23:07:28] RECOVERY - Disk space on elastic1002 is OK: DISK OK [23:08:38] PROBLEM - DPKG on maps1002 is CRITICAL: Connection refused by host [23:08:38] PROBLEM - Disk space on maps1002 is CRITICAL: Connection refused by host [23:08:48] PROBLEM - dhclient process on maps1002 is CRITICAL: Connection refused by host [23:08:49] PROBLEM - MD RAID on maps1002 is CRITICAL: Connection refused by host [23:09:40] RECOVERY - configured eth on maps1002 is OK: OK - interfaces up [23:10:00] !log dereckson@tin Synchronized portals/prod/wikipedia.org/assets: (no message) (duration: 00m 24s) [23:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:10:08] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 23 failures [23:10:22] (03PS2) 10Yurik: Make all map sources public for admins of tileratorui [puppet] - 10https://gerrit.wikimedia.org/r/294192 (https://phabricator.wikimedia.org/T137053) [23:10:25] !log dereckson@tin Synchronized portals: (no message) (duration: 00m 24s) [23:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:10:32] jan_drewniak: please test ^ [23:11:07] Dereckson: looks good, thanks! [23:11:13] yuvipanda, sorry to bug you again, could you +2 https://gerrit.wikimedia.org/r/#/c/294192/ [23:11:22] its a minor config change for the admin service [23:11:31] (tileratorui) [23:11:40] jan_drewniak: thanks for testing [23:11:44] MatmaRex: hmmm [23:11:46] This is a no-op change. It should be deployed after I2489004271078a is [23:11:49] merged, but before it is deployed. Old code is compatible with both [23:11:52] old and new config, but new code is only compatible with new config. 
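The ordering constraint quoted above (old code works with old and new config, new code only with new config) can be written down as a small check. This is only a sketch of the reasoning; the state names and the `order_is_safe` helper are made up for illustration and are not part of any deployment tool.

```python
# Pairs of (code, config) that work together, per the quoted commit message.
COMPATIBLE = {
    ("old_code", "old_config"),
    ("old_code", "new_config"),
    ("new_code", "new_config"),
}


def order_is_safe(steps):
    """Return True if every intermediate state of the rollout is compatible."""
    code, config = "old_code", "old_config"
    for step in steps:
        if step == "deploy_config":
            config = "new_config"
        elif step == "deploy_code":
            code = "new_code"
        else:
            raise ValueError("unknown step: %r" % step)
        if (code, config) not in COMPATIBLE:
            return False
    return True


print(order_is_safe(["deploy_config", "deploy_code"]))
print(order_is_safe(["deploy_code", "deploy_config"]))
```

Deploying the config first keeps every intermediate state inside the compatible set, which is why the config change has to reach production before the new code does.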
[23:12:18] PROBLEM - salt-minion processes on maps1002 is CRITICAL: Connection refused by host [23:12:28] (03PS1) 10Yuvipanda: labs: Disable diamond labswide (enable for 3 projects) [puppet] - 10https://gerrit.wikimedia.org/r/294196 (https://phabricator.wikimedia.org/T137753) [23:12:58] MatmaRex: normally in such scenarii, the only requisite is to update config before core change merge [23:13:16] That doesn't matter if it's merged or not. [23:13:34] What matters of course is you're going to merge it. [23:14:01] Dereckson: hmm? [23:14:11] yurik y'know, at this point I've to basically blindly +2 that - I've no idea what tilerator even does, nor what allowPublicSources means. Why is this config in a repo where only ops can merge this? why isn't this part of something y'all can deploy? [23:14:12] oh hell, the mediawiki/core merge failed D: [23:14:23] failure is from https://integration.wikimedia.org/ci/job/mediawiki-core-phpcs-trusty/2290/console [23:14:45] Dereckson: it's a false failure :/ please consider https://gerrit.wikimedia.org/r/#/c/293355/ merged [23:14:54] yuvipanda, because that's how services are setup :( You are right, it makes absolutely no sense [23:15:08] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 23 failures [23:15:26] Dereckson: i'm not really sure what you're saying though [23:15:35] indeed. I don't think I feel comfortable merging this since the little writing up on it suggests it's something that involves access controls restricting something to admins only, but I've no idea if that's true. sorry. [23:15:49] yuvipanda, i added that setting about two hours ago to simplify map administration, clearly I should be the one controlling it, just like i control the wmf-config when i deploy a new service. Silly, i know. 
[23:15:55] (03PS4) 10Dereckson: Update cross-wiki upload configuration for I2489004271078a [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293358 (owner: 10Bartosz Dziewoński) [23:15:59] I don't disagree at all [23:16:11] we should probably have a different config repo for services? [23:16:23] * yurik pokes gwicke and mobrovac [23:16:24] and am feeling very silly and change averse myself, but I really don't think I should merge changes I Don't fully understand. [23:16:25] yup [23:16:38] RECOVERY - DPKG on maps1002 is OK: All packages OK [23:16:38] RECOVERY - Disk space on maps1002 is OK: DISK OK [23:16:44] but they obviously wouldn't know anything about it either :) [23:16:52] (03PS5) 10Dereckson: Update cross-wiki upload configuration for I2489004271078a [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293358 (owner: 10Bartosz Dziewoński) [23:16:57] indeed. [23:17:13] the specifics of the change i mean. They might have an idea of how to configure a proper services config outside of puppets [23:17:21] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293358 (owner: 10Bartosz Dziewoński) [23:17:39] or maybe some hardcore opsy would :) [23:17:44] Dereckson: ah, i see now. okay. [23:17:53] (03Merged) 10jenkins-bot: Update cross-wiki upload configuration for I2489004271078a [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293358 (owner: 10Bartosz Dziewoński) [23:18:00] sure. the change should be deployed by someone who does know what it is and understand the full implications of it and can roll it back if necessary. We're at an impasse just now because the set of people who know what the change is is completely disjoint with the people who have +2 rights on that repo [23:18:16] MatmaRex: live on mw1017 [23:18:17] Dereckson: i just meant not to merge it before the mediawiki/core change is finalized. 
(which turned out to be a good idea, because the config format in it changed) [23:18:23] sorry, i was on a call with dell for the past 30 [23:18:27] did that change break stuff? [23:18:49] this is a fundamental organizational problem, and I don't really think I can do much to help. sorry! [23:18:52] Dereckson: oh, ugh, i don't have the magic to access that set up. :/ [23:19:05] What browser do you use? [23:19:36] Dereckson: to verify, go to any page, open VE, then Insert->Media->Upload, and verify that you get an upload form and not an error message [23:19:51] For Chrome, magic starts at https://chrome.google.com/webstore/detail/wikimediadebug/binmakecefompkjggiklgjenddjoifbb, for Firefox at https://addons.mozilla.org/en-US/firefox/addon/wikimedia-debug-header/ [23:20:05] Dereckson: opera. i can probably make it use chrome extensions [23:20:12] but you changed on testwiki too, so you can test without that on test.wikipedia.org [23:20:18] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 23 failures [23:20:18] RECOVERY - salt-minion processes on maps1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:20:20] RECOVERY - puppet last run on mw2130 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:20:37] Dereckson: oh, test.wikipedia.org is (still) mw1017 only? okay [23:20:38] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [23:20:39] yep [23:21:11] By the way, I remember in DragonFly, you can directly add a custom header to requests. In that case, you can send "X-Wikimedia-Debug: mw1017" without any need of an extension. [23:21:46] Dereckson: hmm, then i'm not seeing the change. wgForeignUploadTargets in JS console is still [], should be ['local']. maybe just caching [23:22:20] hu ? 
[23:22:24] $ mwrepl [23:22:37] hphpd> print_r($wgForeignUploadTargets) [23:22:43] it gives me an array with local [23:23:09] https://test.wikipedia.org/w/load.php?debug=true&lang=en&modules=startup&only=scripts&skin=vector [23:23:12] "wgForeignUploadTargets": [], [23:24:29] PROBLEM - Disk space on maps1002 is CRITICAL: Connection refused by host [23:24:30] PROBLEM - DPKG on maps1002 is CRITICAL: Connection refused by host [23:24:59] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:25:08] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 23 failures [23:25:09] 23:19:36 < MatmaRex> Dereckson: to verify, go to any page, open VE, then Insert->Media->Upload, and verify that you get an upload form and not an error message [23:25:20] I confirm I've still an upload form [23:25:37] yurik: there is some hope that the scap3 thing finally makes some headway towards saner config management [23:25:58] Dereckson: yeah. wgForeignUploadTargets on testwiki is still using the old value for me, though [23:26:05] grr bellatrix. fixing... [23:26:18] PROBLEM - salt-minion processes on maps1002 is CRITICAL: Connection refused by host [23:26:46] we have been asking for saner config management for a while: https://phabricator.wikimedia.org/T93428 [23:26:56] https://test.wikipedia.org/w/load.php?debug=true&lang=en&modules=startup&only=scripts&skin=vector [23:27:01] "wgForeignUploadTargets": [ [23:27:01] "local" [23:27:42] hmmm if I disable X Wikimedia Debug header indeed I've an empty array [23:27:59] so we also need this header for test., okay [23:28:19] RECOVERY - salt-minion processes on maps1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:28:30] yeah. anyway, if you see that on mw1017, that's good then :) [23:28:41] Dereckson: i guess testwiki isn't served by it only anymore? 
[23:29:09] (try wgHostname in JS console, i get various servers) [23:29:56] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Update cross-wiki upload configuration ([[Gerrit:293355]]) (duration: 00m 23s) [23:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:30:08] RECOVERY - check_puppetrun on bellatrix is OK: OK: Puppet is currently enabled, last run 72 seconds ago with 0 failures [23:30:36] Okay, so have a browser ready with the extension, or find how to inject header manually on DragonFly for next SWAT, you'll be able to test it yourself. [23:31:07] yeah. thanks :) [23:31:11] http://www.opera.com/dragonfly/documentation/network/ [23:31:35] in Network tab of the console > Network option it seems [23:31:45] Dereckson: i have the new opera too, which is basically reskinned chrome. it should be able to use chrome extensions [23:31:51] RECOVERY - Disk space on maps1002 is OK: DISK OK [23:31:51] RECOVERY - DPKG on maps1002 is OK: All packages OK [23:32:10] RECOVERY - dhclient process on maps1002 is OK: PROCS OK: 0 processes with command name dhclient [23:32:31] Amir1: okay so Zuul has merged 294115, we can test proceed [23:32:42] Dereckson: nice [23:32:46] gwicke, i already migrated both kartotherian & tilerator to scap3, graphoid is the only one left. Any ETA on this? [23:32:50] PROBLEM - Disk space on elastic1002 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 77557 MB (15% inode=99%) [23:32:54] or a proposal of any sort? [23:33:29] Dereckson: works just fine [23:33:37] shall we move on to the other one? [23:33:39] MatmaRex: not sure all Chrome extension works for Opera, they seem to use their own extensions systems, probably the same extension engine, but not the same APIs [23:33:54] Amir1: which one works? [23:34:05] yurik: the ball is in releng's court [23:34:09] 294114 [23:34:19] gwicke is there a phab ticket? 
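The check discussed above (requesting the startup module with "X-Wikimedia-Debug: mw1017" and reading wgForeignUploadTargets) could also be done without a browser extension. The following is a rough sketch only: the helper names are invented for illustration, the sample strings mirror the two outputs pasted in the channel, and nothing here is fetched from the network.

```python
import json
import re
import urllib.request

# URL pasted in the channel for the startup module.
STARTUP_URL = ("https://test.wikipedia.org/w/load.php"
               "?debug=true&lang=en&modules=startup&only=scripts&skin=vector")


def build_debug_request(url, backend="mw1017"):
    """Build a request carrying the X-Wikimedia-Debug header (not sent here)."""
    return urllib.request.Request(url, headers={"X-Wikimedia-Debug": backend})


def extract_config_value(script_text, key):
    """Pull the JSON value that follows '"key":' out of the startup script."""
    match = re.search(r'"%s"\s*:\s*' % re.escape(key), script_text)
    if match is None:
        raise KeyError(key)
    value, _ = json.JSONDecoder().raw_decode(script_text, match.end())
    return value


req = build_debug_request(STARTUP_URL)

# Sample fragments shaped like the two outputs quoted in the log.
without_header = '"wgForeignUploadTargets": [],'
with_header = '"wgForeignUploadTargets": [\n    "local"\n],'

print(extract_config_value(without_header, "wgForeignUploadTargets"))
print(extract_config_value(with_header, "wgForeignUploadTargets"))
```

To run the check for real, the request object would be passed to `urllib.request.urlopen` and the decoded response body handed to `extract_config_value`; that step is left out so the sketch stays offline.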
[23:34:26] Dereckson: 294103 [23:34:32] needs +2 [23:34:40] PROBLEM - MD RAID on install2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:34:44] Dereckson: at least some do, i had adblock or something installed at some point (before opera implemented built-in ad blocking) [23:34:53] yurik: not sure what the most up-to-date on config management in scap 3 is; thcipriani might know [23:35:00] Amir1: there is a problem [23:35:20] the last change in wmf/1.28.0-wmf.5 currently deployed is Merge "Use ores.wikimedia.org instead of ores.wmflabs.org" [23:35:31] I haven't yet deployed the link change. [23:35:40] Dereckson: you need to deploy it [23:35:50] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.015 second response time [23:36:10] Yes. And this is what I were doing when you told me 23:33:29 < Amir1> Dereckson: works just fine [23:36:11] thanks [23:36:23] So I thought you had tested it, and was puzzled how. [23:36:32] Dereckson: I was talking about beta wiki [23:36:39] RECOVERY - MD RAID on install2001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [23:36:42] oh yes, indeed [23:36:42] 294103 [23:36:45] Sorry [23:36:50] en.wikipedia.beta.wmflabs.org/wiki/Special:Preferences#mw-prefsection-betafeatures [23:37:30] RECOVERY - MD RAID on maps1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [23:37:55] (03PS2) 10Yuvipanda: labs: Disable diamond labswide (enable for 3 projects) [puppet] - 10https://gerrit.wikimedia.org/r/294196 (https://phabricator.wikimedia.org/T137753) [23:38:30] RECOVERY - Disk space on elastic1002 is OK: DISK OK [23:40:57] Okay 294115 live on mw1017 and all seems fine [23:41:09] PROBLEM - Disk space on maps1002 is CRITICAL: Connection refused by host [23:41:09] PROBLEM - DPKG on maps1002 is CRITICAL: Connection refused by host [23:42:30] PROBLEM - salt-minion processes on maps1002 is CRITICAL: Connection refused by host [23:42:41] !log dereckson@tin Synchronized 
php-1.28.0-wmf.5/extensions/ORES/includes/Hooks.php: Update links to beta features (duration: 00m 25s) [23:42:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:42:58] Now the config change. [23:43:29] PROBLEM - dhclient process on maps1002 is CRITICAL: Connection refused by host [23:43:29] PROBLEM - MD RAID on maps1002 is CRITICAL: Connection refused by host [23:44:30] RECOVERY - salt-minion processes on maps1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:44:35] jdloft: Amir1: to make things exit from beta to go to prod, you can request your own deployment window too by the way, that's helpful if you need time for an hotfix or check stuff [23:44:53] (03PS2) 10Dereckson: Add ORES to whitelisted beta features. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294103 (https://phabricator.wikimedia.org/T130211) (owner: 10Ladsgroup) [23:45:01] Dereckson: we did everything already [23:45:09] (03CR) 10Yuvipanda: [C: 032] labs: Disable diamond labswide (enable for 3 projects) [puppet] - 10https://gerrit.wikimedia.org/r/294196 (https://phabricator.wikimedia.org/T137753) (owner: 10Yuvipanda) [23:45:10] RECOVERY - Disk space on maps1002 is OK: DISK OK [23:45:10] RECOVERY - DPKG on maps1002 is OK: All packages OK [23:45:21] even tables and etc. are there [23:45:55] k [23:46:14] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294103 (https://phabricator.wikimedia.org/T130211) (owner: 10Ladsgroup) [23:46:18] but for Wikidata, probably we'll get a deployment window [23:46:53] (03Merged) 10jenkins-bot: Add ORES to whitelisted beta features. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/294103 (https://phabricator.wikimedia.org/T130211) (owner: 10Ladsgroup) [23:47:10] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [23:47:21] Could be valuable to ask A ude or H oo to be present at your window, they know very well how Wikidata is deployed and configured. [23:47:29] RECOVERY - dhclient process on maps1002 is OK: PROCS OK: 0 processes with command name dhclient [23:47:30] RECOVERY - MD RAID on maps1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [23:47:34] Amir1: live on mw1017, please test [23:48:50] Dereckson: works like a cahrm [23:48:55] *charm [23:48:57] \o/ [23:48:58] \o/ [23:49:20] Good, let's send them to prod then. [23:50:30] PROBLEM - salt-minion processes on maps1002 is CRITICAL: Connection refused by host [23:50:47] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Add ORES to whitelisted beta features (T130211) (duration: 00m 23s) [23:50:48] T130211: Deploy ORES extension in fawiki - https://phabricator.wikimedia.org/T130211 [23:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:51:09] PROBLEM - Disk space on maps1002 is CRITICAL: Connection refused by host [23:51:09] PROBLEM - DPKG on maps1002 is CRITICAL: Connection refused by host [23:51:51] Amir1: I was able to enable it at https://fa.wikipedia.org/wiki/%D9%88%DB%8C%DA%98%D9%87:%D8%AA%D8%B1%D8%AC%DB%8C%D8%AD%D8%A7%D8%AA [23:52:02] so looks good to me [23:52:52] https://fa.wikipedia.org/wiki/%D9%88%DB%8C%DA%98%D9%87:%D8%AA%D8%BA%DB%8C%DB%8C%D8%B1%D8%A7%D8%AA_%D8%A7%D8%AE%DB%8C%D8%B1 has already one highlighted change [23:53:02] yeah [23:53:04] Dereckson: thanks [23:53:06] nice [23:53:31] You're welcome. Thanks for testing. [23:53:53] congratulations, Amir1! 
[23:54:05] yuvipanda: thanks :) [23:54:18] now I'm so excited can't sleep :D [23:54:30] PROBLEM - Disk space on elastic1002 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 77449 MB (15% inode=99%) [23:54:30] RECOVERY - salt-minion processes on maps1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:54:32] :D [23:55:09] RECOVERY - Disk space on maps1002 is OK: DISK OK [23:55:10] RECOVERY - DPKG on maps1002 is OK: All packages OK [23:56:29] RECOVERY - puppet last run on maps1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:58:28] 9 revisions marked by ORES [23:58:47] I reverted several ones using the extension