[00:35:23] hmmm [00:37:19] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [00:37:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [00:39:09] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [00:39:18] Getting a lot of 503's and broken stylesheet loading [00:40:15] Doesn't look good https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?refresh=5m&orgId=1&from=now-1h&to=now [00:41:19] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0] [00:42:19] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [00:42:39] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [00:43:58] moritzm, are you here? [00:49:37] seems steady again [00:52:16] Not for me. dewiki down for me for at least half an hour. [00:54:19] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:54:22] most likely that's unrelated [00:54:39] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:54:54] I've looked at these recent 5xx spikes, they're fairly isolated and overall they're low-rate vs all traffic. it's not a "site down" sort of thing [00:55:20] but I can't find a good explanation yet, either [00:55:39] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:56:09] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:56:15] they peak around 1.6% of requests returning 5xx [00:56:34] the spike duration has gotten a little wider each time, too [00:57:19] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:57:27] the first one that looks like the current pattern was about 3 minutes wide, then there was a ~6 minute one, and the latest is closer to ~11 minutes [00:57:43] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5&from=1494798744351&to=1494809747081 [00:57:48] ^ the 3x spikes there [00:57:49] PROBLEM - puppet last run on ms-be1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:00:45] Anyway, working again here since about two minutes ago. [01:01:43] during the spikes, varnish stats show a dip in backend request rate (to e.g. MW), and a rise in total backend connections [01:01:58] that's the sort of pattern we'd expect if requests to MW (or another applayer backend) are stalling out and not answering quickly [01:05:09] mw fatalmonitor counts actually drop lower during the event [01:07:28] but one of the few fatals that does occur in that window is: Fatal error: request has exceeded memory limit in /srv/mediawiki/php-1.30.0-wmf.1/extensions/Echo/includes/DiscussionParser.php on line 641 [01:09:22] no such ooms in the earlier spike time ranges, though, so maybe that one's just a fluke [01:09:58] (I'm back to dewiki down, btw, but if you say it's unrelated, I'll be silent from now ;).) 
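The stall pattern bblack describes (backend request rate dipping while total backend connections rise) can be watched directly on a cache host; a minimal sketch assuming standard Varnish 4 tooling, with counter names that may vary by version:

    # Backend connection counters on a cache host (Varnish 4 names)
    varnishstat -1 -f MAIN.backend_conn -f MAIN.backend_busy -f MAIN.backend_fail
    # Show only transactions that produced a 5xx, grouped per request
    varnishlog -g request -q 'RespStatus >= 500'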
[01:12:16] pajz: you might try anonymous browsing (e.g. using chrome incognito), in case it's related to session/login-specific things [01:12:43] dewiki is definitely up for the bulk of traffic (which is anonymous) [01:13:15] (or it could be some local network condition, of course) [01:15:44] fwiw there doesn't seem to be a very specific pattern to the failing URLs. it hits API reqs, /wiki/Foo reqs, etc. It does seem to be MW reqs (as opposed to e.g. restbase or cxserver) [01:17:22] it's also not specific to certain varnishes afaics, spreads all around [01:17:51] well, frontends anyways... [01:18:51] now I'm getting somewhere though... it does seem to focus through cp1053.eqiad.wmnet as the backend-most varnish in the 5xx's [01:19:08] (which could mean that cache has a problem, but could also mean it's just the chash destination of some problematic traffic too) [01:19:10] bblack, thanks. Possible. Incognito/Switching browser doesn't help, but using a VPN does. So surely could be a local issue; curious timing, though (haven't run into issues for ages, and it's affecting only Wikipedia as far as I can tell). [01:22:50] [Mon May 15 00:36:48 2017] mce_notify_irq: 4 callbacks suppressed [01:22:50] [Mon May 15 00:36:48 2017] mce: [Hardware Error]: Machine check events logged [01:22:53] [Mon May 15 00:36:48 2017] CPU2: Core temperature/speed normal [01:22:56] [Mon May 15 00:36:48 2017] mce: [Hardware Error]: Machine check events logged [01:22:59] ^ cp1053 syslogs, probably a failing machine :( [01:24:10] the MCEs have been going on for at least a week, though maybe things have gotten worse [01:25:17] !log depooled cp1053 from all services (possible hardware issues) [01:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:25:49] RECOVERY - puppet last run on ms-be1027 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [01:28:40] 06Operations, 10ops-eqiad, 10Traffic: cp1053 possible hardware issues - https://phabricator.wikimedia.org/T165252#3261314 (10BBlack) [02:21:03] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 07m 42s) [02:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:27:02] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon May 15 02:27:02 UTC 2017 (duration 5m 59s) [02:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:13:30] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1326.40 Read Requests/Sec=6045.20 Write Requests/Sec=342.20 KBytes Read/Sec=27591.20 KBytes_Written/Sec=7091.20 [04:23:29] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=11.60 Read Requests/Sec=5.00 Write Requests/Sec=4.00 KBytes Read/Sec=67.60 KBytes_Written/Sec=98.40 [05:23:29] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [06:23:37] 06Operations: ms-be2023 freeze - https://phabricator.wikimedia.org/T162854#3261604 (10MoritzMuehlenhoff) p:05Triage>03Normal [06:52:31] 06Operations, 07HHVM: HHVM 3.18 crash on job runner / luasandbox - https://phabricator.wikimedia.org/T165043#3261623 (10tstarling) This sort of thing is much easier if there is a reproducible test case. Maybe we could parse a few different articles using benchmarkParse.php to try to trigger a crash. Failing th...
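A sketch of the hardware triage above, assuming standard Linux tools; the depool line is illustrative conftool syntax rather than necessarily the exact command bblack ran:

    # Look for Machine Check Exceptions in the kernel log
    dmesg -T | grep -iE 'mce|machine check'
    # How many MCE events have been logged since boot?
    journalctl -k | grep -c 'Machine check events logged'
    # Depool the suspect host from all services (illustrative confctl invocation)
    sudo confctl select 'name=cp1053.eqiad.wmnet' set/pooled=no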
[07:20:20] (03PS1) 10Giuseppe Lavagetto: build-alpine: do not error out if branch not present [puppet] - 10https://gerrit.wikimedia.org/r/353834 [07:21:05] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] build-alpine: do not error out if branch not present [puppet] - 10https://gerrit.wikimedia.org/r/353834 (owner: 10Giuseppe Lavagetto) [07:36:56] (03CR) 10Muehlenhoff: "This needs more context, what in particular is needed from the JDK packages?" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/353765 (owner: 10Paladox) [07:39:42] (03CR) 10Muehlenhoff: [C: 04-1] Fix debian-rules-missing-recommended-target (031 comment) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/353766 (owner: 10Paladox) [07:56:50] (03CR) 10Paladox: Fix debian-rules-missing-recommended-target (031 comment) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/353766 (owner: 10Paladox) [07:58:35] 06Operations, 10ops-eqiad: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3261853 (10akosiaris) Racking distribution sounds fine as well as naming. [08:00:05] Amir1: Dear anthropoid, the time has come. Please deploy ores_classification clean up party (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170515T0800). [08:05:24] 06Operations, 10ops-codfw: Decomission mw2098 - https://phabricator.wikimedia.org/T164959#3261883 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff>03None [08:06:07] I start cleaning up now [08:14:36] 06Operations, 10OTRS: Upgrade OTRS to 5.0.19 - https://phabricator.wikimedia.org/T165284#3261913 (10akosiaris) [08:14:40] 06Operations, 10OTRS: Upgrade OTRS to 5.0.19 - https://phabricator.wikimedia.org/T165284#3261929 (10akosiaris) p:05Triage>03Normal [08:16:54] (03CR) 10Alexandros Kosiaris: "I vaguely remember the same and it makes sense. I 'll rebase for branch 1.13" [debs/pybal] - 10https://gerrit.wikimedia.org/r/353525 (owner: 10Alexandros Kosiaris) [08:17:50] !log start of cleaning up ores_classification table [08:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:20] (03PS1) 10Alexandros Kosiaris: Change the default LVS BGP behavior per service [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/353836 [08:18:25] 06Operations: ms-be2023 freeze - https://phabricator.wikimedia.org/T162854#3261930 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi I don't think we've seen a reoccurence of this, though it is odd for sure. Tentatively closing and we can reopen if it happens again. [08:26:10] !log installing rtmpdump security updates on jessie [08:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:46] !log swift eqiad-prod: ms-be1028/ms-be1039 object weight 3000 - T160640 [08:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:53] T160640: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640 [08:47:05] 06Operations, 10netops: Report of esams unreachable from fastweb - https://phabricator.wikimedia.org/T165288#3262009 (10fgiunchedi) [08:51:45] 06Operations, 10netops: Report of esams unreachable from fastweb - https://phabricator.wikimedia.org/T165288#3262036 (10fgiunchedi) I tried a traceroute from our side and it takes a different path ``` filippo@cr1-esams> traceroute 2.235.74.121 traceroute to 2.235.74.121 (2.235.74.121), 30 hops max, 40 byte p... 
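The weight change godog logged for ms-be1028/ms-be1039 is done with swift-ring-builder; a minimal sketch (the builder file name and the device search value are assumptions):

    # Raise the new backends' device weights, then rebalance the ring
    swift-ring-builder object.builder set_weight <device-search-value> 3000
    swift-ring-builder object.builder rebalance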
[09:01:25] 06Operations, 10vm-requests, 05Goal, 07kubernetes: Set up kubernetes masters for codfw cluster - https://phabricator.wikimedia.org/T165291#3262063 (10akosiaris) [09:02:04] I can't connect to the wikimedia cluster – it seems like my shared ip is banned [09:03:40] freddy2k1: https://en.wikipedia.org/wiki/Wikipedia:IP_block_exemption [09:03:48] 06Operations, 13Patch-For-Review: Reduce rpcbind use - https://phabricator.wikimedia.org/T106477#3262088 (10MoritzMuehlenhoff) 05Open>03Resolved rpcbind and nfs-common have been removed from all jessie hosts except those which actually use NFS. In addition our base d-i jessie installation strips nfs-common... [09:03:48] no, that's the error https://pastebin.com/DadUt36A [09:04:08] I already requested an IP block exemption, TheDragonFire [09:05:29] freddy2k1: It looks like you're behind some sort of proxy. [09:05:44] freddy2k1: can you reach https://phabricator.wikimedia.org ? [09:05:51] no, i can't [09:05:58] but i can access wmflabs and wikitech [09:06:36] yeah that makes sense, those are not in esams [09:06:59] freddy2k1: can you share a traceroute 91.198.174.192 and your source ip? [09:08:19] my source ip is 137.226.39.166 [09:09:43] that's the traceroute (it can't work fully because the squid proxy doesn't operate on the same layer as traceroute): https://pastebin.com/HVHA7XnY [09:10:12] traceroute to the squid proxy is fine [09:10:56] !log swift codfw-prod: more ms-be2001/ms-be2012 decom - T162785 [09:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:05] T162785: Decomission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785 [09:11:23] I can access text-lb.eqiad.wikimedia.org via curl, but not text-lb.esams.wikimedia.org via curl [09:12:49] freddy2k1: yeah, thanks we've been receiving reports of similar issues for esams [09:13:20] okay, is there a way to access wikipedia now? [09:14:14] I'm supposed to teach a class how to edit and research on wikipedia, but with these connection issues I obviously can't [09:14:41] (03CR) 10Chad: [C: 04-2] "Nothing, this is not needed. Could possibly put it under suggests, but we only *require* the JRE. We already install the JDK via puppet." [debs/gerrit] - 10https://gerrit.wikimedia.org/r/353765 (owner: 10Paladox) [09:15:16] freddy2k1: we're investigating if we can resolve the issue soon yeah [09:15:53] okay fine [09:16:37] would it work if I use an IP address located near the eqiad cluster rather than a European one? [09:17:16] freddy2k1: yeah that would work, the issue is that one of our peers in Europe is having trouble [09:17:55] (03CR) 10Chad: "Or, just stop using this package. Cf T157414" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/353766 (owner: 10Paladox) [09:18:52] then i will try to find an open proxy to access eqiad. thanks a lot godog [09:19:53] (03Abandoned) 10Filippo Giunchedi: lvs: add logstash [puppet] - 10https://gerrit.wikimedia.org/r/324371 (https://phabricator.wikimedia.org/T151971) (owner: 10Filippo Giunchedi) [09:21:13] <_joe_> freddy2k1: try now? [09:22:15] yes, now it works perfectly _joe_ [09:22:21] thank you very much! [09:23:22] 06Operations, 10netops: Report of esams unreachable from fastweb - https://phabricator.wikimedia.org/T165288#3262153 (10ayounsi) Traffic is indeed not smooth as usual on the interface toward Init7. I called Init7 and disabled the v4 and v6 BGP sessions. The person I had on the phone mentioned that the engineer...
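The curl test above (eqiad reachable, esams not) generalizes into a quick client-side diagnosis; a sketch assuming mtr is installed, using the esams text-lb address already quoted in this thread:

    # Path/loss toward the esams edge: report mode, wide output, show AS numbers, 20 probes
    mtr -rwzc 20 91.198.174.192
    # Force a request through a specific edge regardless of GeoDNS
    curl -sv -o /dev/null --resolve en.wikipedia.org:443:91.198.174.192 https://en.wikipedia.org/wiki/Main_Page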
[09:26:50] 06Operations, 10netops: Report of esams unreachable from fastweb - https://phabricator.wikimedia.org/T165288#3262155 (10ayounsi) Got confirmation on IRC that the issue can't be reproduced. [09:31:47] 06Operations, 10netops: Report of esams unreachable from fastweb - https://phabricator.wikimedia.org/T165288#3262169 (10ayounsi) a:03ayounsi [09:34:44] !log installing bind security updates (we're using client-side libs/tools only) [09:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:05] 06Operations, 10netops: Report of esams unreachable from fastweb - https://phabricator.wikimedia.org/T165288#3262251 (10ayounsi) From Init7: >We are experiencing some BGP issues in our backbone. Troubleshooting is under way and I'll contact you once we fixed the issue. [09:46:05] (03PS1) 10Alexandros Kosiaris: Introduce acrux, acrab as codfw kubernetes masters [dns] - 10https://gerrit.wikimedia.org/r/353844 (https://phabricator.wikimedia.org/T165291) [09:51:54] 06Operations, 05Goal, 13Patch-For-Review, 15User-Joe, 07kubernetes: Upgrade calico to 2.2, document build process. - https://phabricator.wikimedia.org/T165024#3262286 (10Joe) [09:53:14] It seems it's cleaning up the table so fast, probably the load is super low or the table is now small enough that it can do faster lookups [09:54:02] (03PS1) 10Giuseppe Lavagetto: profile::calico::builder: use calico release info [puppet] - 10https://gerrit.wikimedia.org/r/353845 (https://phabricator.wikimedia.org/T165024) [09:54:52] (03CR) 10Giuseppe Lavagetto: [C: 031] Introduce acrux, acrab as codfw kubernetes masters [dns] - 10https://gerrit.wikimedia.org/r/353844 (https://phabricator.wikimedia.org/T165291) (owner: 10Alexandros Kosiaris) [10:03:38] 06Operations, 10Traffic, 13Patch-For-Review: varnish frontend transient memory usage keeps growing - https://phabricator.wikimedia.org/T165063#3262301 (10ema) 05Open>03Resolved a:03ema Crazy transient memory usage [[https://grafana.wikimedia.org/dashboard/db/varnish-transient-storage-usage?orgId=1&from... [10:06:24] 06Operations, 10netops: Report of esams unreachable from Fastweb/Init7 - https://phabricator.wikimedia.org/T165288#3262317 (10Nemo_bis) p:05Triage>03High [10:11:30] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::calico::builder: use calico release info [puppet] - 10https://gerrit.wikimedia.org/r/353845 (https://phabricator.wikimedia.org/T165024) (owner: 10Giuseppe Lavagetto) [10:14:24] !log installing fop security updates on trusty [10:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:46] !log installing batik security updates on trusty [10:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:50] 06Operations, 10netops: Report of esams unreachable from Fastweb/Init7 - https://phabricator.wikimedia.org/T165288#3262418 (10Pyb) My connection is chaotic since this morning. Other customers from the french ISP Bouygues report the same problem. This is my traceroute results: |--------------------------------...
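For the ores_classification cleanup Amir1 is running in parallel here, the usual pattern is deleting in bounded batches so replication lag stays small; a purely illustrative SQL sketch (the column name, cutoff and batch size are assumptions, not the actual maintenance script):

    -- Delete in batches; rerun until zero rows are affected,
    -- sleeping between batches so replicas can catch up
    DELETE FROM ores_classification
    WHERE oresc_id < 12345678  -- illustrative cutoff
    LIMIT 10000;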
[10:33:17] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce acrux, acrab as codfw kubernetes masters [dns] - 10https://gerrit.wikimedia.org/r/353844 (https://phabricator.wikimedia.org/T165291) (owner: 10Alexandros Kosiaris) [10:36:53] !log rebooting mw2224-mw2242 for update to Linux 4.9 [10:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:56] (03PS1) 10Filippo Giunchedi: logstash: move 'hostname' to 'host' for webrequest [puppet] - 10https://gerrit.wikimedia.org/r/353853 (https://phabricator.wikimedia.org/T149451) [10:51:08] 06Operations, 10Wikimedia-Logstash, 13Patch-For-Review, 15User-Elukey, 15User-fgiunchedi: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#3262486 (10fgiunchedi) [10:55:59] PROBLEM - HP RAID on ms-be1028 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [10:56:49] PROBLEM - HP RAID on ms-be1030 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [10:59:29] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [11:00:19] PROBLEM - HP RAID on ms-be1031 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [11:00:29] PROBLEM - HP RAID on ms-be1039 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [11:00:39] PROBLEM - HP RAID on ms-be1037 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [11:00:51] <_joe_> swift troubles? [11:00:59] PROBLEM - HP RAID on ms-be1029 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [11:01:08] _joe_, normally not [11:01:31] there is an issue with HP RAID under disk load (it times out) [11:01:43] <_joe_> ok [11:01:46] I have the same issue with some dbs [11:01:49] PROBLEM - HP RAID on ms-be1038 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [11:02:00] <_joe_> yeah the load is not higher than 1 hour ago [11:02:05] as long as the host responds it's not normal, but it is a known issue [11:02:07] disk load [11:02:09] PROBLEM - HP RAID on ms-be1035 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [11:02:13] not necessarily cpu or other load [11:02:17] <_joe_> ack [11:02:20] or you know [11:02:29] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [11:02:31] controller load of reporting stuff [11:02:32] I need five more minutes to finish cleaning up the table [11:02:44] cool to me [11:02:49] PROBLEM - HP RAID on ms-be1032 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [11:03:07] I will do a quick check that everything else is working ok [11:03:19] sigh, rebalance in progress, the alarms are expected but I forgot to silence [11:03:22] doing now [11:03:42] ah, so that is it - but everything I said is right? [11:04:30] T141252 ?
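T141252 is the known false alarm where the HP RAID NRPE check exceeds its timeout under heavy disk I/O even though the controller is healthy; reproducing it from the monitoring side looks roughly like this (the remote command name is an assumption):

    # Run the NRPE check by hand with the same 50s timeout seen in the alerts;
    # under heavy I/O the controller query can take longer than that
    /usr/lib/nagios/plugins/check_nrpe -H ms-be1028.eqiad.wmnet -c check_hpssacli -t 50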
[11:04:31] T141252: icinga hp raid check timeout on busy ms-be and db machines - https://phabricator.wikimedia.org/T141252 [11:05:05] yeah :( [11:05:11] _joe_, FYI ^ [11:05:29] I'll stop the cleaning, it might help [11:05:49] Amir1_, what you do impacts databases [11:05:56] those are "image servers" [11:05:59] nothing to do [11:06:21] oh, okay [11:06:29] Overall it was almost done [11:06:38] I had already noticed your work here: :-) https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=7&fullscreen&orgId=1 [11:06:56] (03PS1) 10Alexandros Kosiaris: Put acrab, acrux in the correct block [dns] - 10https://gerrit.wikimedia.org/r/353856 [11:07:10] :D [11:07:11] (that is an exaggeration, because row writes are amplified once per server [11:07:19] (03CR) 10Alexandros Kosiaris: [C: 032] Put acrab, acrux in the correct block [dns] - 10https://gerrit.wikimedia.org/r/353856 (owner: 10Alexandros Kosiaris) [11:07:42] yeah, so replicas will add to that number [11:09:08] !log cleaning up ores_classification has finished 18M rows deleted, current number of rows 38,937,217 (T159753) [11:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:16] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753 [11:09:47] We need to shrink that once the job gets deployed in WMF [11:10:02] in production I mean [11:10:25] yes, I know [11:10:33] it has to finish first :-) [11:11:24] Another 3-hour-window would be enough I think. I'll do it tomorrow. we'll see [11:12:02] don't worry, take your time [11:12:15] and again, thanks for doing this [11:13:00] :) hope that'd be useful [11:18:09] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] [11:19:09] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [11:20:45] (03PS1) 10Giuseppe Lavagetto: Add debian/repack to ease the upgrade process [calico-cni] - 10https://gerrit.wikimedia.org/r/353857 (https://phabricator.wikimedia.org/T165024) [11:21:21] 06Operations, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Upgrade mysqld_exporter to 0.10.0 - https://phabricator.wikimedia.org/T161296#3262587 (10fgiunchedi) Diff in variables on db2048 (i.e. `connection_name` is added, no other changes) ``` -mysql_slave_status_connect_retry{channel_name="",master_... [11:40:28] jouncebot: refresh [11:40:30] I refreshed my knowledge about deployments. [11:40:31] jouncebot: next [11:40:32] In 0 hour(s) and 19 minute(s): Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170515T1200) [11:41:08] addshore: gilles: Zeljko and I are attending the releng offsite so we can not take care of SWAT [11:41:28] looks like there's a single config change so it should not be too hard to handle :-} [11:44:55] (03PS2) 10Giuseppe Lavagetto: Add debian/repack to ease the upgrade process [calico-cni] - 10https://gerrit.wikimedia.org/r/353857 (https://phabricator.wikimedia.org/T165024) [11:47:27] where is your offsite hashar? [11:47:39] * hashar escapes to meeting [11:47:44] haha [11:49:04] I can swat :) [11:57:24] (03CR) 10Ottomata: [C: 031] "+1, but requires https://gerrit.wikimedia.org/r/#/c/352579 deployed first."
[puppet] - 10https://gerrit.wikimedia.org/r/352582 (https://phabricator.wikimedia.org/T67508) (owner: 10Fdans) [12:00:04] aude: Respected human, time to deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170515T1200). Please do the needful. [12:04:09] (03PS1) 10Aude: Revert "Don't enable tabular-data data type yet on Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353862 (https://phabricator.wikimedia.org/T164207) [12:06:42] (03PS1) 10Alexandros Kosiaris: Add acrux, acrab to the infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/353864 (https://phabricator.wikimedia.org/T165291) [12:09:31] (03PS1) 10Alexandros Kosiaris: Add kubemaster LVS service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/353865 [12:11:02] addshore: thanks! [12:15:02] (03CR) 10Alexandros Kosiaris: [C: 032] Add acrux, acrab to the infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/353864 (https://phabricator.wikimedia.org/T165291) (owner: 10Alexandros Kosiaris) [12:16:04] (03PS3) 10Giuseppe Lavagetto: Add debian/repack to ease the upgrade process [calico-cni] - 10https://gerrit.wikimedia.org/r/353857 (https://phabricator.wikimedia.org/T165024) [12:16:39] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Add debian/repack to ease the upgrade process [calico-cni] - 10https://gerrit.wikimedia.org/r/353857 (https://phabricator.wikimedia.org/T165024) (owner: 10Giuseppe Lavagetto) [12:17:29] (03PS1) 10Giuseppe Lavagetto: New upstream version 1.8.3 [calico-cni] - 10https://gerrit.wikimedia.org/r/353867 [12:17:31] (03PS1) 10Giuseppe Lavagetto: Updating debian version [calico-cni] - 10https://gerrit.wikimedia.org/r/353868 [12:17:33] (03PS1) 10Giuseppe Lavagetto: package name change [calico-cni] - 10https://gerrit.wikimedia.org/r/353869 [12:26:57] (03CR) 10Aude: [C: 032] Revert "Don't enable tabular-data data type yet on Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353862 (https://phabricator.wikimedia.org/T164207) (owner: 10Aude) [12:29:33] (03Merged) 10jenkins-bot: Revert "Don't enable tabular-data data type yet on Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353862 (https://phabricator.wikimedia.org/T164207) (owner: 10Aude) [12:29:43] PROBLEM - swift-container-auditor on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:29:44] (03CR) 10jenkins-bot: Revert "Don't enable tabular-data data type yet on Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353862 (https://phabricator.wikimedia.org/T164207) (owner: 10Aude) [12:30:03] PROBLEM - swift-account-auditor on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:30:03] PROBLEM - swift-object-server on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:30:03] PROBLEM - swift-container-server on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:30:03] PROBLEM - swift-account-reaper on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:30:04] PROBLEM - swift-container-updater on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:30:04] PROBLEM - swift-object-replicator on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:30:04] PROBLEM - dhclient process on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:30:04] PROBLEM - swift-object-auditor on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:30:04] PROBLEM - swift-account-server on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:30:05] PROBLEM - swift-account-replicator on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:30:05] PROBLEM - swift-container-replicator on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:30:06] PROBLEM - salt-minion processes on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:30:13] PROBLEM - swift-object-updater on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:30:33] RECOVERY - swift-container-auditor on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:30:53] RECOVERY - swift-object-server on ms-be1019 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [12:30:53] RECOVERY - swift-account-auditor on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [12:30:53] RECOVERY - swift-account-replicator on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [12:30:53] RECOVERY - swift-object-auditor on ms-be1019 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [12:30:54] RECOVERY - swift-account-server on ms-be1019 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [12:30:54] RECOVERY - swift-container-updater on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [12:30:54] RECOVERY - swift-object-replicator on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [12:30:54] RECOVERY - swift-container-replicator on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [12:30:54] RECOVERY - dhclient process on ms-be1019 is OK: PROCS OK: 0 processes with command name dhclient [12:30:55] RECOVERY - salt-minion processes on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:30:55] RECOVERY - swift-account-reaper on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [12:30:56] RECOVERY - swift-container-server on ms-be1019 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [12:31:03] RECOVERY - swift-object-updater on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [12:33:00] !log aude@tin Synchronized wmf-config/Wikibase-production.php: Enable data type for tabular data (duration: 00m 41s) [12:33:02] (03CR) 10Ema: [C: 031] Change the default LVS BGP behavior per service [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/353836 (owner: 10Alexandros Kosiaris) [12:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:37] 06Operations, 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, and 2 others: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3262796 (10MoritzMuehlenhoff) I've built new HHVM packages with a patch as proposed by upstream in https://github.com/facebook/hhvm/issues/7779... 
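The PROBLEM/RECOVERY flood above comes from per-process NRPE checks, and the RECOVERY lines spell out exactly what they assert; a hand-run equivalent, assuming the stock monitoring-plugins check_procs:

    # Exactly one container-auditor process, matched on the full argument list
    # (the same regex shown in the RECOVERY output)
    /usr/lib/nagios/plugins/check_procs -c 1:1 \
        --ereg-argument-array='^/usr/bin/python /usr/bin/swift-container-auditor'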
[12:33:59] (03CR) 10Alexandros Kosiaris: [C: 032] Change the default LVS BGP behavior per service [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/353836 (owner: 10Alexandros Kosiaris) [12:34:22] 06Operations, 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, and 2 others: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3262799 (10MoritzMuehlenhoff) (Tested in mediawiki-vagrant) [12:39:25] (03PS1) 10Ayounsi: LibreNMS: Use default OSM tiles provider + simplify syslog filtering [puppet] - 10https://gerrit.wikimedia.org/r/353871 (https://phabricator.wikimedia.org/T164911) [12:45:17] (03CR) 10Ayounsi: [C: 032] LibreNMS: Use default OSM tiles provider + simplify syslog filtering [puppet] - 10https://gerrit.wikimedia.org/r/353871 (https://phabricator.wikimedia.org/T164911) (owner: 10Ayounsi) [12:53:17] (03CR) 10Filippo Giunchedi: [C: 031] Setup apache vhost on scap proxies as well (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344221 (owner: 10Chad) [12:53:18] 06Operations, 10netops, 13Patch-For-Review: LibreNMS improvements - https://phabricator.wikimedia.org/T164911#3262843 (10ayounsi) [12:53:22] (03CR) 10Filippo Giunchedi: Setup apache vhost on scap proxies as well [puppet] - 10https://gerrit.wikimedia.org/r/344221 (owner: 10Chad) [12:54:14] !log upload pybal 1.13.6 to apt.wikimedia.org/jessie-wikimedia/main [12:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:23] thanks akosiaris :) [12:58:58] 06Operations, 10netops: Report of esams unreachable from Fastweb/Init7 - https://phabricator.wikimedia.org/T165288#3262862 (10Nemo_bis) [12:59:09] (03PS5) 10Addshore: Add QuickSurvey for reader segmentation research [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353053 (https://phabricator.wikimedia.org/T131949) (owner: 10Nschaaf) [12:59:28] aude, im guessing you are all done with your slot? :) [12:59:40] jouncebot: refresh [12:59:42] I refreshed my knowledge about deployments. [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170515T1300). Please do the needful. [13:00:04] schana, gilles, addshore, and James_F: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:12] o/ [13:00:15] here [13:00:21] \o [13:00:30] that timing on that refresh cmd from me was perfect xD [13:00:39] (03CR) 10Addshore: [C: 032] Add QuickSurvey for reader segmentation research [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353053 (https://phabricator.wikimedia.org/T131949) (owner: 10Nschaaf) [13:01:45] (03Merged) 10jenkins-bot: Add QuickSurvey for reader segmentation research [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353053 (https://phabricator.wikimedia.org/T131949) (owner: 10Nschaaf) [13:01:57] (03CR) 10jenkins-bot: Add QuickSurvey for reader segmentation research [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353053 (https://phabricator.wikimedia.org/T131949) (owner: 10Nschaaf) [13:02:08] schana: is it testable before i sync it? [13:02:18] testable how? [13:02:27] I'm waiting to try it on wiki [13:02:31] on the mwdebug servers! [13:02:43] do you have the browser extension installed? [13:02:48] no [13:03:07] schana: chome or firefox?
chrome* [13:03:11] chrome [13:03:15] schana: https://chrome.google.com/webstore/detail/wikimediadebug/binmakecefompkjggiklgjenddjoifbb [13:03:28] your code is in mwdebug1002 right now [13:04:45] if I'm using the extension right, it doesn't look like the quick survey is live [13:04:56] (looking at wgEnabledQuickSurveys in console) [13:05:08] or trying page with ?quicksurvey=true [13:05:36] schana: did you set the server to mwdebug1002? [13:05:41] yes [13:05:42] and turn it on? [13:05:44] yes [13:05:51] what URL are you checking on? [13:05:57] https://de.wikipedia.org/wiki/Apple?quicksurvey=true [13:07:09] make sure read-only is off [13:07:23] it's off [13:07:52] try shift+f5 [13:08:24] I see the ext.quicksurveys.init module loaded when viewing the page from mwdebug1002 [13:08:50] addshore: gerrit change, ill see if i cannot see any changes when i try [13:08:54] link* [13:09:02] looks like the variable is now present for de [13:09:12] still using the debug extension [13:09:18] schana: so it works now? [13:09:33] I'm not able to trigger it with the url parameter [13:09:42] but that may be a QuickSurveys thing [13:09:48] I'm not familiar with that codebase [13:10:07] neither am I [13:10:21] https://www.mediawiki.org/wiki/Extension:QuickSurveys#How_to_load_a_specific_survey [13:10:24] for reference [13:10:51] schana: In your config, you've a 'coverage' option, perhaps that forbids the display? [13:10:55] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:353053|Add QuickSurvey for reader segmentation research]] T131949 T164769 T164894 T164960 T164963 (duration: 00m 40s) [13:11:01] To load a random survey append ?quicksurvey=true to the URL; [13:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:05] T164894: Test reader survey in multiple languages - Japanese - https://phabricator.wikimedia.org/T164894 [13:11:06] T164963: Test reader survey in multiple languages - Hebrew - https://phabricator.wikimedia.org/T164963 [13:11:06] T131949: Repeat the big English reader survey in one or two more languages - https://phabricator.wikimedia.org/T131949 [13:11:06] T164769: Test reader survey in multiple languages - Romanian - https://phabricator.wikimedia.org/T164769 [13:11:06] T164960: Test reader survey in multiple languages - German - https://phabricator.wikimedia.org/T164960 [13:11:07] setting the url parameter should force it on [13:11:08] it seems to only enable a survey [13:11:15] To load an external survey whose name is 'external example survey' append ?quicksurvey=external-survey-external example survey to the URL. [13:11:18] this is the one you should use [13:11:23] https://de.wikipedia.org/wiki/Apple?quicksurvey=Reader-segmentation-3-de-test [13:11:26] still doesn't work [13:11:36] when the browser's "Do Not Track" feature is turned on; [13:11:36] on skin Minerva when the beta optin panel is shown; [13:11:36] if a survey is an external one and points to non-https location when the config variable `wgQuickSurveysRequireHttps` is set to `true`. [13:11:43] it won't show [13:12:36] (03PS1) 10Ema: VCL: lower grace for transient n-hit-wonder objects [puppet] - 10https://gerrit.wikimedia.org/r/353874 (https://phabricator.wikimedia.org/T165063) [13:13:19] works in firefox with the debug extension [13:13:26] must be some chrome setting [13:13:39] I'm still here, sorry.
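Besides the browser extension, a change staged on mwdebug1002 can be exercised from the command line by setting the debug header; a sketch (the header value format follows the wikitech docs of this era and should be treated as an assumption):

    # Route a request to mwdebug1002 instead of a pooled appserver
    curl -s -D - -o /dev/null \
        -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' \
        'https://de.wikipedia.org/wiki/Apple?quicksurvey=true'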
[13:13:50] https://de.wikipedia.org/wiki/Apple?quicksurvey=internal%20example%20survey doesn't work either [13:13:54] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796#3262945 (10fgiunchedi) List of candidates for deletion: (note some criteria might overlap) | Criteria | Count | Bytes (GB) | | -- | -- | -- | | W... [13:14:11] (03CR) 10BBlack: [C: 031] VCL: lower grace for transient n-hit-wonder objects [puppet] - 10https://gerrit.wikimedia.org/r/353874 (https://phabricator.wikimedia.org/T165063) (owner: 10Ema) [13:14:15] schana: do not track must be on [13:14:16] "enabled": true, [13:14:29] I just checked ja he ro and de [13:14:33] This one could be the issue [13:14:33] they all work for me in firefox [13:14:48] but not on Chrome? [13:15:01] gilles: are you around and do your changes have to go out together? [13:15:01] let me check Dereckson [13:15:07] (03CR) 10Ema: [C: 032] VCL: lower grace for transient n-hit-wonder objects [puppet] - 10https://gerrit.wikimedia.org/r/353874 (https://phabricator.wikimedia.org/T165063) (owner: 10Ema) [13:15:08] addshore: mwdebug1002 right? [13:15:08] not on chrome [13:15:13] but I might have do not track on [13:15:48] Zppix: yes, well, everywhere, the sync has been done [13:15:55] turning do not track off makes the survey load in chrome [13:16:13] PROBLEM - Check correctness of the icinga configuration on tegmen is CRITICAL: Icinga configuration contains errors [13:16:20] schana: my issue is this: [13:16:24] var_dump($wgQuickSurveysConfig[1]['enabled']) [13:16:24] bool(false) [13:16:26] schana: Dereckson it works on my end [13:16:40] i use chrome [13:16:40] James_F: are you happy for both of your changes to be deployed at once? [13:17:03] Yes. [13:18:22] Dereckson: I'm not sure what you're referring to [13:19:24] !log addshore@tin Synchronized php-1.30.0-wmf.1/extensions/Cognate/src/CognateStore.php: SWAT: [[gerrit:353860|Add a clear-first option to populatePages script]] T164407 PT 1/2 (duration: 00m 40s) [13:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:32] T164407: Cognate has been disabled from WMF because it caused an outage on x1 by overtaking 10000 concurrent connections - https://phabricator.wikimedia.org/T164407 [13:19:33] addshore: I'm here and yes they can go out together [13:20:11] gilles: can, or should? :) [13:20:18] they don't have to, whichever way saves you time [13:20:23] Great! 
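Ema's change merged above lowers grace for transient "n-hit-wonder" objects; in VCL that kind of tweak is a conditional set on beresp, along these lines (purely illustrative, with an assumed condition; the real logic is in the linked Gerrit patch):

    sub vcl_backend_response {
        # Short-lived one-hit-wonder objects don't need a long grace window
        if (beresp.ttl < 60s) {
            set beresp.grace = 10s;
        }
    }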
[13:20:29] !log addshore@tin Synchronized php-1.30.0-wmf.1/extensions/Cognate/maintenance/populateCognatePages.php: SWAT: [[gerrit:353860|Add a clear-first option to populatePages script]] T164407 PT 2/2 (duration: 00m 39s) [13:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:52] schana: to the fact sample surveys are disabled (the first two, 0 and 1), only yours is enabled (the third, 2) / ack'ed it works [13:23:30] (03PS1) 10Muehlenhoff: package_builder: Install patchutils [puppet] - 10https://gerrit.wikimedia.org/r/353875 [13:25:23] (03CR) 10Alexandros Kosiaris: [C: 031] package_builder: Install patchutils [puppet] - 10https://gerrit.wikimedia.org/r/353875 (owner: 10Muehlenhoff) [13:26:04] (03PS1) 10Nschaaf: Disable test reader QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353876 (https://phabricator.wikimedia.org/T131949) [13:27:45] !log uploaded HHVM 3.18.2+dfsg-1+wmf3 to apt.wikimedia.org (addresses segfault in XML reader (T162586, T165074) [13:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:53] T165074: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074 [13:27:54] T162586: HHVM segfault in memory cleanup - https://phabricator.wikimedia.org/T162586 [13:28:26] *twiddles thumbs waiting for jenkins* [13:29:33] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [13:29:44] (03PS13) 10Giuseppe Lavagetto: restbase: migration to role/profile for the dev cluster [puppet] - 10https://gerrit.wikimedia.org/r/352851 [13:29:58] 06Operations, 07HHVM, 07Upstream: HHVM segfault in memory cleanup - https://phabricator.wikimedia.org/T162586#3262991 (10MoritzMuehlenhoff) This is fixed in 3.18.2+dfsg-1+wmf3. So far this has only been reproduced with the test case from the test suite, I'll keep this bug open until it's fully rolled out to... [13:30:34] addshore: do you have a timestamp of when the surveys went live? [13:30:52] 13:10 addshore@tin: Synchronized wmf-config/InitialiseSettings.php: SWAT: Add QuickSurvey for reader segmentation research T131949 T164769 T164894 T164960 T164963 (duration: 00m 40s) [13:30:52] T164894: Test reader survey in multiple languages - Japanese - https://phabricator.wikimedia.org/T164894 [13:30:53] T164963: Test reader survey in multiple languages - Hebrew - https://phabricator.wikimedia.org/T164963 [13:30:53] T131949: Repeat the big English reader survey in one or two more languages - https://phabricator.wikimedia.org/T131949 [13:30:53] T164769: Test reader survey in multiple languages - Romanian - https://phabricator.wikimedia.org/T164769 [13:30:53] T164960: Test reader survey in multiple languages - German - https://phabricator.wikimedia.org/T164960 [13:31:02] thanks [13:32:33] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [13:33:40] James_F: looks like jenkins finally merged them [13:33:55] Yay. [13:34:00] Pulling to mw1002? [13:34:03] will do [13:34:11] should be there now James_F [13:34:27] strange to have SF people awake at this hour :) [13:35:03] addshore: LGTM. [13:35:09] ack! [13:35:16] aude: I'm currently in London. 
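The var_dump confusion earlier in this exchange resolves once you see that $wgQuickSurveysConfig is a plain list: the two bundled example surveys sit at indexes 0 and 1 with enabled set to false, and the newly deployed survey is index 2. Schematically (survey names are taken from this thread; any other fields are assumptions about the extension's config):

    // $wgQuickSurveysConfig is a list; the sample surveys ship disabled
    $wgQuickSurveysConfig = [
        [ 'name' => 'internal example survey', 'enabled' => false /* , ... */ ],
        [ 'name' => 'external example survey', 'enabled' => false /* , ... */ ],
        [ 'name' => 'Reader-segmentation-3-de-test', 'enabled' => true /* , ... */ ],
    ];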
[13:35:30] yeah, figured :) [13:36:22] (03PS1) 10Filippo Giunchedi: swift: introduce storage policies [puppet] - 10https://gerrit.wikimedia.org/r/353878 (https://phabricator.wikimedia.org/T151648) [13:36:30] (03CR) 10Giuseppe Lavagetto: [C: 032] restbase: migration to role/profile for the dev cluster [puppet] - 10https://gerrit.wikimedia.org/r/352851 (owner: 10Giuseppe Lavagetto) [13:36:36] <_joe_> mobrovac: ^^ [13:36:48] kk [13:36:48] <_joe_> going to apply it one cluster at a time, starting with aqs [13:36:49] James_F: syncing [13:37:05] Ta. [13:37:24] !log addshore@tin Synchronized php-1.30.0-wmf.1/extensions/VisualEditor: SWAT: [[gerrit:353861|#1]] [[gerrit:353863|#2]] T165238 T165238 VisualEditor (duration: 00m 41s) [13:37:27] James_F: ^^ [13:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:34] T165238: Source editor fails to load on direct non-view page loads where the wiki doesn't have Single Edit Tab enabled - https://phabricator.wikimedia.org/T165238 [13:37:44] gilles: looks like the mediawiki tests are still running for yours! nearly there! [13:37:54] PROBLEM - puppet last run on aqs1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:37:58] addshore: Thanks! [13:38:53] PROBLEM - HHVM jobrunner on mw1165 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:54] RECOVERY - puppet last run on aqs1004 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [13:39:53] gilles: looks like they have been merged! [13:39:57] indeed [13:40:57] gilles: they are on mwdebug1002 [13:41:04] testing [13:41:25] <_joe_> mobrovac: no changes on any cluster for now, next I'm gonna reenable puppet everywhere but the dev cluster [13:41:56] _joe_: already applied to RB prod? [13:42:04] <_joe_> to one machine [13:42:06] <_joe_> noop [13:42:09] kk [13:42:22] <_joe_> I usually do one machine per role, basically [13:43:42] <_joe_> I'm going to apply it to restbase-dev1001 now [13:44:31] <_joe_> heh I forgot one commit to the private repo [13:46:21] addshore: seems to work for djvu, I can't find a video small enough to pass through the debug header bug where we can't upload large files. will test once it's deployed [13:46:32] synicng [13:46:37] urm... syncing... :P [13:46:58] James_F: when are you heading to vienna? [13:47:03] PROBLEM - puppet last run on restbase-dev1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:47:10] !log addshore@tin Synchronized php-1.30.0-wmf.1/extensions/TimedMediaHandler/handlers: SWAT: [[gerrit:353505|Fix X-Content-Dimensions support]] T150741 (duration: 00m 40s) [13:47:14] addshore: Thursday. [13:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:17] T150741: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741 [13:47:37] addshore: You?
[13:47:52] 06Operations, 05Prometheus-metrics-monitoring: Add Prometheus machine metric to track core dumps - https://phabricator.wikimedia.org/T165323#3263065 (10MoritzMuehlenhoff) [13:48:41] !log addshore@tin Synchronized php-1.30.0-wmf.1/includes/media/DjVu.php: SWAT: [[gerrit:353504|Add X-Content-Dimensions support to DjVu]] T150741 (duration: 00m 39s) [13:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:55] (03PS2) 10Filippo Giunchedi: swift: introduce storage policies [puppet] - 10https://gerrit.wikimedia.org/r/353878 (https://phabricator.wikimedia.org/T151648) [13:49:00] <_joe_> mobrovac: puppet is applying correctly; you might want to restart restbase once I'm done [13:49:03] RECOVERY - puppet last run on restbase-dev1001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [13:49:21] yup _joe_, let me know once it's applied on the whole dev cluster [13:49:28] gilles: all done! [13:49:43] James_F: I'm in prague, but also heading to vienna on thursday [13:49:49] Aha. See you there. :-) [13:49:59] You might be on the same flight as Tom and a couple of others :P [13:50:08] 06Operations, 10netops, 13Patch-For-Review: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#3263083 (10ayounsi) @fgiunchedi indeed, it's happening again. During those jobs, ports are completely saturated. Because of the nature of t... [13:50:10] <_joe_> mobrovac: btw restbase was configured to contact eventbus in eqiad [13:50:11] addshore: second patch works fine, thank you very much [13:50:17] <_joe_> from every DC [13:50:25] <_joe_> was that by design or by omission? [13:50:26] lovely, and thus SWAT is done! [13:50:54] !log upgrading mwdebug servers to 3.18.2+wmf3 [13:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:59] _joe_: as it should, in the dev cluster as it exists only in eqiad [13:52:02] 06Operations, 06Release-Engineering-Team, 10Traffic: Can't upload large files with X-Wikimedia-Debug turned on - https://phabricator.wikimedia.org/T165324#3263087 (10Gilles) [13:52:32] <_joe_> mobrovac: no, every server points to eqiad [13:52:55] _joe_: ok, let's step back, which rb cluster are we talking about? [13:53:16] <_joe_> mobrovac: all of them had eventlogging_service_uri: "http://eventbus.svc.eqiad.wmnet:8085/v1/events" [13:53:32] <_joe_> production, test, dev, both DCs [13:53:47] <_joe_> I just changed it to the discovery URI for the dev cluster [13:54:14] <_joe_> but I wanted to check if this is eqiad-only by design [13:55:40] lemme see the cp config and will answer it _joe_ :P [13:57:56] _joe_: omission, it was done that way as we were using only one DC, but now that we have two in operation it can be local [13:58:15] <_joe_> ok thanks [13:58:17] _joe_: this is only relevant for purges as rb only sends purge events to eventbus [13:58:23] <_joe_> ok [13:58:43] <_joe_> so if we send purge events to EB in codfw, would it be seen by clients in eqiad? [13:59:33] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [14:00:04] _joe_: CP clients? 
yes [14:00:32] <_joe_> uhm ok [14:02:33] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [14:05:30] 06Operations, 05Goal, 07kubernetes: Expand the infrastructure to codfw - https://phabricator.wikimedia.org/T162041#3263125 (10akosiaris) [14:13:56] (03CR) 10Alexandros Kosiaris: [C: 032] Add kubemaster LVS service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/353865 (owner: 10Alexandros Kosiaris) [14:14:00] (03PS2) 10Alexandros Kosiaris: Add kubemaster LVS service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/353865 [14:14:03] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add kubemaster LVS service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/353865 (owner: 10Alexandros Kosiaris) [14:20:31] 06Operations, 05Goal, 07kubernetes: Expand the infrastructure to codfw - https://phabricator.wikimedia.org/T162041#3263163 (10akosiaris) [14:20:33] 06Operations, 10vm-requests, 05Goal, 13Patch-For-Review, 07kubernetes: Set up kubernetes masters for codfw cluster - https://phabricator.wikimedia.org/T165291#3263160 (10akosiaris) 05Open>03Resolved a:03akosiaris kubernetes master `acrab` and `acrux` are up and running and LVS service IP `10.2.1.8`... [14:21:23] 06Operations, 05Goal, 07kubernetes: Expand the infrastructure to codfw - https://phabricator.wikimedia.org/T162041#3150620 (10akosiaris) [14:22:03] 06Operations, 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, and 2 others: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3263168 (10Ladsgroup) Just saying this also happens in travis instances causing Wikibase travis tests to fail. https://travis-ci.org/wikimedia/... [14:22:15] _joe_: still applying? [14:22:27] <_joe_> mobrovac: sorry, no, done [14:22:56] (03CR) 10Giuseppe Lavagetto: [C: 031] "https://puppet-compiler.wmflabs.org/6415" [puppet] - 10https://gerrit.wikimedia.org/r/353047 (owner: 10Giuseppe Lavagetto) [14:23:51] ok thnx _joe_, will restart now then [14:24:52] !log mobrovac@tin Started restart [restbase/deploy@c70a1e1] (dev-cluster): Restart after applying https://gerrit.wikimedia.org/r/#/c/352851/ [14:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:35] _joe_: dev cluster looking good! [14:29:50] <_joe_> cool [14:30:02] <_joe_> I'll move on with the other changes then [14:30:13] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [14:30:33] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [14:30:34] thnx [14:31:06] 06Operations, 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, and 2 others: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3263202 (10MoritzMuehlenhoff) @Ladsgroup I'm not sure how that Travis setup is configured, but if you make it install HHVM 3.18.2+dfsg-1+wmf3,... [14:32:11] (03Abandoned) 10Paladox: Install openjdk jdk version instead of jre [debs/gerrit] - 10https://gerrit.wikimedia.org/r/353765 (owner: 10Paladox) [14:33:08] (03CR) 10Paladox: "> Or, just stop using this package.
Cf T157414" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/353766 (owner: 10Paladox) [14:33:33] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [14:39:47] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: kubernetes200[1-4] racking and onsite setup task - https://phabricator.wikimedia.org/T164851#3263242 (10Papaul) @akosiaris i am getting this while trying to install the systems ┌────────────────────┤ [!!] Partition disks ├──────────────────┐... [14:43:25] 06Operations, 10Traffic: Can't upload large files with X-Wikimedia-Debug turned on - https://phabricator.wikimedia.org/T165324#3263258 (10greg) (not really a RelEng task, we care about the debug servers and use them, but Ops manages them and the nginx config) [14:46:25] (03PS3) 10Giuseppe Lavagetto: cassandra::instance: allow use of default values [puppet] - 10https://gerrit.wikimedia.org/r/353047 [14:47:42] (03CR) 10Giuseppe Lavagetto: [C: 032] cassandra::instance: allow use of default values [puppet] - 10https://gerrit.wikimedia.org/r/353047 (owner: 10Giuseppe Lavagetto) [14:49:24] (03PS3) 10Giuseppe Lavagetto: restbase: convert test cluster to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/353048 [15:06:00] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: kubernetes200[1-4] racking and onsite setup task - https://phabricator.wikimedia.org/T164851#3263423 (10RobH) [15:07:52] !log mobrovac@tin Started deploy [citoid/deploy@3ed34ef]: Better publishing date extraction support - T132308 [15:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:00] T132308: Figure out how to deal with incomplete dates, i.e. year only or year and month only - https://phabricator.wikimedia.org/T132308 [15:10:42] !log mobrovac@tin Finished deploy [citoid/deploy@3ed34ef]: Better publishing date extraction support - T132308 (duration: 02m 49s) [15:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:06] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 06DC-Ops: analytics1030 stuck in console while booting - https://phabricator.wikimedia.org/T162046#3263525 (10Cmjohnson) I've been contacted by Dell regarding the support task. The part is back ordered and may be a few more days. [15:20:10] 06Operations, 10ops-codfw: Decomission mw2098 - https://phabricator.wikimedia.org/T164959#3263577 (10RobH) a:03RobH Ok, I'll steal this task for the decom, because it has to have a few things. 1) All decom tasks should be flagged with #hardware-requests 2) All decom tasks should have the decom checklist cop... [15:22:38] (03PS4) 10Giuseppe Lavagetto: restbase: convert test cluster to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/353048 [15:24:13] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [15:26:13] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [15:27:33] 06Operations, 10ops-codfw, 10hardware-requests: Decomission mw2098 - https://phabricator.wikimedia.org/T164959#3263657 (10RobH) a:05RobH>03faidon [15:29:16] 06Operations, 10ops-codfw, 10hardware-requests: Decomission mw2098 - https://phabricator.wikimedia.org/T164959#3263664 (10faidon) a:05faidon>03RobH Sounds fine, approved. 
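Service deploys like the citoid one above go through scap3 from the deployment server; a minimal sketch (the deploy-repo path is an assumption):

    # On tin, from the service's deploy repository
    cd /srv/deployment/citoid/deploy
    scap deploy 'Better publishing date extraction support - T132308'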
[15:30:15] 06Operations, 10ops-codfw, 10hardware-requests: Decomission mw2098 - https://phabricator.wikimedia.org/T164959#3263666 (10RobH) [15:30:18] (03PS5) 10Giuseppe Lavagetto: restbase: convert test cluster to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/353048 [15:37:31] 06Operations, 10ops-eqiad: decommission indium - https://phabricator.wikimedia.org/T165345#3263693 (10Jgreen) [15:38:03] 06Operations, 10ops-eqiad: Analytics1040 system board repair needed - https://phabricator.wikimedia.org/T164942#3263708 (10Cmjohnson) The new system board has been ordered through Dell but is back ordered....Should hopefully be in this week. [15:39:26] !log upgrade pybal to 1.13.6 across the LVS fleet [15:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:12] (03PS6) 10Giuseppe Lavagetto: restbase: convert test cluster to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/353048 [15:52:20] 06Operations, 10media-storage, 13Patch-For-Review, 15User-fgiunchedi: Implement storage policies for swift - https://phabricator.wikimedia.org/T151648#3263739 (10fgiunchedi) [15:55:21] (03PS7) 10Giuseppe Lavagetto: restbase: convert test cluster to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/353048 [16:00:37] (03PS3) 10Paladox: Gerrit: Remove "" around T\\d+ in gerrit.config [puppet] - 10https://gerrit.wikimedia.org/r/352710 [16:01:11] (03PS5) 10Paladox: Jenkins: Add noncanon to jenkins proxy site [puppet] - 10https://gerrit.wikimedia.org/r/351391 [16:21:43] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [16:21:53] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2025896 [16:22:33] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [16:24:31] 06Operations, 10media-storage, 15User-fgiunchedi: Running swiftrepl is not puppetized - https://phabricator.wikimedia.org/T162123#3263834 (10fgiunchedi) [16:24:33] 06Operations, 15User-fgiunchedi: Reduce Swift technical debt - https://phabricator.wikimedia.org/T162792#3263833 (10fgiunchedi) [16:29:33] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [16:32:23] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 61501 MB (12% inode=99%) [16:32:33] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [16:44:23] RECOVERY - Disk space on elastic1025 is OK: DISK OK [16:44:55] (03CR) 10Giuseppe Lavagetto: [C: 032] restbase: convert test cluster to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/353048 (owner: 10Giuseppe Lavagetto) [16:45:06] 06Operations, 10DBA: Adapt wmf-mariadb10 package for jessie or puppetize differently its service to adapt it to systemd - https://phabricator.wikimedia.org/T116903#3263938 (10jcrespo) we will repurpose this for stretch, we'll keep probably 10.0 on jessie using inet.d. [16:46:15] 06Operations, 10DBA: Adapt wmf-mariadb101 package for stretch and adapt its service to systemd - https://phabricator.wikimedia.org/T116903#3263939 (10jcrespo) [16:46:51] What CPUs does Wikipedia use? [16:48:36] setup_: it's all Intel except one system [16:49:32] interesting [16:49:46] Which specs are they? [16:50:03] PROBLEM - puppet last run on cerium is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [16:52:03] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [16:55:23] <_joe_> urandom: cerium is done, I restarted restbase and cassandra there with no issues [16:56:41] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3264019 (10RobH) So part of the issue on this system is it is a lease, not WMF owned. We cannot just use shelf spares, since we have to use ap... [16:57:21] _joe_: k, i'll have a look-see [16:59:15] _joe_: LGTM [16:59:24] <_joe_> urandom: great! [16:59:31] <_joe_> I'll do the main cluster tomorrow then [16:59:53] _joe_: you planning on a restart there as well? [16:59:58] or was that a precaution here? [16:59:59] <_joe_> no [17:00:00] k [17:00:04] gehel: Dear anthropoid, the time has come. Please deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170515T1700). [17:00:08] <_joe_> it was here since the seeds list is wrong [17:00:14] right [17:00:15] <_joe_> I hope that's not the case for the main cluster [17:00:23] heh [17:01:52] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3264027 (10Papaul) Wed 5/10/2017 10:45 AM Thank you Papaul, I have put in a request to Intel Support. They will reply with a form that we will... [17:03:33] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3264036 (10Papaul) Thu 5/11/2017 10:29 AM from Please see below. Bo Rivera Please see below. Please see below. Hello, An update was made to s... [17:05:33] _joe_: so i did this https://wikitech.wikimedia.org/w/index.php?title=Switch_Datacenter/DeploymentServer&diff=prev&oldid=1759120 but in the actual operations/switchdc i dont see it yet [17:07:08] <_joe_> mutante: sorry, it wasn't that, it was maintenance_server [17:07:25] <_joe_> (but I wanted to underline we need to check there too) [17:07:36] _joe_: ah, ok, yea makes sense [17:08:13] i see the mediawiki.py for that, yep [17:11:53] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 19864 [17:15:32] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3264134 (10RobH) Ok, I'm going to attempt to summarize what I know to be the current issue(s) with elastic2020. * System has issues starting b... 
[17:22:16] !log mobrovac@tin Started deploy [restbase/deploy@c70a1e1] (dev-cluster): Bring RESTBase up to date in the Dev Cluster [17:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:38] 06Operations: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3264136 (10Reedy) [17:24:08] !log mobrovac@tin Finished deploy [restbase/deploy@c70a1e1] (dev-cluster): Bring RESTBase up to date in the Dev Cluster (duration: 01m 51s) [17:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:21] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3264141 (10Papaul) Email Dasher about the failed SSD may 1 Hello Brynden, I received the main board and was in the processing of installing and... [17:26:43] 06Operations, 10ops-eqiad, 10netops: Interface errors on asw-c-eqiad:xe-8/0/38 - https://phabricator.wikimedia.org/T165008#3264152 (10Cmjohnson) [17:27:15] 06Operations, 10ops-eqiad, 06Analytics-Kanban: analytics1030 stuck in console while booting - https://phabricator.wikimedia.org/T162046#3264155 (10Cmjohnson) [17:33:56] 06Operations, 10ops-eqiad, 15User-fgiunchedi: HP RAID icinga alert on ms-be1021 - https://phabricator.wikimedia.org/T163777#3264163 (10Cmjohnson) A case has been opened for this server. Let's work this one and them move on to the others...ms-be1016, 1019 and 1020 should be included in the list. Your case... [17:37:40] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3264166 (10RobH) Ok, I've emailed Dasher to inquire about this with the following: > Dasher Folks, > > So it seems some of this conversatio... [17:48:25] addshore: during the SWAT earlier, did you deploy only to group0? group1 and group2 are also running 1.30.0-wmf.1 [17:52:24] I deployed them to everything running the branch! [17:53:27] gilles: why? [17:53:50] ah, thanks, just checking [17:57:29] 06Operations, 10Analytics, 10Analytics-Cluster: rack/setup/install replacement to stat1003 (stat1004 or misc name?) - https://phabricator.wikimedia.org/T165366#3264224 (10RobH) [17:57:58] 06Operations, 10procurement: rack/setup/install replacement to stat1002 (stat1004 or misc name?) - https://phabricator.wikimedia.org/T165368#3264256 (10RobH) [17:58:03] 06Operations, 10Analytics, 10Analytics-Cluster: rack/setup/install replacement to stat1003 (stat1005 or misc name?) - https://phabricator.wikimedia.org/T165366#3264272 (10RobH) [17:58:45] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban: Reinstall Analytics Hadoop Cluster with Debian Jessie - https://phabricator.wikimedia.org/T157807#3264282 (10RobH) [17:59:47] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban: Reinstall Analytics Hadoop Cluster with Debian Jessie - https://phabricator.wikimedia.org/T157807#3017036 (10RobH) [18:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170515T1800). Please do the needful. [18:00:05] schana, Jdlrobson, and raynor: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[18:00:18] here \o [18:00:19] hello [18:00:26] 06Operations, 10Analytics, 10Analytics-Cluster, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3264307 (10RobH) [18:00:28] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3264301 (10RobH) 05Open>03Resolved This is ordered and being received in on linked #procurement task, as well as setup on task T165368. As such, this #hw-request... [18:00:36] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1003 replacement - https://phabricator.wikimedia.org/T159839#3264308 (10RobH) 05Open>03Resolved This is ordered and being received in on linked #procurement task, as well as setup on task T165366. As such, this #hw-request... [18:01:31] hello o/ [18:09:46] who's doing swat today? all of releng are out [18:09:57] aude: RainbowSprinkles RoanKattouw Dereckson ? [18:10:13] I can do it but lemme find a charger first [18:10:19] I would like my laptop to not die halfway :) [18:10:23] thanks RoanKattouw [18:11:17] RoanKattouw: I was on the Dover-Calais ferry yesterday [18:11:20] I so wanted to deploy something [18:11:41] Uh, no I wasn't [18:11:43] Dover-Dunkirk [18:12:19] haha [18:12:28] I tried to investigate today's VE UBNs from a train [18:12:48] Found a charger? [18:12:49] But the train wifi was broken and my 3G was pretty slow, so I didn't get anywhere before my train arrived [18:13:08] Yup I'm plugged in [18:14:22] (03CR) 10Catrope: [C: 032] Disable test reader QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353876 (https://phabricator.wikimedia.org/T131949) (owner: 10Nschaaf) [18:15:06] that's what is cool about the Amtrak trains over here, they are slow enough for 3/4G roaming to still work [18:15:25] (03Merged) 10jenkins-bot: Disable test reader QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353876 (https://phabricator.wikimedia.org/T131949) (owner: 10Nschaaf) [18:15:31] (and electric outlet) merged from train, successfully [18:15:51] (03CR) 10jenkins-bot: Disable test reader QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353876 (https://phabricator.wikimedia.org/T131949) (owner: 10Nschaaf) [18:16:23] schana: Your patch is on mwdebug1002, please test [18:16:28] ack [18:17:24] looks good, thanks [18:20:39] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Disable test reader QuickSurveys (T131949, T164769, T164894, T164960, T164943) (duration: 00m 40s) [18:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:50] T164894: Test reader survey in multiple languages - Japanese - https://phabricator.wikimedia.org/T164894 [18:20:50] T164943: Outline needed changes to github-webhook - https://phabricator.wikimedia.org/T164943 [18:20:50] T131949: Repeat the big English reader survey in one or two more languages - https://phabricator.wikimedia.org/T131949 [18:20:50] T164769: Test reader survey in multiple languages - Romanian - https://phabricator.wikimedia.org/T164769 [18:20:51] T164960: Test reader survey in multiple languages - German - https://phabricator.wikimedia.org/T164960 [18:31:11] @seen hashar [18:31:11] mutante: Last time I saw hashar they were quitting the network with reason: Quit: Textual IRC Client: www.textualapp.com N/A at 5/15/2017 11:47:41 AM (6h43m29s ago) [18:31:54] RoanKattouw: ready when you are [18:32:04] Oh it finally merged [18:32:15] Sorry it was taking so
long that I had gotten distracted catching up on other backlog [18:32:28] Thanks for the ping [18:33:07] jdlrobson: Ready for you on mwdebug1002 [18:33:18] \o/ [18:33:28] so raynor you know how to test this? [18:34:06] yes [18:34:25] I think yes [18:37:46] RoanKattouw: it works properly on debug1002 - good to go [18:37:57] RoanKattouw: yup same here [18:38:03] sync away [18:38:40] Cool, syncing [18:42:24] Hmm [18:42:26] 18:39:36 Check 'Logstash Error rate for mw1279.eqiad.wmnet' failed: ERROR: 18% OVER_THRESHOLD (Avg. Error rate: Before: 0.16, After: 2.00, Threshold: 1.63) [18:42:35] Let's see if I can find out what that was [18:42:49] It was only one of the canaries so I'm a bit skeptical [18:44:16] Oh it's because it's an API host, and there's an unrelated error coming from the API [18:44:24] Which I will fix later [18:44:32] Now trying to sync again, let's see if it'll let me get away with it this time [18:45:04] !log Canary failing on mw1279 due to Wikimedia\Rdbms\Database::makeList: empty input for field rev_id from ApiQueryRevisions [18:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:14] OK, it passed this time [18:45:16] !log catrope@tin Synchronized php-1.30.0-wmf.1/extensions/MobileFrontend/: Revert "Use csrf token for watching" (T165209) (duration: 00m 41s) [18:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:24] T165209: Watchstar feature broken: Tapping the watchlist star while logged in results in "mobile-frontend-watchlist-error" popup message - https://phabricator.wikimedia.org/T165209 [18:50:25] RoanKattouw: i've got to rush off but raynor will double check on production. Thanks for your help today! :) [18:50:55] Aha, the API error is fixed in master already thanks to andre__ [18:50:57] *anomie [18:51:01] yup - I'm here, just let me know when to check it [18:52:10] raynor: It's in production already, so please test it there now [18:52:34] You already tested it in debug so it's probably fine, but you can never have too much testing :) [18:52:35] on it [18:54:08] tested on two wikis, works [18:54:51] RoanKattouw: thanks for deployment, everything works properly \o/ [18:57:00] 06Operations, 10MediaWiki-ResourceLoader, 10MediaWiki-extensions-CentralNotice, 06Performance-Team, and 2 others: Provide location, logged-in status and device information in ResourceLoaderContext - https://phabricator.wikimedia.org/T103695#1396785 (10AndyRussG) @Krinkle Thanks so much for the explanation!... [19:01:52] 06Operations, 10Ops-Access-Requests: add Arzhel Younsi to datacenter access lists - https://phabricator.wikimedia.org/T165054#3264711 (10RobH) [19:03:14] 06Operations, 10Ops-Access-Requests: add Arzhel Younsi to datacenter access lists - https://phabricator.wikimedia.org/T165054#3255501 (10RobH) @ayounsi: You can now login to https://wikimedia.gocyrusone.com/ via your email address, and use the password reset option to get your codfw login details. You are no... [19:18:51] 06Operations, 10ops-eqiad: rack/setup/install replacement to stat1002 (stat1004 or misc name?) - https://phabricator.wikimedia.org/T165368#3264840 (10RobH) [19:19:39] 06Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: rack/setup/install replacement to stat1002 (stat1004 or misc name?) - https://phabricator.wikimedia.org/T165368#3264256 (10RobH) [19:19:44] 06Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: rack/setup/install replacement to stat1003 (stat1005 or misc name?) 
- https://phabricator.wikimedia.org/T165366#3264842 (10RobH) [19:32:32] (03CR) 10Dzahn: "the jdk package is already installed on contint1001, but not on contint2001. (manually installed?). going ahead." [puppet] - 10https://gerrit.wikimedia.org/r/348961 (owner: 10Chad) [19:32:44] (03PS4) 10Dzahn: Jenkins: install jdk, not just jre [puppet] - 10https://gerrit.wikimedia.org/r/348961 (owner: 10Chad) [19:34:20] (03CR) 10Dzahn: [C: 032] Jenkins: install jdk, not just jre [puppet] - 10https://gerrit.wikimedia.org/r/348961 (owner: 10Chad) [19:36:24] (03CR) 10Dzahn: "contint1001: no-op contint2001: Notice: /Stage[main]/Jenkins/Package[openjdk-7-jdk]/ensure: ensure changed 'purged' to 'present'" [puppet] - 10https://gerrit.wikimedia.org/r/348961 (owner: 10Chad) [19:37:52] (03PS4) 10Dzahn: Labs contint: Install php5-gmp and php7.0-gmp [puppet] - 10https://gerrit.wikimedia.org/r/353194 (https://phabricator.wikimedia.org/T164977) (owner: 10Paladox) [19:42:29] !log catrope@tin Synchronized php-1.30.0-wmf.1/includes/api/ApiQueryRevisions.php: T165100 (duration: 00m 40s) [19:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:38] T165100: Wikimedia\Rdbms\Database::makeList: empty input for field rev_id - https://phabricator.wikimedia.org/T165100 [19:55:53] PROBLEM - HP RAID on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [20:00:05] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170515T2000). Please do the needful. [20:00:16] Nothing for ORES [20:02:36] deploying parsoing in a little bit [20:02:38] parsoid [20:04:32] !log ssastry@tin Started deploy [parsoid/deploy@132d0e5]: Updating Parsoid to a182c227 [20:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:01] no MCS deploy today [20:10:11] 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#3265015 (10Paladox) Not sure if we should bother doing this as I found problems when upgrading a gerrit install f... [20:11:54] !log ssastry@tin Finished deploy [parsoid/deploy@132d0e5]: Updating Parsoid to a182c227 (duration: 07m 21s) [20:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:02] !log Updated Parsoid to a182c227 (T141226, T164792, T37247, T153107, T163091, T164006, T161151, T162920, T163549) [20:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:20] T162920: In multi-content/template-block scenarios, Linter displays "--" in the "Through a template"? column - https://phabricator.wikimedia.org/T162920 [20:20:20] T161151: Parsoid should resolve template paths before providing them to Linter - https://phabricator.wikimedia.org/T161151 [20:20:20] T164792: Add class mw-parser-output to Parsoid's output - https://phabricator.wikimedia.org/T164792 [20:20:20] T164006: Suggestion: API for fetching lint errors for a specific revision - https://phabricator.wikimedia.org/T164006 [20:20:20] T163091: Parsoid: Add API endpoint to get lint errors for arbitrary wikitext - https://phabricator.wikimedia.org/T163091 [20:20:20] T153107: Parsoid is generating [[Foo|Foo]] instead of [[Foo]] for some VE edits - https://phabricator.wikimedia.org/T153107 [20:20:21] T37247: content-holding
should only contain the page text - https://phabricator.wikimedia.org/T37247 [20:20:21] T141226: Missing data-mw content in wikitext leads to html2wt exceptions - https://phabricator.wikimedia.org/T141226 [20:20:22] T163549: Only lint pages that have wikitext contentmodel - https://phabricator.wikimedia.org/T163549 [20:20:44] PROBLEM - Nginx local proxy to apache on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.156 second response time [20:21:43] RECOVERY - Nginx local proxy to apache on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.232 second response time [20:25:22] Hmm. That polluted the 'mention' lists on all those tasks with tasks that are completely unrelated except that they happen to have been deployed in Parsoid at the same time. [20:26:32] anomie: that's probably intentional, to note that the fixes for those have been deployed, since i guess parsoid doesn't have a strict release schedule like mediawiki. [20:26:50] oh, you mean, each task refers to each other. hmm. [20:27:05] that's silly but harmless! [20:29:50] anomie, i suppose the alternative is to have n log statements .. which can be painful. [20:30:32] or maybe a new stashbot feature to edit out unrelated mentions. [20:30:33] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [20:30:54] but what is the standard practice for doing this? [20:32:12] I don't know, most things ride the train and so probably could use the tags like "MW-1.30-release-notes (WMF-deploy-2017-05-23_(1.30.0-wmf.2))" that get bot-added on merge. What's the problem being solved by mentioning all the task numbers? [20:32:32] we resolve tasks once they are gerrit-merged. [20:32:50] the stashbot mention is a notification that the code is now actually live in production. [20:33:33] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [20:38:59] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3265142 (10Papaul) @Robh yes we do; but there are 300GB [20:42:40] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3265143 (10RobH) @Papaul: The spares tracking shows that we have 3 of the Intel S3610 800GB ssds on the spare shelf? We recently ordered thes... [20:47:09] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3265144 (10Papaul) @Robh yes we do have some 800GB SSDs for spare but the one we are trying to replace is DC S3500 series. [20:48:16] !log run refreshImageMetadata --force for group1 + group2 wikis except commons on terbium T150741 [20:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:23] T150741: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741 [20:53:45] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3265154 (10RobH) Ahh, sorry for the miscommunication then. So, here is where we stand on this system * It is a lease, if a shelf spare is use... 
[20:54:42] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3265155 (10RobH) @Gehel: Can you advise if this can remain offline for another week or two for the SSD replacement. See my comment above for f... [21:00:05] dapatrick, bawolff, and Reedy: Dear anthropoid, the time has come. Please deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170515T2100). [21:08:03] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:08] !log mobrovac@tin Started deploy [restbase/deploy@c52add0]: Expose the new /transform/wikitext/to/lint end point to the public - T163091 [21:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:16] T163091: Parsoid: Add API endpoint to get lint errors for arbitrary wikitext - https://phabricator.wikimedia.org/T163091 [21:15:36] (03PS1) 10RobH: decommission mw2098 [puppet] - 10https://gerrit.wikimedia.org/r/353918 [21:17:22] (03PS1) 10RobH: decommission mw2098 (production dns) [dns] - 10https://gerrit.wikimedia.org/r/353920 [21:18:48] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission mw2098 - https://phabricator.wikimedia.org/T164959#3265240 (10RobH) a:05RobH>03Papaul @Papaul: Before I move through the checklist and disable everything, I'll need to know what the switch port is for this server? The mw... [21:19:40] !log mobrovac@tin Finished deploy [restbase/deploy@c52add0]: Expose the new /transform/wikitext/to/lint end point to the public - T163091 (duration: 06m 32s) [21:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:48] T163091: Parsoid: Add API endpoint to get lint errors for arbitrary wikitext - https://phabricator.wikimedia.org/T163091 [21:20:21] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission mw2098 - https://phabricator.wikimedia.org/T164959#3265245 (10RobH) [21:20:35] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission mw2098 - https://phabricator.wikimedia.org/T164959#3252410 (10RobH) [21:23:22] 06Operations, 06Performance-Team, 10Thumbor, 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), 13Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3265252 (10Gilles) [21:23:26] (03PS4) 10XXN: Fixing "Book_talk" namespace definition for ro.wikipedia: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352728 [21:25:29] (03Draft2) 10Zppix: Raise the account creation limit for www.enwp.org/WP:Meetup/Eugene/WikiAPA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353921 [21:30:01] jouncebot: next [21:30:01] In 1 hour(s) and 29 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170515T2300) [21:31:23] PROBLEM - Nginx local proxy to apache on mw1181 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.152 second response time [21:31:23] PROBLEM - Apache HTTP on mw1181 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time [21:31:46] jouncebot: refresh [21:31:48] I refreshed my knowledge about deployments. 
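For context on XXN's change above (gerrit 352728, adjusting the "Book_talk" namespace alias for ro.wikipedia): per-wiki namespace aliases at WMF live in wmf-config's InitialiseSettings.php. Below is a hypothetical sketch of the shape of such an entry — the alias string and the namespace ID 103 are placeholders, not the actual contents of the patch:

```php
// Fragment of the big settings array in wmf-config/InitialiseSettings.php.
// Placeholder values only; see Gerrit change 352728 for the real patch.
'wgNamespaceAliases' => [
	// The '+' prefix merges these aliases into the defaults for rowiki
	// instead of replacing them.
	'+rowiki' => [
		'Discuție_Carte' => 103, // hypothetical alias -> Book_talk namespace ID
	],
],
```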
[21:32:23] RECOVERY - Nginx local proxy to apache on mw1181 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.174 second response time [21:32:23] RECOVERY - Apache HTTP on mw1181 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.100 second response time [21:36:03] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [21:49:22] hey does throttle.php support ipv6? [21:53:39] (03CR) 10Milimetric: "I have a couple of questions. First, does any other config need to change for the Collection extension to recognize the new namespace nam" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352728 (owner: 10XXN) [21:56:11] (03CR) 10Hashar: "Danke!!!" [puppet] - 10https://gerrit.wikimedia.org/r/348961 (owner: 10Chad) [21:59:19] (03PS3) 10Zppix: Raise the account creation limit for www.enwp.org/WP:Meetup/Eugene/WikiAPA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353921 [22:01:11] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3265416 (10Gehel) Yes, elastic2020 can stay offline for one more week. [22:05:43] to evening swat swatter I may be a bit late fyi [22:16:50] !log mobrovac@tin Started deploy [restbase/deploy@d98af6f]: Wt2lint bug fix - T163091 [22:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:57] T163091: Parsoid: Add API endpoint to get lint errors for arbitrary wikitext - https://phabricator.wikimedia.org/T163091 [22:23:34] !log mobrovac@tin Finished deploy [restbase/deploy@d98af6f]: Wt2lint bug fix - T163091 (duration: 06m 44s) [22:23:40] (03PS5) 10XXN: Fixing "Book_talk" namespace alias for ro.wikipedia: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352728 [22:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:43] T163091: Parsoid: Add API endpoint to get lint errors for arbitrary wikitext - https://phabricator.wikimedia.org/T163091 [22:24:22] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3265838 (10RobH) cool, we'll avoid using a shelf spare then and i'll be following up with dasher on a daily basis until resolution. [22:25:34] (03CR) 10XXN: [C: 031] "1. AFAIK - no; 2. The default Namespace definitions were already set in /r/#/c/139766/ In fact this is a namespace alias (for accessibili" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352728 (owner: 10XXN) [22:37:33] PROBLEM - HP RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. 
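On the throttle.php IPv6 question above: the account-creation throttle exceptions in wmf-config are plain PHP array entries, and the ranges are matched with MediaWiki's IP utilities, which understand both IPv4 and IPv6 CIDR notation. A minimal sketch of the usual shape of an entry — every value below is a placeholder, not the content of Zppix's change 353921:

```php
// Hypothetical wmf-config/throttle.php entry; all values are placeholders.
$wmgThrottlingExceptions[] = [
	'from'   => '2017-05-26T00:00 -7:00',            // window start
	'to'     => '2017-05-27T00:00 -7:00',            // window end
	'range'  => [ '192.0.2.0/24', '2001:db8::/32' ], // IPv4 and IPv6 ranges both work
	'dbname' => [ 'enwiki' ],                        // wikis the exception applies to
	'value'  => 50,                                  // raised per-IP account creation limit
];
```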
[22:37:51] (03PS1) 10Dzahn: wikistats: add support for Debian stretch [puppet] - 10https://gerrit.wikimedia.org/r/353926 [22:40:29] (03PS2) 10Dzahn: wikistats: add support for Debian stretch [puppet] - 10https://gerrit.wikimedia.org/r/353926 [22:41:00] (03PS3) 10Dzahn: wikistats: add support for Debian stretch [puppet] - 10https://gerrit.wikimedia.org/r/353926 [22:42:13] RECOVERY - HP RAID on ms-be1035 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [22:42:38] (03CR) 10Dzahn: [C: 032] wikistats: add support for Debian stretch [puppet] - 10https://gerrit.wikimedia.org/r/353926 (owner: 10Dzahn) [22:51:32] (03PS4) 10Zppix: Raise the account creation limit for www.enwp.org/WP:Meetup/Eugene/WikiAPA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353921 (https://phabricator.wikimedia.org/T165421) [22:55:50] jouncebot: next [22:55:51] In 0 hour(s) and 4 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170515T2300) [22:57:33] PROBLEM - HP RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [22:59:13] (03PS1) 10Dzahn: wikistats: more stretch support, php-cli package [puppet] - 10https://gerrit.wikimedia.org/r/353928 [22:59:58] (03PS2) 10Dzahn: wikistats: more stretch support, php-cli package [puppet] - 10https://gerrit.wikimedia.org/r/353928 [23:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170515T2300). Please do the needful. [23:00:05] mooeypoo and Zppix: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:10] o/ [23:00:56] \o [23:01:26] (03CR) 10Dzahn: [C: 032] wikistats: more stretch support, php-cli package [puppet] - 10https://gerrit.wikimedia.org/r/353928 (owner: 10Dzahn) [23:02:18] !deploy_roulette [23:02:20] feel free to do mooeypoo's patch first as mine takes little time [23:02:30] * Zppix spins the bottle for mutante [23:02:35] * Zppix lands on mutante [23:02:44] xD [23:02:44] evades the bottle [23:03:05] mutante: the bottle cannot be evaded [23:03:05] * mooeypoo dances towards the bottle [23:03:34] Zppix: jouncebot says "user not found" in deployer list [23:04:08] jouncebot: told me that it needed a second to re-query the db and that it queried again and it chose yours lol [23:04:20] mutante: [23:07:24] Zppix: nice try, can't re-use db connection. already open. seriously, i'm not deploying and jouncebot already pings the people [23:08:33] if you have something for puppet swat that would be different [23:09:24] mutante: :( [23:09:45] who's gonna deploy [23:10:38] if no one can deploy, you just reschedule for another day unless it's critically urgent [23:12:25] maybe they will deploy but at "Vienna"-window. [23:12:25] i thought this week was a code freeze week due to offsites? [23:12:51] ahh, there is a note that no train [23:13:01] and any swats involving release engineering should be delayed [23:13:06] or others maybe avail [23:13:23] heh. thanks rob [23:13:34] https://wikitech.wikimedia.org/wiki/Deployments#Week_of_May_15th [23:14:09] .... but are there no swatters... [23:14:59] Zppix: your patch isn't critical for this week.
The event it raises the limit for is 2 weeks out [23:15:21] Zppix: people are travelling today/tomorrow, i'm sure people will be willing to help later on [23:15:31] Wait, no swatters? [23:15:43] mooeypoo: you have the powers right? [23:15:47] I do not [23:15:54] I also am not sure how to use said powers [23:16:08] but as others have pointed out, non critical things should probably be delayed a little bit [23:16:16] This is fairly critical? [23:16:22] mooeypoo: looking [23:17:16] mooeypoo's patch is pretty safe looking and it fixes a user facing bug [23:17:19] I acn deploy it [23:17:21] *can [23:17:28] bd808: mine is basic doesn't even require testing [23:17:54] Yeah it's fixing a bad bug in RCFilters, which people are excitedly using after the blog post [23:18:26] * bd808 waits for jerkins [23:18:47] * mooeypoo awaits to test/verify [23:20:16] mooeypoo: you should learn how to deploy too. It's both fun and useful. :) [23:20:36] you have the shirt already too! [23:20:49] * mooeypoo nods [23:20:56] I am worried I'd need a jacket [23:21:02] But yes, I should [23:21:09] What's the upgrade over a shirt? A hat? [23:21:30] I think a facial tattoo ;) [23:21:35] mooeypoo: if you are going to the hackathon, Reedy will probably teach you [23:21:39] rofl [23:21:43] * mooeypoo will ask [23:21:49] Roan can probably do that too [23:21:59] yes, also a good choice [23:22:05] I'll have a deploy-party. There should be cookies somewhere there. [23:22:44] Sam loves to do deploys from the hackathon. Bonus is that it always makes Greg nervous. [23:23:10] * bd808 watches little progress meters crawl on the zuul status page [23:23:17] air deploy from 10,000ft [23:23:48] I don't know if he's done that one yet. There was the English channel crossing train deploy though [23:24:05] * bd808 does not recommend [23:24:47] he lost wifi mid-scap and didn't finish until he had driven into Berlin [23:24:56] :o [23:25:00] domas whitepaged the site from the plane (and fixed it) [23:25:41] sam has deployed from everything I think plane/train/boat/meetups [23:26:04] Does that count for a face tattoo? [23:26:26] I'm just looking for the requirements [23:27:01] I was thinking something along the lines of the "poor impulse control" tattoo from Snow Crash [23:27:16] ha [23:27:23] that trusty test is not speedy... [23:27:42] mooeypoo: to get the face tattoo it requires deployment from the sun XD [23:27:52] bd808: i hope you mean jessie [23:28:37] Zppix: nope. we still run trusty on gate-and-submit. wikitech is still running on php 5.whatever [23:28:52] i thought releng got rid of trusty... [23:29:00] Zppix, what, like this ? https://upload.wikimedia.org/wikipedia/commons/thumb/3/32/SPARCstation_1.jpg/220px-SPARCstation_1.jpg [23:29:15] no i mean ON the sun mooeypoo [23:29:24] E_TOOHOT [23:29:33] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [23:29:37] Zppix, it won't be very comfortable sitting on that one, though doable. [23:29:56] bd808: you mean WMF doesn't have standard-issue sun suits? [23:30:01] I had an E450 that made a good coffee table [23:30:16] Zppix, maybe I should've shared this one to be more explicit in my joke https://en.wikipedia.org/wiki/Sun_Microsystems#/media/File:SPARCstation_1.jpg [23:30:36] mooeypoo: that's when you file for unsafe working conditions :P [23:30:48] As opposed to deploying from The Sun [23:31:08] I am slowly building the image of what your requirements for this look like, Zppix [23:31:27] mooeypoo: and?
[23:31:29] * bd808 grumbles that 12 minutes have passed and the tests are still running [23:31:34] https://wikitech.wikimedia.org/wiki/Obsolete:Sun_storage [23:31:39] bd808: try turning it off and on again :P [23:31:40] ^ yes, WMF once had a sun [23:32:00] toolserver was mostly Sun hardware too [23:32:18] no i mean this sun https://en.wikipedia.org/wiki/File:The_Sun_by_the_Atmospheric_Imaging_Assembly_of_NASA%27s_Solar_Dynamics_Observatory_-_20100819.jpg [23:32:33] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [23:32:39] ^ talk about long link [23:33:02] Zppix, interesting, I don't recognize this version [23:33:07] Is it running Oracle? [23:33:24] mooeypoo: no, it's running Hot_AF v0.1 [23:33:47] Zppix, you haven't seen long links until you worked with RCFilters (incidentally, I'm working on a fix for that now...) [23:33:54] ok, jerkins finally finished [23:33:58] yay [23:34:02] where do I test [23:34:22] mooeypoo: no i have [23:35:16] for example mooeypoo this used to be a website www.thisisaverylongurlidontknowwhyiregisteredthis.com/youthoughtiwasdoneyouwerewrong/stillnotdone/hi [23:35:19] mooeypoo: It's on mwdebug1001 now [23:36:11] * mooeypoo goes to test [23:39:00] uhm.. I'm testing on enwiki with the chrome extension on 1001 and I don't see the fix running, am I doing it wrong? [23:39:19] * mooeypoo does a hard refresh [23:39:20] hmmm... maybe. It seems to be working for me. [23:39:22] hang on [23:39:29] https://www.mediawiki.org/wiki/Special:RecentChanges?hideliu=0&hideanons=0&userExpLevel=&hidemyself=0&hidebyothers=0&hidebots=1&hidehumans=0&hidepatrolled=0&hideunpatrolled=0&hideminor=0&hidemajor=0&hidelastrevision=0&hidepreviousrevisions=0&hidepageedits=0&hidenewpages=0&hidecategorization=1&hidelog=0&watchlist=&highlight=1&registration__hideanons_color=c5&changeType__hidenewpages_color=c1&userExpLevel__newcomer_color=c3 [23:39:46] YES! works now [23:39:52] sweet. [23:39:57] ok stupid chrome and its immovable cache [23:41:50] !log bd808@tin Synchronized php-1.30.0-wmf.1/resources/src/mediawiki.rcfilters/mw.rcfilters.Controller.js: RCFilters: Actually read/write highlight parameter (T165107) (duration: 00m 40s) [23:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:59] T165107: Highlight settings contained in RC Page URLs fail to load - https://phabricator.wikimedia.org/T165107 [23:42:03] \o/ [23:42:31] it may take a while for that to propagate through varnish [23:42:43] thanks bd808 [23:43:11] I just hard-refreshed without the testing extension on, on enwiki and it works [23:43:13] thanks!
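The "chrome extension" used in the testing above is WikimediaDebug, which routes a request to a chosen debug backend by setting an X-Wikimedia-Debug header. The same thing can be done by hand; the sketch below assumes the backend=<host> attribute form of the header, so treat the exact header value as an assumption:

```php
<?php
// Pin a request to a specific debug backend, as the WikimediaDebug
// browser extension does. Header attribute/value assumed, based on the
// hosts named in the log (mwdebug1001/mwdebug1002).
$context = stream_context_create( [
	'http' => [
		'method' => 'GET',
		'header' => "X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet\r\n",
	],
] );
$html = file_get_contents( 'https://www.mediawiki.org/wiki/Special:RecentChanges', false, $context );
echo $html === false ? "request failed\n" : substr( $html, 0, 200 ) . "\n";
```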
[23:48:07] (03PS1) 10Dzahn: wikistats: puppetize deploy script [puppet] - 10https://gerrit.wikimedia.org/r/353932 [23:49:16] (03CR) 10jerkins-bot: [V: 04-1] wikistats: puppetize deploy script [puppet] - 10https://gerrit.wikimedia.org/r/353932 (owner: 10Dzahn) [23:51:21] (03PS2) 10Dzahn: wikistats: puppetize deploy script [puppet] - 10https://gerrit.wikimedia.org/r/353932 [23:52:22] (03CR) 10jerkins-bot: [V: 04-1] wikistats: puppetize deploy script [puppet] - 10https://gerrit.wikimedia.org/r/353932 (owner: 10Dzahn) [23:54:15] (03PS3) 10Dzahn: wikistats: puppetize deploy script [puppet] - 10https://gerrit.wikimedia.org/r/353932 [23:55:11] (03CR) 10jerkins-bot: [V: 04-1] wikistats: puppetize deploy script [puppet] - 10https://gerrit.wikimedia.org/r/353932 (owner: 10Dzahn) [23:56:35] (03PS4) 10Dzahn: wikistats: puppetize deploy script [puppet] - 10https://gerrit.wikimedia.org/r/353932