[00:00:30] Operations, Discovery, Maps, Patch-For-Review: "Is maps service alive?" check - https://phabricator.wikimedia.org/T137851#2381499 (BBlack) Removed that, left in the actual LVS-level checks. Seems sane.
[00:00:40] Operations, Discovery, Maps, Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2381502 (BBlack)
[00:00:42] Operations, Discovery, Maps, Patch-For-Review: "Is maps service alive?" check - https://phabricator.wikimedia.org/T137851#2381500 (BBlack) Open>Resolved a:BBlack
[00:02:00] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.019 second response time
[00:02:11] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:03:58] Operations, Discovery, Maps: Ensure that maps server can be automatically installed (fully puppetized) - https://phabricator.wikimedia.org/T135750#2381514 (BBlack) Where are we at on this really? I understand that database initialization can't be automated, but is the rest of the setup pretty much a...
[00:05:03] Operations: Rack/Setup 4 map servers in eqiad - https://phabricator.wikimedia.org/T135018#2381517 (BBlack)
[00:05:05] Operations, Discovery, Maps, Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2381516 (BBlack)
[00:05:22] Operations: Rack/Setup 4 map servers in eqiad - https://phabricator.wikimedia.org/T135018#2285911 (BBlack)
[00:05:24] Operations, Discovery, Maps, Epic: Epic: cultivating the Maps garden - https://phabricator.wikimedia.org/T137616#2381520 (BBlack)
[00:05:52] voltaire joke
[00:08:05] Operations, Discovery, Maps, Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2381527 (BBlack) When reasonably confident on puppetization and monitoring, should get someone who's more-familiar with setting up internal services (in terms of proper puppetizat...
[00:09:18] :)
[00:12:35] bblack, thanks, maps are all grown up now :) should i be added to service broken pings?
[00:13:03] also, other than icinga, any other places I should be looking at (other than logstash)
[00:13:26] yurik: yeah I'm not sure how we'll configure that yet. it probably should alert a service team of some kind.
[00:13:52] yurik: for now I think the karto.svc standard icinga check doesn't alert anyone (except icinga UI + IRC), and the maps-lb public checks will page ops.
[00:14:18] understood. do you want gehel to sign off on the puppet before removing REFERER?
[00:14:31] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:14:57] someone other than me for sure, but probably alex or joe or someone should take a peek too, since they've been here forever-longer.
[00:15:32] bblack, ok, and should we remove REFERER outright, or keep it WMF-only for now?
[00:15:38] we want to find out sooner rather than later if someone in ops is going to have some valid complaint of the form "how the hell did this end up in production and paging me in the middle of the night when XYZ about it is clearly not well-configured..."
[00:16:01] heh
[00:16:04] yurik: I don't know yet, that's a tough question.
[00:16:26] there are many small sites that do amazing experiments with maps - like cross-map comparison, etc
[00:16:30] and i really don't want to block them
[00:16:31] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 3.246 second response time
[00:16:44] on the other hand, we obviously want to block abusers somehow
[00:16:52] there's not a lot of formal process built around these things in the org-wide general sense, although increasingly there's some for the services team specifically.
[00:17:11] it's not often we deploy whole new services, much less ones that don't fit standard templates/teams already established.
[00:17:31] btw, there is a HUGE demand for "youtubification" of services - where a graph or a map can be hosted in another site, served directly from us, with a link to us
[00:17:50] yeah
[00:18:24] I lean towards removing all the referer checks. it's what we've been architecting towards being able to support with all of this anyways, and we don't block referers on e.g. upload.wm.o image links.
[00:18:26] someone makes a fun graph on a wiki page, and someone else clicks "share this graph" and gets an easy to use html code to use it
[00:18:47] agree
[00:18:59] I'm a little bit worried that we need to be very sure terms of use are sound, and what their implications are...
[00:19:15] this one is controversial: we're in the "business" of supplying educational knowledge, not of supplying maps
[00:19:17] e.g. if someone builds a commercial gmaps competitor that does their own navigation but uses us for tile storage :P
[00:19:25] true.
but the idea here is that it's a full iframe, with all the content and copyright and link back
[00:19:28] not just the raw image
[00:19:43] bblack, easy enough to block them ;)
[00:20:00] referer isn't much of a guarantee anyways, it's not hard to hide referer
[00:20:03] we just add back the REFERER specifically for them
[00:20:18] i thought it's pretty tough to construct?
[00:20:32] it's kinda like saying "We don't accept requests with the X-I-Am: Evil header". they can just not set it :)
[00:21:00] browsers generally send referer, but I'm sure with js and/or plugins and who knows what else, it can be defeated.
[00:21:42] there's that whole referrer-policy stuff too, that we implemented to start sending origin-only referer to other sites post-https?
[00:21:48] possibly, not sure. It has proven to be a pretty good deterrent to all sorts of experiments ;)
[00:22:07] https://www.w3.org/TR/referrer-policy/
[00:22:30] referer-blocking is a nice statement of intent, I think it only prevents nice people though :)
[00:24:18] heh. i'm not really an expert on browser hacking, so not sure. I am pretty sure that it's pretty hard to fake, simply because otherwise that hypothetical maps site could have used google's own images just as well
[00:24:25] "Referrer-Policy: no-referrer"
[00:24:40] or mapbox's
[00:24:41] yeah but google probably cares less :)
[00:24:52] both of which are much nicer looking at this point
[00:24:58] so i wouldn't worry about that just yet :)
[00:25:14] someday... yeah :D
[00:26:01] anyway, i will ask gehel to sign off on puppet, review it with akosiaris or joe, and remove the referer stuff so that we can enable it on commons - i really want devs to play with it already
[00:26:13] anyways, rewinding to earlier about review before calling the production status real: mainly it's that I'm not that familiar with our latest practices and directions that we've been going with stuff from e.g. the services team.
[00:26:26] someone else will likely have more-insightful review on that level
[00:26:36] sounds good
[00:26:41] thanks for all your help!!!
[00:26:47] np
[00:26:49] seriously, you really kickstarted this thing a while back :)
[00:27:01] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:27:22] I had to help erase my bad karma from failing at it a year or two before that :)
[00:28:01] hehe
[00:28:19] well, it was definitely a team effort, that's for sure :)
[00:28:28] and required many moving parts
[00:28:55] 24 big machines serving this puppy! (and all of them are basically idling most of the time)
[00:33:22] (CR) Dereckson: [C: -1] "Very weak community consensus. Sure the request proponent confirms it's okay to proceed, but (i) there is an opinion from a contributor th" [mediawiki-config] - https://gerrit.wikimedia.org/r/294254 (https://phabricator.wikimedia.org/T130808) (owner: Urbanecm)
[01:18:31] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1103.80 seconds
[01:19:50] AaronSchulz: I updated the deployment documentation to mention submodule patches and to correct a couple of other minor details: https://wikitech.wikimedia.org/w/index.php?title=How_to_deploy_code&diff=653862&oldid=653516
[01:21:58] thanks
[01:30:38] !log aaron@tin Synchronized php-1.28.0-wmf.5/includes/parser: 48652dfc27d1bbaab41b3a4d8f7d6be23e2da6b6 (duration: 00m 34s)
[01:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:31:52] !log aaron@tin Synchronized php-1.28.0-wmf.6/includes/parser: 78de24a20c4662ea709e1f8af84bb5fae4aea2fa (duration: 00m 33s)
[01:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:34:34] Operations, Phabricator, Traffic: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2381584 (mmodell)
stalled>Open
[01:35:50] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:41:50] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.021 second response time
[01:47:20] PROBLEM - puppet last run on mw2227 is CRITICAL: CRITICAL: puppet fail
[01:51:15] !log ori@tin Synchronized php-1.28.0-wmf.5/resources/src/mediawiki.action/mediawiki.action.edit.stash.js: Idfad8407: Improve client-side edit stash change detection (duration: 00m 24s)
[01:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:52:29] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:53:59] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.06 seconds
[01:54:20] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.495 second response time
[02:13:36] (CR) Mholloway: "I just stumbled across some passing discussion of exactly this while looking at T120151 for unrelated reasons. The outcome was basically " [puppet] - https://gerrit.wikimedia.org/r/294052 (owner: BBlack)
[02:14:40] RECOVERY - puppet last run on mw2227 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[02:22:00] !log ori@tin Synchronized php-1.28.0-wmf.5/extensions/Scribunto/engines/LuaCommon/TitleLibrary.php: ad-hoc debug of vary-revision in scribunto (duration: 00m 26s)
[02:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:27:57] ori: what's that regarding?
[02:28:09] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:28:19] jackmcbarn: to gauge how often it gets called
[02:28:35] I'm about to undo it
[02:28:56] ah
[02:29:23] ori: btw, have you seen https://phabricator.wikimedia.org/T137369 yet?
[02:29:25] !log ori@tin Synchronized php-1.28.0-wmf.5/extensions/Scribunto/engines/LuaCommon/TitleLibrary.php: revert: ad-hoc debug of vary-revision in scribunto (duration: 00m 29s)
[02:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:29:59] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.5) (duration: 12m 24s)
[02:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:30:04] looking
[02:32:20] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.009 second response time
[02:36:13] jackmcbarn: I don't remember exactly, but it seems like a grossly inefficient misfeature. If you're hosting an event and a cop shows up and flashes a badge, you just let the cop through -- you don't stop to scan the guest list, just so you can tell the cop "you wouldn't be getting in if you weren't on-duty, because you're not on the guest list"
[02:37:08] ori: but you can't truthfully say "you're not on the guest list" without scanning the guest list
[02:37:10] it seems logical to limit loading and using the blacklist to those cases where it could plausibly matter to spam
[02:38:47] oh, i misread that
[02:38:51] i see
[02:39:41] it's just not useful to say "you're not on the guest list" when the person's going to get in anyway. But there is a defect, and I tend to agree with BethNaught about how to resolve it. That said, my admin experience is pretty light, so maybe I am misjudging the situation.
[02:41:37] I'll comment on the task
[02:41:46] thanks for flagging it, and sorry for not responding sooner
[02:41:54] np
[02:45:10] (CR) BBlack: "Functionally in the VCL code, nothing is setting the X-ZeroTLS request header, and thus none of the dead code can do anything.
The only e" [puppet] - https://gerrit.wikimedia.org/r/294052 (owner: BBlack)
[03:04:02] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.6) (duration: 16m 29s)
[03:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:10:57] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Jun 15 03:10:57 UTC 2016 (duration 6m 55s)
[03:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:30:08] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: puppet fail
[03:44:17] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:46:18] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 7.552 second response time
[03:49:47] !log aaron@tin Synchronized php-1.28.0-wmf.5/includes/parser/Parser.php: 23bac8905a9d60cdc0a068ca025644e091b9027f (duration: 00m 32s)
[03:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:51:11] !log aaron@tin Synchronized php-1.28.0-wmf.6/includes/parser/Parser.php: 4e6e1bc1f2de000f0fdd84dcf04f63a21127d24a (duration: 00m 30s)
[03:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:57:38] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:40:13] (PS1) KartikMistry: apertium-fra-cat: New upstream release, Rebuilt for Jessie [debs/contenttranslation/apertium-fra-cat] - https://gerrit.wikimedia.org/r/294425
[04:41:57] (PS2) KartikMistry: apertium-fra-cat: New upstream release, Rebuilt for Jessie [debs/contenttranslation/apertium-fra-cat] - https://gerrit.wikimedia.org/r/294425 (https://phabricator.wikimedia.org/T137768)
[04:49:35] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:55:43] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK:
HTTP/1.1 200 OK - 1701 bytes in 0.020 second response time
[04:57:09] (Abandoned) KartikMistry: Add initial Debian package for giella-core [debs/contenttranslation/giella-core] - https://gerrit.wikimedia.org/r/270671 (https://phabricator.wikimedia.org/T120087) (owner: KartikMistry)
[04:57:52] (PS1) KartikMistry: giella-core: Initial Debian packaging [debs/contenttranslation/giella-core] - https://gerrit.wikimedia.org/r/294426 (https://phabricator.wikimedia.org/T120087)
[05:25:21] (PS1) KartikMistry: giella-sme: Initial Debian packaging [debs/contenttranslation/giella-sme] - https://gerrit.wikimedia.org/r/294430 (https://phabricator.wikimedia.org/T120087)
[05:32:15] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:51:17] (PS1) KartikMistry: apertium-es-pt: Rebuild for Jessie and other fixes [debs/contenttranslation/apertium-es-pt] - https://gerrit.wikimedia.org/r/294431 (https://phabricator.wikimedia.org/T107306)
[05:56:13] (PS1) KartikMistry: apertium-eo-ca: Rebuild for Jessie and fixed dependencies [debs/contenttranslation/apertium-eo-ca] - https://gerrit.wikimedia.org/r/294432 (https://phabricator.wikimedia.org/T107306)
[05:57:34] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:30:28] (PS4) Giuseppe Lavagetto: mediawiki::jobrunner: systemd compatibility [puppet] - https://gerrit.wikimedia.org/r/294255
[06:31:03] PROBLEM - puppet last run on mw2077 is CRITICAL: CRITICAL: puppet fail
[06:31:04] PROBLEM - puppet last run on aqs1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:25] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:33] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:31] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:11] PROBLEM - puppet last run on snapshot1007 is CRITICAL:
CRITICAL: Puppet has 1 failures
[06:37:22] PROBLEM - puppet last run on elastic2024 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:39:31] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:42:58] (CR) Giuseppe Lavagetto: [C: 2] mediawiki::jobrunner: systemd compatibility [puppet] - https://gerrit.wikimedia.org/r/294255 (owner: Giuseppe Lavagetto)
[06:43:31] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.030 second response time
[06:46:22] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[06:48:11] PROBLEM - puppet last run on mw2080 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:50:20] RECOVERY - puppet last run on mw2080 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:50:51] PROBLEM - puppet last run on mw1010 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:55:01] RECOVERY - puppet last run on mw1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:40] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:57:21] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:40] RECOVERY - puppet last run on aqs1002 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:58:02] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:21] RECOVERY - puppet last run on elastic2024 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:58:41] RECOVERY - puppet last run on mw2077 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:58:58] !log installing apache trusty updates on eqiad app servers
[06:59:01]
RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:00:21] (CR) Urbanecm: "@Dereckson In the final voting in 2016 about 12 people participated with no objections. Kowiki has about 100 very active users (100+ edits" [mediawiki-config] - https://gerrit.wikimedia.org/r/294254 (https://phabricator.wikimedia.org/T130808) (owner: Urbanecm)
[07:05:27] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[07:23:09] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2381923 (Danny_B) `/search/` (used to search commits by particular user) seems to have no straightforward way to be...
[07:27:12] * gehel is checking this high response time on elasticsearch. Multiple fairly large merges were in progress during the slowdown...
[07:28:05] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2381927 (Joe) I honestly don't think we need to preserve every single link that was in gitblit (like links to the s...
[07:39:44] Operations, Discovery, Maps, Discovery-Maps-Sprint: Review alerting scheme for Maps - https://phabricator.wikimedia.org/T137869#2381946 (Gehel)
[08:04:03] PROBLEM - HHVM jobrunner on mw1299 is CRITICAL: Connection timed out
[08:05:23] PROBLEM - DPKG on mw1299 is CRITICAL: Timeout while attempting connection
[08:05:43] PROBLEM - Disk space on mw1299 is CRITICAL: Timeout while attempting connection
[08:06:24] PROBLEM - MD RAID on mw1299 is CRITICAL: Timeout while attempting connection
[08:07:15] PROBLEM - configured eth on mw1299 is CRITICAL: Timeout while attempting connection
[08:07:34] PROBLEM - dhclient process on mw1299 is CRITICAL: Timeout while attempting connection
[08:07:44] PROBLEM - mediawiki-installation DSH group on mw1299 is CRITICAL: Host mw1299 is not in mediawiki-installation dsh group
[08:08:13] PROBLEM - nutcracker port on mw1299 is CRITICAL: Timeout while attempting connection
[08:08:24] PROBLEM - nutcracker process on mw1299 is CRITICAL: Timeout while attempting connection
[08:08:43] PROBLEM - puppet last run on mw1299 is CRITICAL: Timeout while attempting connection
[08:09:03] PROBLEM - salt-minion processes on mw1299 is CRITICAL: Timeout while attempting connection
[08:09:15] PROBLEM - Check size of conntrack table on mw1299 is CRITICAL: Timeout while attempting connection
[08:12:22] (PS1) Jcrespo: Depool db1033 for cloning [mediawiki-config] - https://gerrit.wikimedia.org/r/294439 (https://phabricator.wikimedia.org/T133398)
[08:13:42] (CR) Jcrespo: [C: 2] Depool db1033 for cloning [mediawiki-config] - https://gerrit.wikimedia.org/r/294439 (https://phabricator.wikimedia.org/T133398) (owner: Jcrespo)
[08:14:24] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[08:15:21] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1033 for cloning (duration: 00m 38s)
[08:15:25] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:15:36] Could not resolve hostname mw2145.codfw.wmnet: No address associated with hostname
[08:15:54] is that a new host, on dsh but not installed?
[08:16:38] I see it up and working
[08:18:20] and it resolves correctly from tin
[08:18:33] and I can do pull from the host
[08:19:21] <_joe_> jynus: hm that's strange indeed
[08:19:28] <_joe_> it's not a new host at all
[08:19:38] <_joe_> new hosts are mw2215+
[08:19:44] something is broken in between, I would say in mw config
[08:20:28] <_joe_> why mw config?
[08:20:34] <_joe_> where did you see that error?
[08:20:38] just a guess
[08:20:52] on scap sync
[08:21:01] <_joe_> what's more probable is a failure in name resolution
[08:21:03] <_joe_> of some kind
[08:21:18] a temporary one?
[08:21:33] because I pinged from tin immediately after and it worked
[08:22:08] <_joe_> jynus: I'm not familiar enough with scap3 code, but I guess it doesn't retry on dns failures
[08:29:30] !log turning down db1033 for cloning to new s7 slaves
[08:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:31:14] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:31:35] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[08:31:43] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1033.eqiad.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on db1033.eqiad.wmnet (111 Connection refused)
[08:32:48] no worries
[08:34:00] one day I will fix dbstore1001
[08:34:04] one day
[08:40:16] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:42:07] PROBLEM - NTP on mw1299 is CRITICAL: NTP CRITICAL: Offset unknown
[08:42:35] <_joe_> mw1299 is being
installed
[08:45:26] RECOVERY - dhclient process on mw1299 is OK: PROCS OK: 0 processes with command name dhclient
[08:45:27] RECOVERY - Check size of conntrack table on mw1299 is OK: OK: nf_conntrack is 0 % full
[08:45:37] RECOVERY - nutcracker process on mw1299 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[08:45:47] RECOVERY - DPKG on mw1299 is OK: All packages OK
[08:45:58] RECOVERY - MD RAID on mw1299 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[08:46:17] RECOVERY - salt-minion processes on mw1299 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:46:18] RECOVERY - Disk space on mw1299 is OK: DISK OK
[08:46:26] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.032 second response time
[08:46:58] RECOVERY - nutcracker port on mw1299 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[08:47:04] _joe_ starting mw1274 and mw1275
[08:47:07] RECOVERY - configured eth on mw1299 is OK: OK - interfaces up
[08:47:30] <_joe_> elukey: ok
[08:53:49] !log depooling mw1154 (image scaler) for kernel update
[08:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:54:07] PROBLEM - puppet last run on mw1154 is CRITICAL: CRITICAL: Puppet last ran 1 day ago
[08:57:17] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:58:16] RECOVERY - HHVM jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.009 second response time
[08:58:46] RECOVERY - puppet last run on mw1154 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:59:32] Operations, Discovery, Maps, Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2382079 (mobrovac) >>!
In T133744#2381269, @BBlack wrote: > @Yurik - T137617 does detailed service monitoring on each node (and probably shouldn't page people). What we're lackin...
[08:59:47] RECOVERY - NTP on mw1299 is OK: NTP OK: Offset 0.001037836075 secs
[09:01:47] PROBLEM - puppet last run on mw1299 is CRITICAL: CRITICAL: Puppet has 5 failures
[09:07:56] <_joe_> 5 failures, all accounted for
[09:08:03] <_joe_> 2 are my huge WTF
[09:10:07] PROBLEM - DPKG on mw1154 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:11:53] Operations, Discovery, Maps, Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2382090 (BBlack) >>! In T133744#2382079, @mobrovac wrote: >>>! In T133744#2381269, @BBlack wrote: >> @Yurik - T137617 does detailed service monitoring on each node (and probably s...
[09:12:10] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2382091 (Paladox) @mmodell ^^
[09:12:27] RECOVERY - DPKG on mw1154 is OK: All packages OK
[09:13:31] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2382092 (Paladox) @Danny_B search doesn't work on git at the moment so it is unlikely anyone is using that. We cou...
[09:13:47] !log repooled mw1154 (kernel still the same ATM)
[09:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:23:07] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:25:17] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.020 second response time
[09:27:17] Operations, ops-codfw: ms-be2012.codfw.wmnet: slot=10 dev=sdk failed - https://phabricator.wikimedia.org/T135975#2382108 (fgiunchedi) @Papaul please order one more spare since there's another disk waiting for replacement in {T137785} and should be 2TB SAS too
[09:28:01] Hi, how can I request a new admin password for a mailing list? Both admins are locked out. The general password reset we got last November isn't working.
[09:28:17] Investigating what happened would take more time than resetting that password (and yes, I have searched my mailbox extensively)
[09:28:43] qgil, create a ticket on phabricator and our clinic duty will reset the password
[09:29:36] Great, thank you jynus
[09:29:37] Operations, ops-codfw: ms-be2008.codfw.wmnet: slot=1 dev=sdl failed - https://phabricator.wikimedia.org/T131147#2382110 (fgiunchedi) Open>Resolved disk is in service but I've never resolved the task
[09:29:59] jynus, should I associate the task to #operations and/or something else?
[09:29:59] Operations, ops-codfw: ms-be2007.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T133517#2382114 (fgiunchedi) Open>Resolved disk is in service but I've never resolved the task
[09:30:20] qgil, probably Wikimedia-Mailing-lists ?
[09:30:45] makes sense
[09:32:39] Operations, Wikimedia-Mailing-lists: Please reset password of hackathonorganizers mailing list - https://phabricator.wikimedia.org/T137873#2382117 (Qgil)
[09:32:46] Done: https://phabricator.wikimedia.org/T137873
[09:33:18] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:33:26] Y saludos, jynus. ;)
[09:33:35] saludos
[09:34:02] (PS1) Mobrovac: Kartotherian: Set up LVS checks [puppet] - https://gerrit.wikimedia.org/r/294454
[09:34:10] Operations, cassandra: 1000+ keyspace metrics you didn't see coming - https://phabricator.wikimedia.org/T137304#2382130 (fgiunchedi) Open>Resolved indeed those metrics have stopped updating now, I've removed `Keyspace` and `Client` now ``` $ du -hcs */org/apache/cassandra/metrics/{Keyspace,Clien...
[09:35:30] Operations, Discovery, Maps, Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2382134 (mobrovac)
[09:35:32] Operations, Discovery, Maps, Patch-For-Review: "Is maps service alive?" check - https://phabricator.wikimedia.org/T137851#2382132 (mobrovac) Resolved>Open Reopening for setting up full LVS checks.
[09:35:48] (PS2) Mobrovac: Kartotherian: Set up LVS checks [puppet] - https://gerrit.wikimedia.org/r/294454 (https://phabricator.wikimedia.org/T137851)
[09:37:44] (CR) Mobrovac: "Note that this does not address the issue of not alerting the maps team."
[puppet] - https://gerrit.wikimedia.org/r/294454 (https://phabricator.wikimedia.org/T137851) (owner: Mobrovac)
[09:40:23] (CR) Ema: Parametrize supplementary response headers in vcl_config (2 comments) [puppet] - https://gerrit.wikimedia.org/r/294171 (owner: Ori.livneh)
[09:42:59] Operations, Traffic: Parametrization of VCL is inconsistent - https://phabricator.wikimedia.org/T137747#2382136 (ema) p:Triage>Normal
[09:43:12] (PS1) Elukey: Force Varnishkafka to filter out Varnish Pipe logs. [puppet] - https://gerrit.wikimedia.org/r/294455 (https://phabricator.wikimedia.org/T136314)
[09:45:31] (PS2) Elukey: Force Varnishkafka to filter out Varnish Pipe logs. [puppet] - https://gerrit.wikimedia.org/r/294455 (https://phabricator.wikimedia.org/T136314)
[09:45:32] Operations, Discovery, Maps, Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2382153 (mobrovac) >>! In T133744#2382090, @BBlack wrote: > No, I think what you're referring to that Joe set up is the per-service-host checks. Before we started looking at this...
[09:47:39] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.934 second response time
[09:48:28] !log bounce ms-be2003, xfs high load
[09:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:49:57] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 7.583 second response time
[09:52:07] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures
[09:53:00] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:54:14] Operations, Traffic: Parametrization of VCL is inconsistent - https://phabricator.wikimedia.org/T137747#2382164 (BBlack) Hiera usage isn't specific to VCL, it's all over the ops/puppet repo in several design patterns. Deciding to go against that is out of scope here, IMHO. That's not necessarily a defe...
[09:55:20] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2382165 (Danny_B) @Paladox You're not right, it actually works... Cf. ie. https://git.wikimedia.org/search/?s=jen...
[09:59:07] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2382174 (Paladox) @Danny_B oh sorry. I got my information from a task we closed as declined. Probably need to type a...
[09:59:25] Operations, Traffic: Parametrization of VCL is inconsistent - https://phabricator.wikimedia.org/T137747#2382175 (BBlack) Also, I don't think the "3 scopes" section above understands hiera, or I don't. Setting `varnish_version4` as a top-scope hiera variable does not make it available anywhere in puppet...
[09:59:31] 06Operations, 06Discovery, 06Maps, 03Discovery-Maps-Sprint: Review alerting scheme for Maps - https://phabricator.wikimedia.org/T137869#2382176 (10Gehel) [10:01:01] (03PS2) 10BBlack: X-Cache: fix missing "int" cases, add "err", "bug" [puppet] - 10https://gerrit.wikimedia.org/r/293721 [10:02:52] 06Operations, 06Discovery, 06Maps, 03Discovery-Maps-Sprint: Review alerting scheme for Maps - https://phabricator.wikimedia.org/T137869#2382186 (10Gehel) We should probably create a maps team with @Yurik, @MaxSem for kartotherian and tilerator alerts. [10:10:36] (03PS3) 10BBlack: X-Cache: fix missing "int" cases, add "err", "bug" [puppet] - 10https://gerrit.wikimedia.org/r/293721 [10:15:09] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:15:44] (03PS1) 10Giuseppe Lavagetto: mediawiki::jobrunner: fix systemd unit files [puppet] - 10https://gerrit.wikimedia.org/r/294457 [10:17:19] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.024 second response time [10:17:40] (03PS1) 10Muehlenhoff: Reenable firejail wrapper for imagemagick's convert [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294458 (https://phabricator.wikimedia.org/T135111) [10:18:00] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:21:47] (03PS1) 10Jcrespo: Repool db1033, first pool of db1079, db1086, db1094 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294459 (https://phabricator.wikimedia.org/T133398) [10:27:27] (03CR) 10Ema: [C: 031] X-Cache: fix missing "int" cases, add "err", "bug" [puppet] - 10https://gerrit.wikimedia.org/r/293721 (owner: 10BBlack) [10:27:35] (03PS4) 10BBlack: X-Cache: fix missing "int" cases, add "err", "bug" [puppet] - 10https://gerrit.wikimedia.org/r/293721 [10:27:37] (03PS1) 10BBlack: varnishxcache: remove support for legacy X-Cache [puppet] - 
10https://gerrit.wikimedia.org/r/294460 [10:27:39] (03PS1) 10BBlack: varnishxcache: support new err/bug outputs [puppet] - 10https://gerrit.wikimedia.org/r/294461 [10:28:08] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:30:34] (03PS3) 10Elukey: Force Varnishkafka to filter out Varnish Pipe logs [puppet] - 10https://gerrit.wikimedia.org/r/294455 (https://phabricator.wikimedia.org/T136314) [10:32:10] (03CR) 10BBlack: [C: 032] varnishxcache: remove support for legacy X-Cache [puppet] - 10https://gerrit.wikimedia.org/r/294460 (owner: 10BBlack) [10:33:08] (03CR) 10BBlack: [C: 032] varnishxcache: support new err/bug outputs [puppet] - 10https://gerrit.wikimedia.org/r/294461 (owner: 10BBlack) [10:33:16] (03CR) 10BBlack: [V: 032] varnishxcache: support new err/bug outputs [puppet] - 10https://gerrit.wikimedia.org/r/294461 (owner: 10BBlack) [10:34:01] (03CR) 10Ema: [C: 031] "Looks good! Perhaps after monitoring which requests have no Timestamp:Resp we could even be a little more aggressive and filter them out a" [puppet] - 10https://gerrit.wikimedia.org/r/294455 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [10:37:35] (03PS1) 10Gergő Tisza: Send authentication events to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294462 [10:41:48] (03PS1) 10BBlack: X-Cache: mark miss->hit_for_pass as pass [puppet] - 10https://gerrit.wikimedia.org/r/294464 [10:45:19] (03PS2) 10Giuseppe Lavagetto: mediawiki::jobrunner: fix systemd unit files, logging [puppet] - 10https://gerrit.wikimedia.org/r/294457 [10:45:49] (03CR) 10Muehlenhoff: [C: 032 V: 032] Reenable firejail wrapper for imagemagick's convert [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294458 (https://phabricator.wikimedia.org/T135111) (owner: 10Muehlenhoff) [10:48:03] !log jmm@tin Synchronized wmf-config/CommonSettings.php: firejail security hardening for image scalers (duration: 00m 26s) [10:48:07] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:49:50] (03CR) 10BBlack: [C: 032] X-Cache: fix missing "int" cases, add "err", "bug" [puppet] - 10https://gerrit.wikimedia.org/r/293721 (owner: 10BBlack) [10:50:36] (03CR) 10BBlack: [C: 032] X-Cache: mark miss->hit_for_pass as pass [puppet] - 10https://gerrit.wikimedia.org/r/294464 (owner: 10BBlack) [10:51:05] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [10:55:24] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2382235 (10Danny_B) It's not classical search though. But on every page, where username is mentioned, this username l... [10:58:46] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Puppet has 2 failures [10:59:06] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: Puppet has 2 failures [10:59:10] heh that's me [10:59:19] !log rebooting install2001 again [10:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:59:38] should've known, varnish4 would be different... 
[11:00:04] (03PS1) 10Faidon Liambotis: lvs: rate-limit more ICMP codes, lower to 1/200ms [puppet] - 10https://gerrit.wikimedia.org/r/294467 [11:00:07] bblack: RFC ^^ [11:00:15] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 2 failures [11:00:52] I'm not sure at all that I'd like us to do that [11:00:56] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Puppet has 2 failures [11:01:43] PROBLEM - puppet last run on cp2012 is CRITICAL: CRITICAL: Puppet has 2 failures [11:02:12] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Puppet has 2 failures [11:02:22] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [11:02:33] PROBLEM - puppet last run on cp1046 is CRITICAL: CRITICAL: Puppet has 2 failures [11:02:33] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 2 failures [11:02:42] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 2 failures [11:03:02] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet has 2 failures [11:03:03] PROBLEM - Host install2001 is DOWN: PING CRITICAL - Packet loss = 100% [11:03:04] paravoid: yeah me either, I think even the default ones could break legit traffic if the ratelimit kicks in under normal conditions [11:03:24] e.g. destination unreachable? 
[11:04:52] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: Puppet has 2 failures [11:04:52] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 2 failures [11:04:52] PROBLEM - Apache HTTP on mw1275 is CRITICAL: Connection refused [11:05:02] PROBLEM - puppet last run on cp2018 is CRITICAL: CRITICAL: Puppet has 2 failures [11:05:02] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Puppet has 2 failures [11:05:10] (03PS1) 10BBlack: Revert "X-Cache: mark miss->hit_for_pass as pass" [puppet] - 10https://gerrit.wikimedia.org/r/294468 [11:05:12] PROBLEM - puppet last run on cp2009 is CRITICAL: CRITICAL: Puppet has 2 failures [11:05:12] RECOVERY - Host install2001 is UP: PING OK - Packet loss = 0%, RTA = 38.83 ms [11:05:36] (03CR) 10BBlack: [C: 032 V: 032] "Fixing this is going to get complicated. v4 backend_response can't set req.*" [puppet] - 10https://gerrit.wikimedia.org/r/294468 (owner: 10BBlack) [11:05:42] PROBLEM - nutcracker port on mw1275 is CRITICAL: Connection refused by host [11:06:01] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 2 failures [11:06:03] PROBLEM - nutcracker process on mw1275 is CRITICAL: Connection refused by host [11:06:11] 06Operations: Frequent segfaults of rsvg-convert on image scalers - https://phabricator.wikimedia.org/T137876#2382253 (10MoritzMuehlenhoff) [11:06:11] PROBLEM - puppet last run on cp1060 is CRITICAL: CRITICAL: Puppet has 2 failures [11:06:22] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Puppet has 2 failures [11:06:31] PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: Puppet has 2 failures [11:06:32] PROBLEM - puppet last run on mw1275 is CRITICAL: Connection refused by host [11:06:32] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 2 failures [11:06:52] PROBLEM - salt-minion processes on mw1275 is CRITICAL: Connection refused by host [11:07:03] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Puppet has 2 
failures [11:07:31] PROBLEM - Check size of conntrack table on mw1275 is CRITICAL: Connection refused by host [11:07:31] PROBLEM - puppet last run on cp2003 is CRITICAL: CRITICAL: Puppet has 2 failures [11:07:46] mw1275 is a new appserver, silencing it since it is now in icinga [11:07:51] PROBLEM - DPKG on mw1275 is CRITICAL: Connection refused by host [11:07:53] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 2 failures [11:08:02] PROBLEM - Disk space on mw1275 is CRITICAL: Connection refused by host [11:08:11] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: Puppet has 2 failures [11:08:23] PROBLEM - MD RAID on mw1275 is CRITICAL: Connection refused by host [11:08:24] (03PS2) 10Jcrespo: Repool db1033, first pool of db1079, db1086, db1094 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294459 (https://phabricator.wikimedia.org/T133398) [11:08:31] PROBLEM - puppet last run on cp2021 is CRITICAL: CRITICAL: Puppet has 2 failures [11:09:02] PROBLEM - puppet last run on cp2006 is CRITICAL: CRITICAL: Puppet has 2 failures [11:09:05] it's been a while since we've had such a huge spam of cp* puppetfails, nice to remember they're still here :) [11:09:38] (03CR) 10Jcrespo: [C: 032] Repool db1033, first pool of db1079, db1086, db1094 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294459 (https://phabricator.wikimedia.org/T133398) (owner: 10Jcrespo) [11:10:23] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: Puppet has 2 failures [11:11:04] (03PS3) 10Giuseppe Lavagetto: mediawiki::jobrunner: fix systemd unit files, logging [puppet] - 10https://gerrit.wikimedia.org/r/294457 [11:11:33] !log enabed firejail wrapper for imagemagick's convert (for image scalers and the Score extension) [11:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:11:42] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet has 2 failures [11:12:37] (03PS1) 
10BBlack: Revert "X-Cache: fix missing "int" cases, add "err", "bug"" [puppet] - 10https://gerrit.wikimedia.org/r/294469 [11:12:43] (03PS2) 10BBlack: Revert "X-Cache: fix missing "int" cases, add "err", "bug"" [puppet] - 10https://gerrit.wikimedia.org/r/294469 [11:12:51] (03CR) 10BBlack: [C: 032 V: 032] Revert "X-Cache: fix missing "int" cases, add "err", "bug"" [puppet] - 10https://gerrit.wikimedia.org/r/294469 (owner: 10BBlack) [11:12:57] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::jobrunner: fix systemd unit files, logging [puppet] - 10https://gerrit.wikimedia.org/r/294457 (owner: 10Giuseppe Lavagetto) [11:13:02] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1033, first pool of db1079, db1086, db1094 with low weight (duration: 00m 25s) [11:13:05] (03PS4) 10Giuseppe Lavagetto: mediawiki::jobrunner: fix systemd unit files, logging [puppet] - 10https://gerrit.wikimedia.org/r/294457 [11:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:13:47] (03CR) 10Giuseppe Lavagetto: [V: 032] mediawiki::jobrunner: fix systemd unit files, logging [puppet] - 10https://gerrit.wikimedia.org/r/294457 (owner: 10Giuseppe Lavagetto) [11:13:52] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Puppet has 2 failures [11:14:31] PROBLEM - puppet last run on cp2015 is CRITICAL: CRITICAL: Puppet has 2 failures [11:15:01] RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [11:15:41] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 2 failures [11:15:42] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:15:52] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 2 failures [11:16:21] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [11:16:31] RECOVERY - puppet last run on cp1059 is OK: 
OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [11:16:32] RECOVERY - puppet last run on cp2012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:17:01] PROBLEM - puppet last run on mw1299 is CRITICAL: CRITICAL: puppet fail [11:17:21] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:17:22] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [11:17:52] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:17:52] RECOVERY - puppet last run on cp1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:18:01] RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:18:02] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [11:18:02] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:18:02] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:18:22] RECOVERY - puppet last run on cp2006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:18:22] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:18:22] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [11:18:34] (03PS1) 10Jcrespo: Increase new enwiki dbs weight, depool db1023 for cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294470 (https://phabricator.wikimedia.org/T133398) [11:18:41] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures 
[11:18:41] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:18:41] RECOVERY - puppet last run on cp2025 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [11:18:42] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [11:18:43] RECOVERY - puppet last run on cp2018 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:18:51] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:18:53] RECOVERY - puppet last run on cp2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:18:53] RECOVERY - puppet last run on cp2009 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [11:19:01] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [11:19:02] PROBLEM - puppet last run on mw2117 is CRITICAL: CRITICAL: puppet fail [11:19:02] RECOVERY - puppet last run on cp2015 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [11:19:21] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:19:22] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [11:19:28] (03CR) 10Jcrespo: [C: 032] Increase new enwiki dbs weight, depool db1023 for cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294470 (https://phabricator.wikimedia.org/T133398) (owner: 10Jcrespo) [11:19:33] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:19:41] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:19:42] RECOVERY - Swift HTTP 
frontend on ms-fe3002 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.190 second response time [11:19:53] RECOVERY - puppet last run on cp1060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:19:53] RECOVERY - puppet last run on cp2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:20:22] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:20:41] RECOVERY - Swift HTTP backend on ms-fe3002 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.206 second response time [11:21:35] (03PS1) 10Giuseppe Lavagetto: mediawiki::jobrunner: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/294471 [11:22:48] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Increase new enwiki dbs weight, depool db1023 for cloning (duration: 00m 27s) [11:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:23:47] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::jobrunner: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/294471 (owner: 10Giuseppe Lavagetto) [11:25:18] (03PS1) 10KartikMistry: apertium-eo-en: New upstream version and Jessie rebuild [debs/contenttranslation/apertium-eo-en] - 10https://gerrit.wikimedia.org/r/294472 (https://phabricator.wikimedia.org/T107306) [11:29:22] !log stopping db1023 for cloning to new s6 hosts [11:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:30:40] !log change-prop deploying 353b926 [11:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:32:57] !log scb disabled puppet for 5 min to keep change-prop down [11:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:37:45] (03CR) 10Mobrovac: [C: 031] Deploy Graphoid with Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/294357 (owner: 10Thcipriani) [11:38:11] PROBLEM - MariaDB Slave IO: 
s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1023.eqiad.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on db1023.eqiad.wmnet (111 Connection refused) [11:38:31] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [11:39:11] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [11:39:34] <_joe_> mobrovac: ^^ [11:39:37] <_joe_> what's up? [11:39:40] known [11:39:42] <_joe_> oh I see [11:39:43] <_joe_> sorry [11:39:47] see !log [11:39:48] :) [11:39:50] <_joe_> yes [11:39:56] <_joe_> I saw icinga instead [11:39:59] hehe [11:43:41] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [11:43:58] (03PS1) 10Mobrovac: Revert "Change Prop: Disable transclusion update rules" [puppet] - 10https://gerrit.wikimedia.org/r/294476 [11:44:02] !log upgrading firejail to 0.9.38 on maps servers [11:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:46:14] !log scb enabled puppet back [11:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:46:51] gehel: Why not 0.9.40 [11:47:17] :P [11:47:30] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [11:47:40] Bsadowski1: because .38 is available in Jessie backports. And because that's what moritzm suggested and he knows better than I do ! [11:47:50] Ah, I wouldn't know. 
[11:48:01] RECOVERY - puppet last run on mw2117 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:48:24] I am testing a Debian installation in a VM... [11:48:31] Debian 8.4 :) [11:48:38] I love it! [11:48:56] Bsadowski1: yes, VMs are cool :P [11:49:09] Bsadowski1: we already have 0.38 on most systems and I'd like to have some consistency before further upgrades (since some options are only made available in newer versions) [11:49:48] Oh I was looking at firejail's site and saw a more recent version [11:50:24] Bsadowski1: we don't typically track latest upstream versions, unless we really need to. [11:51:10] (03CR) 10Ppchelko: [C: 04-1] "I found another unrelated bug that could be uncovered by this, -1 until it's resolved." [puppet] - 10https://gerrit.wikimedia.org/r/294476 (owner: 10Mobrovac) [11:51:14] Well, you know Linux better than I do :) [11:52:01] ACKNOWLEDGEMENT - kartotherian endpoints health on maps2001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.0.144, port=6533): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Gehel firejail upgrade in progress [11:52:58] moritzm: kartotherian not restarting after firejail upgrade... checking... [11:54:25] which host? [11:54:45] maps2001.codfw.wmnet. Something about not running as root and tmpfs [11:54:51] I'll have a look [11:55:01] moritzm: nothing urgent, I can check and ping you if I'm lost... 
[11:55:50] RECOVERY - Apache HTTP on mw1275 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.006 second response time [11:56:38] moritzm: --tmpfs is now only valid when run as root (according to a diff in man pages) [11:56:50] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [11:58:01] RECOVERY - nutcracker port on mw1275 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [11:58:21] RECOVERY - Check size of conntrack table on mw1275 is OK: OK: nf_conntrack is 0 % full [11:58:40] RECOVERY - salt-minion processes on mw1275 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:58:41] indeed, that was changed in 0.9.38 [11:59:02] RECOVERY - Disk space on mw1275 is OK: DISK OK [11:59:02] RECOVERY - MD RAID on mw1275 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [11:59:11] RECOVERY - nutcracker process on mw1275 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [11:59:28] (03PS1) 10BBlack: VCL: prevent inter-cache loop bugs [puppet] - 10https://gerrit.wikimedia.org/r/294478 (https://phabricator.wikimedia.org/T134404) [11:59:30] moritzm: something I do not understand about systemd units... both tilerator and kartotherian have a service definition that says "User=kartotherian" (or "tilerator"), but "ps axu | grep firejail" makes me think that firejail is indeed running as root for tilerator [11:59:42] * gehel does not understand much about systemd... [12:01:50] that's just the firejail process itself (which needs elevated privileges to set up namespaces etc.), the actual processes are run with user privileges (such as tilerator) [12:02:08] RECOVERY - DPKG on mw1275 is OK: All packages OK [12:02:22] moritzm: right, firejail is setuid! [12:03:05] I'll look into that and prepare a patch for service::node [12:03:42] moritzm: thanks! 
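(Editor's note on the exchange above: the point that the firejail wrapper shows up as root in "ps" while the sandboxed service runs as its own user can be checked by looking at the owners of the wrapper's descendants rather than the wrapper itself. The following is only a minimal sketch under stated assumptions: the sample process table, its PIDs and the "nodejs" command name are made up for illustration and were not taken from maps2001, and the helper name sandboxed_users is hypothetical.)

```python
# Hypothetical sample in the shape of `ps -eo user,pid,ppid,comm` output.
# PIDs and command names are invented for illustration only.
SAMPLE_PS = """\
USER          PID  PPID COMMAND
root            1     0 systemd
root         4210     1 firejail
tilerator    4211  4210 nodejs
root         4300     1 firejail
kartotherian 4301  4300 nodejs
"""


def sandboxed_users(ps_output, wrapper="firejail"):
    """Return {pid: user} for every non-wrapper process that has a
    process named `wrapper` somewhere among its ancestors."""
    rows = [line.split(None, 3)
            for line in ps_output.splitlines()[1:] if line.strip()]
    # pid -> (user, parent pid, command name)
    procs = {int(pid): (user, int(ppid), comm)
             for user, pid, ppid, comm in rows}
    result = {}
    for pid, (user, ppid, comm) in procs.items():
        if comm == wrapper:
            continue  # the wrapper itself is expected to be owned by root
        ancestor = ppid
        # Walk up the parent chain until we leave the known process table.
        while ancestor in procs:
            if procs[ancestor][2] == wrapper:
                result[pid] = user
                break
            ancestor = procs[ancestor][1]
    return result


if __name__ == "__main__":
    for pid, user in sorted(sandboxed_users(SAMPLE_PS).items()):
        print(pid, user)
```

(On a real host the same function could be fed the captured output of "ps -eo user,pid,ppid,comm"; whether the service sits directly under the wrapper or a few levels down is not assumed here, which is why the sketch walks the whole parent chain.)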
[12:05:55] (03PS2) 10BBlack: VCL: prevent inter-cache loop bugs with X-DCPath [puppet] - 10https://gerrit.wikimedia.org/r/294478 (https://phabricator.wikimedia.org/T134404) [12:06:05] !log upgrade of firejail on maps server stopped, pending a patch to service::node [12:06:08] PROBLEM - DPKG on mw1276 is CRITICAL: Timeout while attempting connection [12:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:06:28] PROBLEM - Disk space on mw1276 is CRITICAL: Timeout while attempting connection [12:06:46] ^note that maps2001.codfw.wmnet already has latest firejail version, kartotherian is stopped and not starting. [12:06:49] PROBLEM - MD RAID on mw1276 is CRITICAL: Timeout while attempting connection [12:07:49] PROBLEM - Apache HTTP on mw1276 is CRITICAL: Connection timed out [12:07:49] PROBLEM - configured eth on mw1276 is CRITICAL: Timeout while attempting connection [12:08:19] PROBLEM - dhclient process on mw1276 is CRITICAL: Timeout while attempting connection [12:08:29] PROBLEM - mediawiki-installation DSH group on mw1276 is CRITICAL: Host mw1276 is not in mediawiki-installation dsh group [12:08:58] PROBLEM - nutcracker port on mw1276 is CRITICAL: Timeout while attempting connection [12:09:18] PROBLEM - nutcracker process on mw1276 is CRITICAL: Timeout while attempting connection [12:09:28] PROBLEM - puppet last run on mw1276 is CRITICAL: Timeout while attempting connection [12:09:48] PROBLEM - salt-minion processes on mw1276 is CRITICAL: Timeout while attempting connection [12:10:19] PROBLEM - Check size of conntrack table on mw1276 is CRITICAL: Timeout while attempting connection [12:15:53] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2382314 (10Paladox) @Danny_B ok, would you be able to add the rules to git.wmflabs.org please. I know your working on... 
[12:16:13] 06Operations, 10Traffic, 13Patch-For-Review, 05codfw-rollout: Varnish support for active:active backend services - https://phabricator.wikimedia.org/T134404#2382315 (10BBlack) Some further thinking: without changing the cache-level stuff discussed above, this would also support a config like: ``` restbase... [12:17:33] I think hashar is not around, anybody knows what is the current status of gallium [12:17:54] ? I have some backups on some db nodes I want to put into production today [12:18:28] greg-g ^^ [12:19:02] I can move them to, e.g. einstinium, but only if they are useful [12:21:35] (03PS3) 10BBlack: Kartotherian: Set up LVS checks [puppet] - 10https://gerrit.wikimedia.org/r/294454 (https://phabricator.wikimedia.org/T137851) (owner: 10Mobrovac) [12:21:45] (03CR) 10BBlack: [C: 032 V: 032] Kartotherian: Set up LVS checks [puppet] - 10https://gerrit.wikimedia.org/r/294454 (https://phabricator.wikimedia.org/T137851) (owner: 10Mobrovac) [12:23:32] 06Operations, 06Discovery, 06Maps, 13Patch-For-Review: "Is maps service alive?" check - https://phabricator.wikimedia.org/T137851#2382317 (10BBlack) As noted in the commitmsg above, we should figure out icinga contactgroup stuff for this, too. Who is the correct team to get the alerts? [12:23:58] 06Operations, 10Ops-Access-Requests: Allow *-admin groups to see systemd logs for their units - https://phabricator.wikimedia.org/T137878#2382319 (10mobrovac) [12:24:15] (03CR) 10Dr0ptp4kt: "Just wanted to acknowledge I've seen the patchset and looked a bit at the ZeroBanner extension lines pertaining to it. 
I'll need several d" [puppet] - 10https://gerrit.wikimedia.org/r/294052 (owner: 10BBlack) [12:25:30] (03CR) 10BBlack: [C: 031] Force Varnishkafka to filter out Varnish Pipe logs [puppet] - 10https://gerrit.wikimedia.org/r/294455 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [12:25:55] (03PS1) 10Giuseppe Lavagetto: mediawiki::jobrunner: unquote user/group directives [puppet] - 10https://gerrit.wikimedia.org/r/294482 [12:27:55] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::jobrunner: unquote user/group directives [puppet] - 10https://gerrit.wikimedia.org/r/294482 (owner: 10Giuseppe Lavagetto) [12:29:08] 06Operations, 10Ops-Access-Requests, 06Parsing-Team, 06Services: Allow the Services team to administer the Parsoid cluster - https://phabricator.wikimedia.org/T137879#2382336 (10mobrovac) [12:37:12] <_joe_> !log rebooting mw1299 [12:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:40:39] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:41:24] can someone take a look at graphite? 
grafana.wikimedia.org shows actually no data, for example here: https://grafana.wikimedia.org/dashboard/db/labs-project-board?var-project=rcm&var-server=All&theme=dark [12:41:48] PROBLEM - nutcracker process on mw1299 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (nutcracker), command name nutcracker [12:41:50] and nagf, which depends on graphite, too [12:42:48] PROBLEM - HHVM jobrunner on mw1299 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.005 second response time [12:42:58] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [12:44:29] RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.004 second response time [12:45:02] (03PS1) 10Muehlenhoff: Remove --tmpfs option in service::node and zotero [puppet] - 10https://gerrit.wikimedia.org/r/294483 (https://phabricator.wikimedia.org/T121756) [12:45:30] PROBLEM - nutcracker port on mw1299 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [12:46:18] PROBLEM - NTP on mw1276 is CRITICAL: NTP CRITICAL: Offset unknown [12:48:18] RECOVERY - nutcracker process on mw1276 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [12:48:49] RECOVERY - salt-minion processes on mw1276 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:49:09] RECOVERY - configured eth on mw1276 is OK: OK - interfaces up [12:49:18] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 18.239 second response time [12:49:18] RECOVERY - Check size of conntrack table on mw1276 is OK: OK: nf_conntrack is 0 % full [12:49:25] (03CR) 10Ottomata: [C: 031] "Haven't tested but sounds good to me" [puppet] - 10https://gerrit.wikimedia.org/r/294455 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [12:49:29] RECOVERY 
- dhclient process on mw1276 is OK: PROCS OK: 0 processes with command name dhclient [12:49:58] RECOVERY - Disk space on mw1276 is OK: DISK OK [12:50:10] RECOVERY - nutcracker port on mw1276 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [12:50:28] (03PS4) 10Elukey: Force Varnishkafka to filter out Varnish Pipe logs [puppet] - 10https://gerrit.wikimedia.org/r/294455 (https://phabricator.wikimedia.org/T136314) [12:50:29] RECOVERY - MD RAID on mw1276 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [12:51:00] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [12:51:14] mw127[67] are new appservers, silencing now! [12:51:29] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [12:52:15] (03CR) 10Elukey: [C: 032] Force Varnishkafka to filter out Varnish Pipe logs [puppet] - 10https://gerrit.wikimedia.org/r/294455 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [12:53:48] anonymous save on enwiki in VE shows "docserver-http: HTTP 404" [12:53:48] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 14.774 second response time [12:54:18] RECOVERY - DPKG on mw1276 is OK: All packages OK [12:55:10] (03PS1) 10Giuseppe Lavagetto: mediawiki::jobrunner: unquote EnvironmentFile too. [puppet] - 10https://gerrit.wikimedia.org/r/294484 [12:55:12] (03PS1) 10Giuseppe Lavagetto: scap: add mw1299 to dsh group [puppet] - 10https://gerrit.wikimedia.org/r/294485 [12:58:17] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::jobrunner: unquote EnvironmentFile too. 
[puppet] - 10https://gerrit.wikimedia.org/r/294484 (owner: 10Giuseppe Lavagetto) [12:58:44] (03CR) 10Giuseppe Lavagetto: [C: 032] scap: add mw1299 to dsh group [puppet] - 10https://gerrit.wikimedia.org/r/294485 (owner: 10Giuseppe Lavagetto) [13:00:39] PROBLEM - puppet last run on mw1299 is CRITICAL: CRITICAL: Puppet has 1 failures [13:02:27] RECOVERY - HHVM jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.026 second response time [13:02:41] (03PS1) 10Gehel: Adding gehel (Guillaume Lederrey) to icinga sms group [puppet] - 10https://gerrit.wikimedia.org/r/294486 [13:02:46] RECOVERY - NTP on mw1276 is OK: NTP OK: Offset -0.004576444626 secs [13:03:16] RECOVERY - nutcracker port on mw1299 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [13:04:56] RECOVERY - nutcracker process on mw1299 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [13:05:16] RECOVERY - puppet last run on mw1299 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:06:14] !log installing libav security updates [13:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:10:36] RECOVERY - puppet last run on mw1276 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [13:13:56] RECOVERY - mediawiki-installation DSH group on mw1299 is OK: OK [13:15:39] !log change-prop deployed 6ad337 [13:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:16:02] (03PS1) 10Ladsgroup: Deploy ORES beta feature in wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294488 (https://phabricator.wikimedia.org/T130212) [13:16:25] <_joe_> !log stopped jobchron, jobrunner on mw1299, masked in systemd [13:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:16:39] (03CR) 10jenkins-bot: [V: 04-1] Deploy ORES beta feature in wikidatawiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/294488 (https://phabricator.wikimedia.org/T130212) (owner: 10Ladsgroup) [13:16:42] 06Operations, 10Traffic, 10Wikimedia-Stream, 13Patch-For-Review: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2382462 (10BBlack) @ori @Krinkle - any thoughts or pointers on getting this tested more-broadly and then switching before the cert expiry date? [13:16:55] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:18:55] PROBLEM - puppet last run on mw1299 is CRITICAL: CRITICAL: Puppet has 2 failures [13:19:10] (03PS2) 10Ladsgroup: Deploy ORES beta feature in wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294488 (https://phabricator.wikimedia.org/T130212) [13:20:27] 06Operations, 10ops-codfw: install2001 hardware troubles - https://phabricator.wikimedia.org/T137647#2382472 (10faidon) The CPU issue has been alleviated, but the packet loss issue remains, at the previous levels of 0.5-5%. It's unlikely but this might be an entirely different issue altogether — @papaul, could... [13:23:41] (03CR) 10Ppchelko: [C: 031] "The bug was fixed by https://github.com/wikimedia/change-propagation/pull/57 so this is fine now." [puppet] - 10https://gerrit.wikimedia.org/r/294476 (owner: 10Mobrovac) [13:23:46] (03PS1) 10Filippo Giunchedi: install_server: ms-be300* to jessie [puppet] - 10https://gerrit.wikimedia.org/r/294489 (https://phabricator.wikimedia.org/T117972) [13:23:48] (03PS1) 10Filippo Giunchedi: lvs: add icinga config for ms-fe in eqiad/codfw [puppet] - 10https://gerrit.wikimedia.org/r/294490 [13:24:12] 06Operations, 06Discovery, 06Maps, 03Discovery-Maps-Sprint: Review alerting scheme for Maps - https://phabricator.wikimedia.org/T137869#2381946 (10Joe) We should add a service check for karthoterian using service_checker on the lvs IP, pretty much as we do for other services, see: https://phabricator.wiki... 
[13:27:32] (03PS2) 10Filippo Giunchedi: install_server: ms-be300* to jessie [puppet] - 10https://gerrit.wikimedia.org/r/294489 (https://phabricator.wikimedia.org/T117972) [13:27:39] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] install_server: ms-be300* to jessie [puppet] - 10https://gerrit.wikimedia.org/r/294489 (https://phabricator.wikimedia.org/T117972) (owner: 10Filippo Giunchedi) [13:28:19] 06Operations: ffmpeg/libav on jessie video scalers - https://phabricator.wikimedia.org/T137886#2382493 (10MoritzMuehlenhoff) [13:33:17] (03PS2) 10Giuseppe Lavagetto: Revert "Change Prop: Disable transclusion update rules" [puppet] - 10https://gerrit.wikimedia.org/r/294476 (owner: 10Mobrovac) [13:34:05] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Revert "Change Prop: Disable transclusion update rules" [puppet] - 10https://gerrit.wikimedia.org/r/294476 (owner: 10Mobrovac) [13:34:46] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [13:35:32] <_joe_> mobrovac: running puppet on scb* [13:35:40] kk [13:36:29] Hi it seems https://gerrit-dev.wmflabs.org/ is down. [13:36:32] apche2 stop [13:36:38] apache2 stopped [13:40:22] !log rolling back update of firejail on maps2001 [13:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:41:34] (03PS2) 10Faidon Liambotis: lvs: rate-limit more ICMP codes, lower to 1/200ms [puppet] - 10https://gerrit.wikimedia.org/r/294467 (https://phabricator.wikimedia.org/T136939) [13:41:52] (03CR) 10Faidon Liambotis: "I'm not sure at all that we should do that -- comments welcome!" [puppet] - 10https://gerrit.wikimedia.org/r/294467 (https://phabricator.wikimedia.org/T136939) (owner: 10Faidon Liambotis) [13:42:04] 06Operations, 10cassandra, 10hardware-requests: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2382536 (10Eevans) >>! 
In T136340#2380227, @RobH wrote: > In reviewing this request, it isn't clear to me how the administration of these machines would exist. > > Would the... [13:44:47] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.900 second response time [13:46:51] (03CR) 10Mobrovac: [C: 031] Remove --tmpfs option in service::node and zotero [puppet] - 10https://gerrit.wikimedia.org/r/294483 (https://phabricator.wikimedia.org/T121756) (owner: 10Muehlenhoff) [13:47:16] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 15.795 second response time [13:49:42] (03CR) 10Anomie: [C: 031] Send authentication events to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294462 (owner: 10Gergő Tisza) [13:50:22] (03CR) 10BBlack: "Without doing some research first, tbh I'm not sure if the default ratelimits are already hurting us in some cases. 
Don't we need some of" [puppet] - 10https://gerrit.wikimedia.org/r/294467 (https://phabricator.wikimedia.org/T136939) (owner: 10Faidon Liambotis) [13:55:52] !log remove unused PHP packages from the recently provisioned jessie app servers (new installation are fixed in puppet to only install php5-cli, but the initial set needs fixed up manually) [13:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:57:36] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:05:13] PROBLEM - Apache HTTP on mw1277 is CRITICAL: Connection timed out [14:06:10] 06Operations, 06Services, 10Traffic: Define a standardized config mechanism for exposing services through varnish - https://phabricator.wikimedia.org/T110717#2382584 (10BBlack) [14:06:12] PROBLEM - nutcracker process on mw1277 is CRITICAL: Timeout while attempting connection [14:06:41] PROBLEM - puppet last run on mw1277 is CRITICAL: Timeout while attempting connection [14:07:11] PROBLEM - salt-minion processes on mw1277 is CRITICAL: Timeout while attempting connection [14:07:48] (03PS1) 10BBlack: r::c::instances: remove dead option app_be_opts [puppet] - 10https://gerrit.wikimedia.org/r/294494 [14:07:50] (03PS1) 10BBlack: [WIP] - move app_directors logic to puppet parser func [puppet] - 10https://gerrit.wikimedia.org/r/294495 (https://phabricator.wikimedia.org/T110717) [14:07:51] PROBLEM - Check size of conntrack table on mw1277 is CRITICAL: Timeout while attempting connection [14:08:02] PROBLEM - DPKG on mw1277 is CRITICAL: Timeout while attempting connection [14:08:32] PROBLEM - Disk space on mw1277 is CRITICAL: Timeout while attempting connection [14:08:52] PROBLEM - MD RAID on mw1277 is CRITICAL: Timeout while attempting connection [14:09:34] (03CR) 10jenkins-bot: [V: 04-1] [WIP] - move app_directors logic to puppet parser func [puppet] - 10https://gerrit.wikimedia.org/r/294495 
(https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [14:09:41] --^ new appserver, silencing [14:09:41] PROBLEM - configured eth on mw1277 is CRITICAL: Timeout while attempting connection [14:10:13] (03CR) 10JanZerebecki: [C: 032] Load the RevisionSlider extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287936 (https://phabricator.wikimedia.org/T134770) (owner: 10Addshore) [14:11:09] (03PS7) 10JanZerebecki: Load the RevisionSlider extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287936 (https://phabricator.wikimedia.org/T134770) (owner: 10Addshore) [14:11:20] (03CR) 10JanZerebecki: [C: 032] Load the RevisionSlider extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287936 (https://phabricator.wikimedia.org/T134770) (owner: 10Addshore) [14:11:58] (03CR) 10BBlack: [C: 032] r::c::instances: remove dead option app_be_opts [puppet] - 10https://gerrit.wikimedia.org/r/294494 (owner: 10BBlack) [14:12:14] (03Merged) 10jenkins-bot: Load the RevisionSlider extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287936 (https://phabricator.wikimedia.org/T134770) (owner: 10Addshore) [14:12:43] (03PS1) 10Gergő Tisza: Handle invalid DB name in 'sql' shell script [puppet] - 10https://gerrit.wikimedia.org/r/294496 [14:13:41] PROBLEM - mediawiki-installation DSH group on mw1275 is CRITICAL: Host mw1275 is not in mediawiki-installation dsh group [14:14:13] (03PS2) 10BBlack: [WIP] - move app_directors logic to puppet parser func [puppet] - 10https://gerrit.wikimedia.org/r/294495 (https://phabricator.wikimedia.org/T110717) [14:15:25] (03CR) 10jenkins-bot: [V: 04-1] [WIP] - move app_directors logic to puppet parser func [puppet] - 10https://gerrit.wikimedia.org/r/294495 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [14:19:48] (03PS1) 10Elukey: Add mw127[56] to the Mediawiki scap DSH list. 
[puppet] - 10https://gerrit.wikimedia.org/r/294497 [14:22:07] (03CR) 10Elukey: [C: 032] Add mw127[56] to the Mediawiki scap DSH list. [puppet] - 10https://gerrit.wikimedia.org/r/294497 (owner: 10Elukey) [14:22:55] (03PS1) 10Gehel: Adding a "interactive-team" icinga group for alerting. [puppet] - 10https://gerrit.wikimedia.org/r/294498 (https://phabricator.wikimedia.org/T137869) [14:23:20] (03CR) 10Anomie: [C: 031] "Seems sane. I can't +2 in this repo though." [puppet] - 10https://gerrit.wikimedia.org/r/294496 (owner: 10Gergő Tisza) [14:23:53] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [14:23:53] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [14:24:16] !log installing php security updates on jessie systems [14:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:25:45] (03CR) 10Gehel: [C: 04-1] "Do not merge before appropriate contacts have been created in private git repo" [puppet] - 10https://gerrit.wikimedia.org/r/294498 (https://phabricator.wikimedia.org/T137869) (owner: 10Gehel) [14:26:11] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [14:29:35] (03PS1) 10Mobrovac: service::node: Output stdout and stderr seen by firejail to a log file [puppet] - 10https://gerrit.wikimedia.org/r/294499 [14:30:36] moritzm: mind taking a look at ^^ and let me know if it makes sense? 
[14:30:52] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [14:32:20] 06Operations, 10Ops-Access-Requests: Allow *-admin groups to see systemd logs for their units - https://phabricator.wikimedia.org/T137878#2382319 (10Krenair) I looked through some of the other -admin groups. Are any of these also relevant? ```gerrit-admin ocg-render-admins analytics-admins chromium-admin zoter... [14:32:22] PROBLEM - DPKG on mw2225 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:32:23] PROBLEM - DPKG on mw2227 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:33:11] PROBLEM - DPKG on mw2216 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:33:11] PROBLEM - DPKG on mw2217 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:33:11] PROBLEM - DPKG on mw2231 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:33:22] PROBLEM - DPKG on mw2221 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:33:22] PROBLEM - DPKG on mw2218 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:33:22] PROBLEM - DPKG on mw2228 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:33:22] PROBLEM - DPKG on mw2219 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:33:22] PROBLEM - DPKG on mw2215 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:33:23] PROBLEM - DPKG on mw2223 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:33:32] PROBLEM - DPKG on mw1269 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:34:02] PROBLEM - DPKG on mw2230 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:34:04] mobrovac: I'll have a look [14:34:07] thnx [14:34:13] PROBLEM - DPKG on mw2220 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:34:21] PROBLEM - DPKG on mw1267 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:34:31] PROBLEM - DPKG on mw2226 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:34:52] PROBLEM - DPKG on mw1271 is CRITICAL: DPKG 
CRITICAL dpkg reports broken packages [14:34:52] PROBLEM - DPKG on mw2222 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:35:02] PROBLEM - DPKG on mw1262 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:35:09] looking into the dpkg errors [14:35:42] RECOVERY - Apache HTTP on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.012 second response time [14:35:43] (03PS3) 10BBlack: [WIP] - move app_directors logic to puppet parser func [puppet] - 10https://gerrit.wikimedia.org/r/294495 (https://phabricator.wikimedia.org/T110717) [14:35:52] PROBLEM - DPKG on mw2224 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:35:52] PROBLEM - DPKG on mw2229 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:36:12] PROBLEM - puppet last run on mw2232 is CRITICAL: CRITICAL: Puppet has 1 failures [14:36:22] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). 
[14:36:32] PROBLEM - DPKG on mw1261 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:37:01] PROBLEM - DPKG on mw1266 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:37:01] PROBLEM - puppet last run on mw1291 is CRITICAL: CRITICAL: Puppet has 1 failures [14:37:02] PROBLEM - DPKG on mw2232 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:37:21] PROBLEM - DPKG on mw1291 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:37:22] PROBLEM - DPKG on mw1263 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:38:01] PROBLEM - DPKG on mw1268 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:38:02] RECOVERY - DPKG on mw2225 is OK: All packages OK [14:38:02] PROBLEM - puppet last run on mw2215 is CRITICAL: CRITICAL: Puppet has 1 failures [14:38:11] RECOVERY - DPKG on mw2224 is OK: All packages OK [14:38:41] PROBLEM - DPKG on mw1265 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:38:41] PROBLEM - puppet last run on mw2226 is CRITICAL: CRITICAL: Puppet has 1 failures [14:39:01] PROBLEM - DPKG on mw1270 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:39:41] RECOVERY - configured eth on mw1277 is OK: OK - interfaces up [14:39:42] RECOVERY - DPKG on mw2223 is OK: All packages OK [14:40:01] RECOVERY - DPKG on mw1269 is OK: All packages OK [14:40:12] PROBLEM - puppet last run on mw2230 is CRITICAL: CRITICAL: Puppet has 1 failures [14:40:21] RECOVERY - Check size of conntrack table on mw1277 is OK: OK: nf_conntrack is 0 % full [14:40:22] PROBLEM - DPKG on mw1264 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:40:22] RECOVERY - DPKG on mw2229 is OK: All packages OK [14:40:42] RECOVERY - nutcracker process on mw1277 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [14:40:42] RECOVERY - Disk space on mw1277 is OK: DISK OK [14:40:42] RECOVERY - DPKG on mw1277 is OK: All packages OK [14:40:53] RECOVERY - salt-minion processes on mw1277 is OK: PROCS OK: 1 process 
with regex args ^/usr/bin/python /usr/bin/salt-minion [14:40:53] RECOVERY - DPKG on mw1265 is OK: All packages OK [14:40:53] PROBLEM - puppet last run on mw2216 is CRITICAL: CRITICAL: Puppet has 1 failures [14:41:22] RECOVERY - MD RAID on mw1277 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [14:41:35] (03CR) 10Danny B.: git.wikimedia.org -> Diffusion redirects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [14:41:52] PROBLEM - puppet last run on mw2218 is CRITICAL: CRITICAL: Puppet has 1 failures [14:41:53] RECOVERY - DPKG on mw2228 is OK: All packages OK [14:42:11] PROBLEM - puppet last run on mw1268 is CRITICAL: CRITICAL: Puppet has 1 failures [14:42:33] RECOVERY - DPKG on mw2227 is OK: All packages OK [14:43:22] RECOVERY - DPKG on mw2226 is OK: All packages OK [14:43:52] RECOVERY - DPKG on mw1266 is OK: All packages OK [14:43:53] RECOVERY - DPKG on mw2231 is OK: All packages OK [14:43:53] PROBLEM - puppet last run on mw2222 is CRITICAL: CRITICAL: Puppet has 1 failures [14:44:04] 06Operations, 06Discovery, 06Maps, 03Discovery-Maps-Sprint, 13Patch-For-Review: Review alerting scheme for Maps - https://phabricator.wikimedia.org/T137869#2382635 (10Gehel) LVS endpoint [[ https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=kartotherian.svc.codfw.wmnet |are checked ]]. The `chec... 
[14:44:12] RECOVERY - DPKG on mw2221 is OK: All packages OK [14:44:42] RECOVERY - DPKG on mw1268 is OK: All packages OK [14:44:52] RECOVERY - DPKG on mw1264 is OK: All packages OK [14:45:12] RECOVERY - DPKG on mw2230 is OK: All packages OK [14:45:32] RECOVERY - DPKG on mw1267 is OK: All packages OK [14:46:21] RECOVERY - DPKG on mw1262 is OK: All packages OK [14:46:22] RECOVERY - DPKG on mw1263 is OK: All packages OK [14:46:35] (03CR) 10Danny B.: [C: 04-1] git.wikimedia.org -> Diffusion redirects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [14:46:52] PROBLEM - puppet last run on mw1265 is CRITICAL: CRITICAL: Puppet has 1 failures [14:47:12] PROBLEM - puppet last run on mw2220 is CRITICAL: CRITICAL: Puppet has 1 failures [14:47:18] (03PS4) 10BBlack: [WIP] - move app_directors logic to puppet parser func [puppet] - 10https://gerrit.wikimedia.org/r/294495 (https://phabricator.wikimedia.org/T110717) [14:47:32] PROBLEM - puppet last run on mw2221 is CRITICAL: CRITICAL: Puppet has 1 failures [14:47:40] (03PS1) 10Mobrovac: Revert "Revert "Change Prop: Disable transclusion update rules"" [puppet] - 10https://gerrit.wikimedia.org/r/294500 [14:47:52] RECOVERY - DPKG on mw1261 is OK: All packages OK [14:48:02] PROBLEM - puppet last run on mw1271 is CRITICAL: CRITICAL: Puppet has 1 failures [14:48:21] PROBLEM - puppet last run on mw1266 is CRITICAL: CRITICAL: Puppet has 1 failures [14:49:45] (03CR) 10Ppchelko: [C: 031] "Ohhhh...." 
[puppet] - 10https://gerrit.wikimedia.org/r/294500 (owner: 10Mobrovac) [14:50:13] RECOVERY - DPKG on mw1270 is OK: All packages OK [14:50:30] (03CR) 10Giuseppe Lavagetto: [C: 032] Revert "Revert "Change Prop: Disable transclusion update rules"" [puppet] - 10https://gerrit.wikimedia.org/r/294500 (owner: 10Mobrovac) [14:50:32] RECOVERY - DPKG on mw1271 is OK: All packages OK [14:50:52] RECOVERY - DPKG on mw1291 is OK: All packages OK [14:51:02] RECOVERY - DPKG on mw2215 is OK: All packages OK [14:51:15] (03PS5) 10BBlack: [WIP] - move app_directors logic to puppet parser func [puppet] - 10https://gerrit.wikimedia.org/r/294495 (https://phabricator.wikimedia.org/T110717) [14:52:53] RECOVERY - DPKG on mw2216 is OK: All packages OK [14:52:53] RECOVERY - DPKG on mw2217 is OK: All packages OK [14:53:13] RECOVERY - DPKG on mw2218 is OK: All packages OK [14:53:21] RECOVERY - DPKG on mw2219 is OK: All packages OK [14:54:33] RECOVERY - DPKG on mw2220 is OK: All packages OK [14:55:14] RECOVERY - DPKG on mw2222 is OK: All packages OK [14:55:14] RECOVERY - DPKG on mw2232 is OK: All packages OK [15:00:04] anomie, ostriches, thcipriani, marktraceur, and Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160615T1500). Please do the needful. [15:00:05] Urbanecm, Amir1, and tgr: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[15:00:05] !log rebooting Eqiad Event Bus for kernel upgrades (one node at the time) [15:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:00:13] o/ [15:00:18] around [15:00:18] o/ [15:00:31] !log scb disabled puppet for stopped change-prop during kafka nodes upgrade [15:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:00:46] !log elukey@palladium conftool action : set/pooled=no; selector: kafka1001.eqiad.wmnet [15:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:01:15] I can SWAT today [15:01:31] RECOVERY - puppet last run on mw2232 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:01:58] 06Operations, 10Phabricator, 06Project-Admins, 06Triagers: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#2382650 (10Tobi_WMDE_SW) @aklapper if possible, could you have a look at @lea_wmde's request in T706#2364612? 
[15:02:01] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:02:04] (03PS2) 10Thcipriani: Remove old throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293970 (owner: 10Urbanecm) [15:02:19] thcipriani, ping me when finished if possible, please [15:02:23] RECOVERY - puppet last run on mw1291 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [15:02:27] jynus: will do [15:02:30] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293970 (owner: 10Urbanecm) [15:02:31] RECOVERY - puppet last run on mw1277 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:02:32] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:02:47] known ^ [15:03:04] (03Merged) 10jenkins-bot: Remove old throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293970 (owner: 10Urbanecm) [15:03:22] RECOVERY - puppet last run on mw2215 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:04:01] RECOVERY - puppet last run on mw2226 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:05:22] RECOVERY - puppet last run on mw2230 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:06:12] RECOVERY - puppet last run on mw2216 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:06:24] (03PS1) 10Jcrespo: Repool db1023; pool db1085 (disabled), db1088, db1092 w/low weight [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/294502 (https://phabricator.wikimedia.org/T133398) [15:06:36] (03PS2) 10Filippo Giunchedi: lvs: add icinga config for ms-fe in eqiad/codfw [puppet] - 10https://gerrit.wikimedia.org/r/294490 [15:06:39] !log thcipriani@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:293970|Remove old throttle rules]] (duration: 00m 30s) [15:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:06:44] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "PCC: https://puppet-compiler.wmflabs.org/3118/" [puppet] - 10https://gerrit.wikimedia.org/r/294490 (owner: 10Filippo Giunchedi) [15:06:49] ^ Urbanecm sync'd thank you for the cleanup [15:07:00] Thanks for deploying [15:07:11] RECOVERY - puppet last run on mw2218 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:07:22] RECOVERY - puppet last run on mw1268 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [15:08:01] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [15:08:23] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [15:08:43] Amir1: I'm going to come back to your changes once zuul has merged the change to wmf.5 (since that takes a few minutes) [15:08:46] (03PS1) 10Gehel: Add interactive-team to default Icinga notification group for maps servers [puppet] - 10https://gerrit.wikimedia.org/r/294503 (https://phabricator.wikimedia.org/T137869) [15:08:55] thcipriani: sure, thanks :) [15:09:58] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293701 (owner: 10Gergő Tisza) [15:10:07] (03CR) 10Gehel: "@jgirault: do you want to also be notified? Is there anyone else who should be in the interactive-team?" 
[puppet] - 10https://gerrit.wikimedia.org/r/294498 (https://phabricator.wikimedia.org/T137869) (owner: 10Gehel) [15:10:10] !log elukey@palladium conftool action : set/pooled=yes; selector: kafka1001.eqiad.wmnet [15:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:10:20] (03Merged) 10jenkins-bot: Fix logging config for authmanager metrics channel rename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293701 (owner: 10Gergő Tisza) [15:11:12] RECOVERY - puppet last run on mw2222 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:11:51] (03PS1) 10Elukey: Add mw1277 to the MW scap DSH list. [puppet] - 10https://gerrit.wikimedia.org/r/294504 [15:11:52] RECOVERY - puppet last run on mw1265 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:11:55] !log thcipriani@tin Synchronized wmf-config/logging.php: SWAT: [[gerrit:293701|Fix logging config for authmanager metrics channel rename]] (duration: 00m 24s) [15:11:58] tgr: ^ [15:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:13] RECOVERY - mediawiki-installation DSH group on mw1276 is OK: OK [15:12:39] (03PS2) 10Thcipriani: Send authentication events to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294462 (owner: 10Gergő Tisza) [15:13:02] RECOVERY - puppet last run on mw1271 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:13:16] (03CR) 10Elukey: [C: 032] Add mw1277 to the MW scap DSH list. 
[puppet] - 10https://gerrit.wikimedia.org/r/294504 (owner: 10Elukey) [15:13:22] RECOVERY - puppet last run on mw1266 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:14:25] thcipriani: works [15:14:32] RECOVERY - puppet last run on mw2220 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:48] tgr: cool, thanks for checking [15:14:51] RECOVERY - puppet last run on mw2221 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:52] RECOVERY - mediawiki-installation DSH group on mw1275 is OK: OK [15:15:03] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294462 (owner: 10Gergő Tisza) [15:15:16] (03PS1) 10Muehlenhoff: update TODO [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/294505 [15:15:35] (03CR) 10Muehlenhoff: [C: 032 V: 032] update TODO [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/294505 (owner: 10Muehlenhoff) [15:15:37] (03Merged) 10jenkins-bot: Send authentication events to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294462 (owner: 10Gergő Tisza) [15:15:38] !log elukey@palladium conftool action : set/pooled=no; selector: kafka1002.eqiad.wmnet [15:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:16:25] afk for a min [15:17:22] back [15:17:41] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:294462|Send authentication events to logstash]] (duration: 00m 28s) [15:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:17:48] ^ tgr check please [15:18:03] (03PS3) 10Thcipriani: Deploy ORES beta feature in wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294488 (https://phabricator.wikimedia.org/T130212) (owner: 10Ladsgroup) [15:18:49] thcipriani: that works too. Thanks! [15:19:01] tgr: great, thanks! [15:19:31] thcipriani: is the wmf.5 patch deployed? 
because we need to first deploy it and then deploy the config patch :) [15:19:49] Amir1: doing the wmf.5 patch now [15:20:04] thanks :) [15:23:04] !log thcipriani@tin Synchronized php-1.28.0-wmf.5/extensions/ORES: SWAT: [[gerrit:294491|Skip when an edit is errored in PopulateDatabase.php]] (duration: 00m 27s) [15:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:24] !log elukey@palladium conftool action : set/pooled=yes; selector: kafka1002.eqiad.wmnet [15:23:24] Amir1: ^ sync'd, anything to test there? [15:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:33] thcipriani: nope [15:23:39] kk [15:24:01] so wikidatawiki, remind me of the order of operations for maintenance scripts? [15:24:19] thcipriani: first we need to make tables [15:24:24] 1- ores_model [15:24:31] 2- ores_classification [15:24:46] https://github.com/wikimedia/mediawiki-extensions-ORES/tree/master/sql [15:25:19] Hi. [15:25:36] Urbanecm: if I've well understood the situation, ko. did a NEW discussion and gathered more support than the first time? [15:28:28] Amir1: ok, done. 
[15:28:42] thcipriani: awesome [15:29:00] now: mwscript extensions/ORES/maintenance/CheckModelVersions.php [15:29:06] Amir1: then sync config change, then it's...checkmodelversions then populatedatabase [15:29:15] yup [15:29:21] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [15:29:29] kk, going with config change [15:29:43] awesome, Tell me when you want to do SWAT (for the next deployments :D) [15:29:47] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294488 (https://phabricator.wikimedia.org/T130212) (owner: 10Ladsgroup) [15:30:27] (03Merged) 10jenkins-bot: Deploy ORES beta feature in wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294488 (https://phabricator.wikimedia.org/T130212) (owner: 10Ladsgroup) [15:31:01] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [15:31:08] for wikidatawiki, if the populatedatabase doesn't take a very long time, run it twice [15:33:38] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:294488|Deploy ORES beta feature in wikidatawiki]] (duration: 00m 24s) [15:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:25] Amir1: ran CheckModelVersions fine, PopulateDatabase is running now. [15:35:36] 06Operations, 06Discovery, 06Services, 03Discovery-Maps-Sprint: Allow configuration of contact groups for monitoring of services - https://phabricator.wikimedia.org/T137891#2382756 (10Gehel) [15:35:36] nice [15:35:38] thanks :) [15:36:10] Lydia is sitting next to me [15:36:14] waiting :D [15:36:22] Dereckson: Well, the first discussion was started in 2016 (https://ko.wikipedia.org/wiki/%EC%9C%84%ED%82%A4%EB%B0%B1%EA%B3%BC%ED%86%A0%EB%A1%A0:%EC%9C%84%ED%82%A4%EB%B0%B1%EA%B3%BC_%EC%A0%95%EB%B9%84%EB%8B%A8/%EB%B3%B4%EC%A1%B42#.EC.83.88_.EB.AC.B8.EC.84.9C_.EB.AA.A9.EB.A1.9D_.EA.B2.80.ED.86.A0_.EA.B8.B0.EB.8A.A5 ). 
Then ko did (in 2012) a "technical review", see https://ko.wikipedia.org/wiki/%EC% [15:36:24] 9C%84%ED%82%A4%EB%B0%B1%EA%B3%BC%ED%86%A0%EB%A1%A0:%EC%82%AC%EC%9A%A9%EC%9E%90_%EA%B6%8C%ED%95%9C#.EC.9E.90.EB.8F.99_.EA.B2.80.ED.86.A0_.EA.B8.B0.EB.8A.A5. [15:36:26] In May 2016 kowiki did a final voting (https://ko.wikipedia.org/wiki/%EC%9C%84%ED%82%A4%EB%B0%B1%EA%B3%BC%ED%86%A0%EB%A1%A0:%EC%83%88_%EB%AC%B8%EC%84%9C_%EC%A0%90%EA%B2%80#.EC.A0.90.EA.B2.80_.EA.B8.B0.EB.8A.A5_.EB.8F.84.EC.9E.85 ). In this voting 11 users agree with autopatrolled group and one disagree with autopatrolled group. [15:36:30] Kowiki has about 100 very active users (100+ edits per month) and if 12 participated, I think that we can accept this as consensus. [15:36:30] Anyway, autopatrol group will affect patrollers only (I dont mean a group but users which makes pages as patrolled manually now), so they wont have to patrol any new article from trusted users manually. I dont know how many users do this in ko but in cswiki (similar in size) this is on about five users. [15:36:31] If youll send a email to martin.urbanec@wikimedia.cz, I can forward email from author of the request if you want. [15:37:07] (03PS1) 10Gehel: Add the ability to configure contact group for check of services. [puppet] - 10https://gerrit.wikimedia.org/r/294507 (https://phabricator.wikimedia.org/T137869) [15:37:28] Dereckson: I dont know about any new voting or discussion. [15:37:52] thcipriani: do you know how much is left? 
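The ORES rollout order agreed on in the channel (create the two tables, sync the config change, then CheckModelVersions, then PopulateDatabase) can be written down as a small checklist. This is only a sketch: the step labels and the `follows_rollout_order` helper are hypothetical names made up for illustration, not part of the ORES extension or any deployment tooling.

```python
from typing import List

# Step labels are illustrative; the order is the one stated in the channel.
ROLLOUT_ORDER: List[str] = [
    "create ores_model table",
    "create ores_classification table",
    "sync config change",
    "CheckModelVersions.php",
    "PopulateDatabase.php",
]


def follows_rollout_order(performed: List[str]) -> bool:
    """Return True if every performed step is known and the steps
    appear in the same relative order as ROLLOUT_ORDER.

    Repeating a step (e.g. running PopulateDatabase twice) is allowed.
    """
    positions = []
    for step in performed:
        if step not in ROLLOUT_ORDER:
            return False
        positions.append(ROLLOUT_ORDER.index(step))
    return positions == sorted(positions)


# Running the populate script twice at the end is still in order.
print(follows_rollout_order(ROLLOUT_ORDER + ["PopulateDatabase.php"]))
# Populating before the config is synced is out of order.
print(follows_rollout_order(["PopulateDatabase.php", "sync config change"]))
```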
[15:38:04] are you running the populate script [15:38:42] Amir1: yes it is still running, not sure how much is left, 'Processing 50 revsisions' is what I see [15:39:42] awesome, It would be great if do a "select count(*) from ores_classification" [15:40:42] Urbanecm: we define active users at 5+ contribs per months in our stats [15:40:47] Amir1: 7359 currently [15:40:53] Amir1: just finished [15:41:00] awesome [15:41:01] thanks [15:41:06] 7646 final [15:41:06] I'm testing in real time [15:41:19] have you run it once or twice? [15:41:26] just once [15:42:14] okay, everything looks good but it's a little bit slow [15:42:19] https://www.wikidata.org/w/index.php?title=Special:RecentChanges&hidenondamaging=1 [15:42:22] Dereckson: I know that active user is one who has 5+ contribs per month but there was something like very active user. This very active user is one who has 100+ contribs per month. And I was talking about very active users. [15:42:40] I'm not sure if that would be fixed or I need to work on performance [15:43:31] Urbanecm: I see my comment contained "less than ten persons seem to have participated", so yes, they gathered more support, and apparently no objection aftewards [15:43:45] hi folks. still swatting? is there time left for one more patch? [15:44:05] So do you think that we can deploy my patch? Will you change your -1? [15:44:19] Amir1: anything else needed for SWAT? [15:44:21] (03CR) 10Gehel: [C: 04-1] "Waiting for alerting to be sorted out." [puppet] - 10https://gerrit.wikimedia.org/r/294390 (https://phabricator.wikimedia.org/T137848) (owner: 10MaxSem) [15:44:30] MatmaRex: I am still swatting, what do you have in mind? [15:44:51] Yes, it was why I asked you more details. To see if we can deploy that. [15:44:52] thcipriani: nope, I will let you know [15:44:59] Amir1: ack, thank you! [15:45:14] thcipriani: cherry-pick of https://gerrit.wikimedia.org/r/294349 to the latest branch to fix https://phabricator.wikimedia.org/T137535 . 
i though it made it, but it didn't [15:45:20] Apparently, their discussion isn't hidden and 3 months to say "no we don't want patrol" is enough, okay we can proceed [15:45:48] (03PS1) 10Andrew Bogott: Move labspuppetbackend to port 8100; open firewall [puppet] - 10https://gerrit.wikimedia.org/r/294510 (https://phabricator.wikimedia.org/T133412) [15:45:56] Ok. Thanks. [15:46:19] MatmaRex: sure, we can do that. Do you have the cherry-pick or should I make one? [15:47:07] (03CR) 10Dereckson: [C: 031] "Per discussion between Urbanecm and the author, and then a discussion on #wikimedia-operations, it seems this consensus is enough." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294254 (https://phabricator.wikimedia.org/T130808) (owner: 10Urbanecm) [15:47:14] (03PS2) 10Dereckson: Add autopatrolled group in kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294254 (https://phabricator.wikimedia.org/T130808) (owner: 10Urbanecm) [15:47:18] thcipriani: no, i just realized it's not fixed. is everyone else's swat done? [15:47:30] if not, i'll just add it to the end of the list [15:47:35] MatmaRex: Urbanecm wants perhaps to add 294254 [15:47:40] MatmaRex: yup all other SWAT is done. [15:47:56] :o neat. [15:48:04] (03CR) 10Andrew Bogott: [C: 032] Move labspuppetbackend to port 8100; open firewall [puppet] - 10https://gerrit.wikimedia.org/r/294510 (https://phabricator.wikimedia.org/T133412) (owner: 10Andrew Bogott) [15:48:17] thcipriani: wmf.6 is https://gerrit.wikimedia.org/r/#/c/294511/ [15:48:23] i'll update the Deployments page for paper trail [15:48:51] MatmaRex: could you add * [config] {{Gerrit|294254}} Add autopatrolled group in kowiki ({{phabT|130808}}) too by the same occasion? [15:49:06] If adding of my patch is possible... [15:49:40] 06Operations, 10Phabricator, 06Project-Admins, 06Triagers: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#2382834 (10Aklapper) Ah sorry. 
@Lea_WMDE: I've added you. [15:49:46] Urbanecm: sure :) [15:49:46] sure [15:49:52] thanks MatmaRex [15:50:13] Thanks :) . [15:50:32] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [15:50:46] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294254 (https://phabricator.wikimedia.org/T130808) (owner: 10Urbanecm) [15:51:32] (03Merged) 10jenkins-bot: Add autopatrolled group in kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294254 (https://phabricator.wikimedia.org/T130808) (owner: 10Urbanecm) [15:52:48] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2382844 (10mmodell) >>! In T137224#2381927, @Joe wrote: > @20after4 do you think we're at the point where we can inst... [15:52:58] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:294254|Add autopatrolled group in kowiki]] (duration: 00m 24s) [15:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:53:03] ^ Urbanecm check please [15:54:21] In Special:UserGroupRights there is "0 (0)". I dont know what this means. [15:55:19] 06Operations, 10Phabricator, 06Project-Admins, 06Triagers: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#2382846 (10Tobi_WMDE_SW) thanks @aklapper! [15:55:44] 06Operations, 10ops-codfw: install2001 hardware troubles - https://phabricator.wikimedia.org/T137647#2382847 (10Papaul) @Faidon cable replacement complete [15:55:49] It's in Czech and English translation of the page (using ?uselang=). Ping: Thcipriani [15:56:26] yeah, I see it, I've not seen that either :\ [15:56:38] And when you add ?uselang=qqx ? 
[15:57:22] (listgrouprights-right-display: (right-0), 0) [15:57:32] 06Operations, 10Ops-Access-Requests, 06Parsing-Team, 06Services: Allow the Services team to administer the Parsoid cluster - https://phabricator.wikimedia.org/T137879#2382855 (10RobH) 05Open>03stalled a:03RobH Since this is a sudo/admin request, it will require approval in the weekly operations meeti... [15:59:59] Best guess is there isn't any right in the autopatrolled group [16:00:09] 06Operations, 10Traffic, 13Patch-For-Review, 05codfw-rollout: Varnish support for active:active backend services - https://phabricator.wikimedia.org/T134404#2264674 (10GWicke) >>! In T134404#2382315, @BBlack wrote: > Where `restbase.svc.wmnet` is defined in gdnsd and uses the closest underlying service end... [16:00:17] !log thcipriani@tin Synchronized php-1.28.0-wmf.6/resources/src/mediawiki.special/mediawiki.special.search.styles.css: SWAT: [[gerrit:294511|Explicitly specify the width of the search input on Special:Search]] (duration: 00m 25s) [16:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:00:22] jynus: hey, if you're around we need a DBA advice on a performance issue [16:00:23] ^ MatmaRex check please [16:00:40] I saw no lag created by it, I am not worried [16:00:42] (03PS1) 10Andrew Bogott: Have labspuppetbackend listen on an external IP rather than 127.0.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/294513 (https://bugzilla.wikimedia.org/294510) [16:00:43] ah [16:00:46] 06Operations, 10Ops-Access-Requests: Allow *-admin groups to see systemd logs for their units - https://phabricator.wikimedia.org/T137878#2382882 (10RobH) 05Open>03stalled a:03RobH The scope of this is fairly large, but the request/change itself is a single permission (ability to read that log file) for... [16:00:50] ah? 
[16:00:51] thcipriani: thanks, looks good on mw.org [16:00:51] I've understood the issue [16:00:52] * Dereckson prepares a fix [16:01:00] jynus: It will create very soon [16:01:09] it will create lag? [16:01:16] I think so [16:01:16] What's wrong with my patch? [16:01:21] how? [16:02:18] 06Operations, 10DBA, 06Labs, 10Tool-Labs, 10Traffic: Antigng-bot improper non-api http requests - https://phabricator.wikimedia.org/T137707#2382892 (10Antigng_) >>! In T137707#2380275, @jcrespo wrote: > BTW, the API is definitely faster, one just need to use it efficiently: > > > ``` > $ time curl 'htt... [16:03:35] (03PS1) 10Dereckson: Fix autopatrolled group for ko.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294514 (https://phabricator.wikimedia.org/T130808) [16:03:37] Urbanecm: look this fix ^ [16:03:51] You added a string in the array. [16:04:01] This string is a value, without any key associated [16:04:02] 06Operations, 10cassandra, 10hardware-requests: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2382911 (10RobH) a:03RobH I think that covers the basics! I'll create a #procurement task as a blocker to this, and gather some pricing info. [16:04:06] so it receives the key 0 [16:04:22] your patch was the same than 0 => 'autopatrolled' [16:04:33] Oh. Thanks for the fix. [16:04:50] 06Operations, 10ops-codfw: ms-be2012.codfw.wmnet: slot=10 dev=sdk failed - https://phabricator.wikimedia.org/T135975#2382916 (10RobH) So the task should be for 5 disks. That will put 2 into immediate use and 3 on the shelf. 
[16:04:51] as 'autopatrolled' is a non-zero, non-empty string, PHP interpreted that as true, when converted to a boolean [16:05:00] so it accepted the right, which was the key 0 [16:05:14] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294514 (https://phabricator.wikimedia.org/T130808) (owner: 10Dereckson) [16:05:32] Dereckson: thank you for the quick fix [16:05:32] 06Operations, 10ops-codfw: ms-be2012.codfw.wmnet: slot=10 dev=sdk failed - https://phabricator.wikimedia.org/T135975#2382917 (10RobH) [16:05:37] and that's why the array on the special page tried to offer a right "0" [16:05:40] You're welcome. [16:05:45] 06Operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=4 dev=sde failed - https://phabricator.wikimedia.org/T137785#2382919 (10RobH) [16:05:49] (03Merged) 10jenkins-bot: Fix autopatrolled group for ko.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294514 (https://phabricator.wikimedia.org/T130808) (owner: 10Dereckson) [16:05:55] Which doesn't exist :) [16:06:10] (03PS1) 10Andrew Bogott: Added dummy password for labspuppetbackend::mysql_password: [labs/private] - 10https://gerrit.wikimedia.org/r/294516 [16:06:17] PHP is a language allowing you to convert apples into pears [16:06:26] (03CR) 10Andrew Bogott: [C: 032 V: 032] Added dummy password for labspuppetbackend::mysql_password: [labs/private] - 10https://gerrit.wikimedia.org/r/294516 (owner: 10Andrew Bogott) [16:06:34] Sometimes, that creates this kind of strange issue. [16:06:55] 06Operations, 10DBA, 06Labs, 10Tool-Labs, 10Traffic: Antigng-bot improper non-api http requests - https://phabricator.wikimedia.org/T137707#2382924 (10Antigng_) >>! In T137707#2379997, @jcrespo wrote: > For the API part, I would like to add that API infrastructure (application servers and databases) is s... [16:07:21] Thanks for your explanation. 
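The explanation above (a bare string in a PHP array gets the integer key 0, and a non-empty string counts as true) can be imitated in Python. This is a simplified emulation, not MediaWiki code: `php_array` and `granted_rights` are hypothetical helpers, and the `autopatrol` right used for the "fixed" case is an assumption about what the follow-up patch set, since the log does not show the patch contents.

```python
from typing import Any, Dict, List, Union

Key = Union[int, str]


def php_array(*values: Any, **keyed: Any) -> Dict[Key, Any]:
    """Loosely emulate a PHP array literal.

    Bare values receive the integer keys 0, 1, 2, ... while keyed
    entries keep their string key (simplified: real PHP continues
    numbering from the highest integer key seen so far).
    """
    result: Dict[Key, Any] = {}
    for index, value in enumerate(values):
        result[index] = value
    result.update(keyed)
    return result


def granted_rights(group: Dict[Key, Any]) -> List[Key]:
    """Treat each key as a right name and each truthy value as 'granted'."""
    return [right for right, allowed in group.items() if allowed]


# Bare string: key 0, value 'autopatrolled' (truthy), so right "0" is granted.
broken = php_array("autopatrolled")
# Keyed entry: the right name is the key and the value is the boolean.
fixed = php_array(autopatrol=True)

print(granted_rights(broken))  # [0]
print(granted_rights(fixed))   # ['autopatrol']
```

The first result matches the "0 (0)" entry reported on the special page: the only "right" in the group was the auto-assigned key.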
[16:07:53] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:294514|Fix autopatrolled group for ko.wikipedia]] (duration: 00m 31s) [16:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:08:17] Now it works :) [16:08:46] Looks fixed! [16:09:12] jynus: SWAT is complete [16:09:49] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2382935 (10Paladox) @Danny_B I think we could redirect git.wikimedia.org today and improve the urls as needed. [16:10:11] thcipriani, thanks, I will deploy a quick query [16:10:44] (03PS1) 10Filippo Giunchedi: swift: add systemd unit file for proxy-server [puppet] - 10https://gerrit.wikimedia.org/r/294517 (https://phabricator.wikimedia.org/T117972) [16:11:12] 06Operations, 10ops-codfw, 10ops-eqiad, 10vm-requests: eqiad/codfw: 1 VM request for prometheus - https://phabricator.wikimedia.org/T136313#2382941 (10fgiunchedi) [16:11:56] not query [16:11:59] patch [16:12:03] too much SQL [16:12:07] is too much [16:13:11] 06Operations, 10vm-requests: eqiad/codfw: 1 VM request for prometheus - https://phabricator.wikimedia.org/T136313#2382946 (10RobH) [16:14:15] :D [16:14:27] (03PS1) 10Andrew Bogott: Add labspuppetbackend::mysql_password to a few more files [labs/private] - 10https://gerrit.wikimedia.org/r/294518 [16:14:42] (03CR) 10Andrew Bogott: [C: 032 V: 032] Add labspuppetbackend::mysql_password to a few more files [labs/private] - 10https://gerrit.wikimedia.org/r/294518 (owner: 10Andrew Bogott) [16:15:48] (03PS2) 10Jcrespo: Repool db1023; pool db1085 (disabled), db1088, db1092 w/low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294502 (https://phabricator.wikimedia.org/T133398) [16:16:44] (03CR) 10Andrew Bogott: [C: 032] Have labspuppetbackend listen on an external IP rather than 127.0.0.1 [puppet] - 
10https://gerrit.wikimedia.org/r/294513 (https://bugzilla.wikimedia.org/294510) (owner: 10Andrew Bogott) [16:18:39] (03CR) 10Jcrespo: [C: 032] Repool db1023; pool db1085 (disabled), db1088, db1092 w/low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294502 (https://phabricator.wikimedia.org/T133398) (owner: 10Jcrespo) [16:22:24] (03CR) 10Alex Monk: "what's that bug number?" [puppet] - 10https://gerrit.wikimedia.org/r/294513 (https://bugzilla.wikimedia.org/294510) (owner: 10Andrew Bogott) [16:23:11] (03CR) 10Andrew Bogott: "hm, copy/paste fail I guess" [puppet] - 10https://gerrit.wikimedia.org/r/294513 (https://bugzilla.wikimedia.org/294510) (owner: 10Andrew Bogott) [16:23:54] 06Operations, 10ops-codfw, 10Phabricator: setup phab2001.codfw.wmnet (WMF6405) - https://phabricator.wikimedia.org/T137838#2382973 (10Papaul) ge-5/0/14 [16:24:26] 06Operations, 10ops-codfw, 10Phabricator: setup phab2001.codfw.wmnet (WMF6405) - https://phabricator.wikimedia.org/T137838#2382974 (10Papaul) [16:24:47] 06Operations, 10ops-codfw, 10Phabricator: setup phab2001.codfw.wmnet (WMF6405) - https://phabricator.wikimedia.org/T137838#2380670 (10Papaul) a:05Papaul>03RobH [16:24:59] (03PS1) 10Jcrespo: Small lint fix on database configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294519 [16:25:22] (03PS2) 10Jcrespo: Small lint fix on database configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294519 [16:27:13] (03CR) 10Jcrespo: [C: 032] Small lint fix on database configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294519 (owner: 10Jcrespo) [16:32:07] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1023; pool db1085 (disabled), db1088, db1092 w/low weight (duration: 00m 25s) [16:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:38:08] twentyafterfour: Could you explain exactly what you did here: https://gerrit.wikimedia.org/r/#/c/294515/. 
This step is not documented in the documentation for how to deploy a new extension (https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment#Install_a_new_extension_on_a_wiki). [16:39:10] or if you'd prefer to update that documentation yourself, that would be cool too :) [16:40:32] kaldari: looking [16:41:37] kaldari: https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment#Deploy_to_beta_cluster_on_Labs ... [16:42:51] ah, missed that. [16:43:08] twentyafterfour: Why is that on mediawiki.org? [16:43:16] I'll move it [16:43:20] I'm not sure why that's on mediawiki.org [16:46:42] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 711 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5221566 keys - replication_delay is 711 [16:47:45] twentyafterfour: what's the actual syntax that you used to add the submodule? git submodule add ... [16:49:02] kaldari: git submodule add -b . --name RevisionSlider https://gerrit.wikimedia.org/r/p/mediawiki/extensions/RevisionSlider.git [16:49:12] thanks! [16:53:22] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5218396 keys - replication_delay is 0 [17:01:52] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [17:13:05] 06Operations, 07Puppet, 13Patch-For-Review: Reconsider the aligning arrows puppet lint - https://phabricator.wikimedia.org/T137763#2383177 (10akosiaris) >>! In T137763#2380627, @mmodell wrote: > Like @bblack, my main issue with alignment is the way puppet-lint forces me to re-indent a whole section just to a... 
[17:15:12] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [17:17:32] 06Operations, 10Ops-Access-Requests: Allow *-admin groups to see systemd logs for their units - https://phabricator.wikimedia.org/T137878#2383198 (10mobrovac) [17:18:33] 06Operations, 10Ops-Access-Requests: Allow *-admin groups to see systemd logs for their units - https://phabricator.wikimedia.org/T137878#2382319 (10mobrovac) I added apertium and OCG to the list. Zotero is not relevant as we'll never switch it to Jessie. For the rest, I really don't know if that would be useful. [17:18:51] 06Operations, 10Ops-Access-Requests: Allow *-admin groups to see systemd logs for their units - https://phabricator.wikimedia.org/T137878#2383200 (10mobrovac) [17:20:10] (03CR) 10Gehel: [C: 04-1] "interactive-team probably does not want all notifications, but only those they can do something about..." [puppet] - 10https://gerrit.wikimedia.org/r/294503 (https://phabricator.wikimedia.org/T137869) (owner: 10Gehel) [17:21:00] 06Operations, 06Services, 07Tracking: Move Node.JS services to Jessie and Node 4 (tracking) - https://phabricator.wikimedia.org/T124989#2383205 (10ssastry) [17:21:02] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2383204 (10ssastry) 05Open>03stalled [17:23:05] (03CR) 10Yurik: "Gehel, agree, we only want the stuff that we can act on - e.g. pertaining to the kartotherian and tilerator services. 
If the traffic to ma" [puppet] - 10https://gerrit.wikimedia.org/r/294503 (https://phabricator.wikimedia.org/T137869) (owner: 10Gehel) [17:51:51] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [17:55:33] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2383279 (10mmodell) [17:55:35] 06Operations, 07Blocked-on-RelEng, 05Gitblit-Deprecate, 13Patch-For-Review: Phase out antimony.wikimedia.org (git.wikimedia.org / gitblit) - https://phabricator.wikimedia.org/T123718#2383278 (10mmodell) 05stalled>03Open [18:00:04] yurik, gehel, and thcipriani: Dear anthropoid, the time has come. Please deploy Scap3 Service Migration (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160615T1800). [18:00:04] yurik: A patch you scheduled for Scap3 Service Migration is about to be deployed. Please be available during the process. [18:00:17] here [18:00:21] i scheduled something? [18:00:29] * yurik looks [18:01:10] thcipriani, i think either something got copied without cleanup, or i miss-scheduled [18:01:11] (03CR) 1020after4: git.wikimedia.org -> Diffusion redirects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [18:01:25] do you want to do graphoid? [18:01:50] yurik: heh, yeah, I think this got copied from last week. [18:02:35] thcipriani, yurik do you need me? I did not have this one either... [18:03:14] yurik: if you've got the scap directory patch for graphoid, I made the puppet patch. 
[18:03:32] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.786 second response time [18:03:58] o/ [18:05:14] (03PS1) 10Hashar: group1 wikis to 1.28.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294527 [18:05:42] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 15.758 second response time [18:11:54] 06Operations, 06Discovery, 06Maps, 03Maps-Sprint, 13Patch-For-Review: Review alerting scheme for Maps - https://phabricator.wikimedia.org/T137869#2383330 (10Yurik) @gehel, what does `admins,sms,admins` mean? I think the first 4 items are relevant to @maxsem and myself - Cassandra not working, Kartotheri... [18:12:16] (03PS7) 10MaxSem: Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 (owner: 10Gehel) [18:17:31] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:20:14] (03PS8) 10MaxSem: Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 (owner: 10Gehel) [18:26:53] (03PS1) 10RobH: changing phab2001 partman use [puppet] - 10https://gerrit.wikimedia.org/r/294529 [18:27:57] (03PS3) 10BBlack: VCL: prevent inter-cache loop bugs with X-DCPath [puppet] - 10https://gerrit.wikimedia.org/r/294478 (https://phabricator.wikimedia.org/T134404) [18:28:19] (03CR) 10BBlack: [C: 032 V: 032] VCL: prevent inter-cache loop bugs with X-DCPath [puppet] - 10https://gerrit.wikimedia.org/r/294478 (https://phabricator.wikimedia.org/T134404) (owner: 10BBlack) [18:28:44] (03PS2) 10RobH: changing phab2001 partman use [puppet] - 10https://gerrit.wikimedia.org/r/294529 [18:28:51] (03CR) 10RobH: [C: 032 V: 032] changing phab2001 partman use [puppet] - 10https://gerrit.wikimedia.org/r/294529 (owner: 10RobH) 
[18:30:32] gerrit's acting very slow [18:31:41] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2383434 (10Papaul) a:05Papaul>03jcrespo BIOS compete on both systems. Both systems are now running version 2.1.6. @ori System is up back, it is all yours. Thanks [18:34:40] (03PS1) 10BBlack: VCL: protect loop protection from restarts [puppet] - 10https://gerrit.wikimedia.org/r/294530 (https://phabricator.wikimedia.org/T134404) [18:36:06] (03CR) 10BBlack: [C: 032] VCL: protect loop protection from restarts [puppet] - 10https://gerrit.wikimedia.org/r/294530 (https://phabricator.wikimedia.org/T134404) (owner: 10BBlack) [18:38:02] (03PS9) 10MaxSem: Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 (owner: 10Gehel) [18:39:11] thanks papaul [18:43:05] 06Operations, 10Traffic, 10Wikimedia-Stream, 07HTTPS: stream.wikimedia.org doesn't redirect to HTTPS - https://phabricator.wikimedia.org/T137915#2383513 (10BBlack) [18:43:32] 06Operations, 10Traffic: Switch port 80 to nginx on primary clusters - https://phabricator.wikimedia.org/T107236#2383531 (10BBlack) [18:43:34] 06Operations, 10Traffic, 10Wikimedia-Stream, 07HTTPS: stream.wikimedia.org doesn't redirect to HTTPS - https://phabricator.wikimedia.org/T137915#2383532 (10BBlack) [18:45:31] 06Operations, 06Performance-Team, 10Traffic, 10Wikimedia-Stream, 13Patch-For-Review: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2383541 (10ori) p:05Normal>03High [18:47:13] (03PS1) 10BBlack: r::c::instances: frontends also listen on :3127 [puppet] - 10https://gerrit.wikimedia.org/r/294532 (https://phabricator.wikimedia.org/T107236) [18:47:31] (03CR) 10BBlack: [C: 032 V: 032] r::c::instances: frontends also listen on :3127 [puppet] - 10https://gerrit.wikimedia.org/r/294532 (https://phabricator.wikimedia.org/T107236) (owner: 10BBlack) [18:50:24] 06Operations, 
10Traffic, 13Patch-For-Review: Switch port 80 to nginx on primary clusters - https://phabricator.wikimedia.org/T107236#2383564 (10BBlack) [18:54:35] 06Operations, 10Traffic: Clean up DNS/redirects for TLS - https://phabricator.wikimedia.org/T102824#2383571 (10BBlack) [18:54:46] !log Started MySQL on es2019 (T130702) [18:54:47] T130702: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702 [18:54:49] jouncebot: poke [18:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:54:54] jouncebot: next [18:54:54] In 0 hour(s) and 5 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160615T1900) [18:55:35] (03PS2) 10Hashar: group1 wikis to 1.28.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294527 (https://phabricator.wikimedia.org/T136971) [18:56:28] 06Operations, 10Ops-Access-Requests: Allow *-admin groups to see systemd logs for their units - https://phabricator.wikimedia.org/T137878#2383578 (10RobH) @mobrovac: Thank you for the feedback! We'll go with your appended list unless others can point out where the other admin services will migrate to jessie a... [18:57:23] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:58:53] well [18:59:06] I am most probably going to hold the train [18:59:27] because of some nasty stacktrace that appeared with 1.28.0-wmf.6 [19:00:05] hashar: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160615T1900). Please do the needful. 
[19:02:44] definitely holding [19:03:50] bah no found some in wmf.5 [19:03:59] (03CR) 10Hashar: [C: 032] group1 wikis to 1.28.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294527 (https://phabricator.wikimedia.org/T136971) (owner: 10Hashar) [19:04:31] (03Merged) 10jenkins-bot: group1 wikis to 1.28.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294527 (https://phabricator.wikimedia.org/T136971) (owner: 10Hashar) [19:05:21] !log hashar@tin rebuilt wikiversions.php and synchronized wikiversions files: (no message) [19:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:07:19] :) [19:07:54] not much happening [19:08:12] despite some wikidataw links update jobs failling to acquire a lock [19:08:31] nothing new [19:11:05] slow query: SELECT /* CategoryMembershipChangeJob::run */ GET_LOCK('CategoryMembershipUpdates:XXXXXXXXXXXXXXX', 10) AS lockstatus [19:11:44] !log rolling restart of global varnish frontends (salt -b 1: depool -> sleep 15 -> restart -> repool) - estimated ~30 mins to completion - T107236 [19:11:45] T107236: Switch port 80 to nginx on primary clusters - https://phabricator.wikimedia.org/T107236 [19:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:12:01] Hashar seen that [19:12:21] that is in wmf.5 / wikidata related apparently [19:12:25] I think we can have those not run for wikidata [19:12:37] For items [19:12:47] Since they can't have categories [19:12:48] I have no clue what that job is doing, much less what GET_LOCK is for :D [19:13:44] I know what the job does (for category watchlist ) [19:14:24] not sure why exaxtly it's having problem except maybe there are so many changes [19:14:48] Especially on wikidata / s5 and they are not batches enough or such [19:15:46] !log varnish frontend restart halted - v4 compat issue to address :P [19:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:18:52] 
PROBLEM - Varnish HTTP misc-frontend - port 80 on cp1058 is CRITICAL: Connection refused [19:19:03] PROBLEM - Varnishkafka log producer on cp1059 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [19:19:40] (03PS1) 10BBlack: varnish: systemd unit varnish4.1 compat for "-a" [puppet] - 10https://gerrit.wikimedia.org/r/294537 (https://phabricator.wikimedia.org/T107236) [19:20:02] PROBLEM - Varnishkafka log producer on cp1058 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [19:20:12] PROBLEM - Varnish HTTP maps-frontend - port 80 on cp1059 is CRITICAL: Connection refused [19:20:15] (03CR) 10BBlack: [C: 032 V: 032] varnish: systemd unit varnish4.1 compat for "-a" [puppet] - 10https://gerrit.wikimedia.org/r/294537 (https://phabricator.wikimedia.org/T107236) (owner: 10BBlack) [19:21:02] audephone: and looks like the SELECT GET_LOCK() are just to poll for the lock status every 10 seconds [19:21:08] (03PS1) 10Urbanecm: Temporary IP Cap Lift on es.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294538 (https://phabricator.wikimedia.org/T137917) [19:21:27] audephone: or with a timeout of 10 seconds. While HHVM reports the query being slow after 10 secs :D [19:21:36] ah, okay [19:21:45] (03CR) 10jenkins-bot: [V: 04-1] Temporary IP Cap Lift on es.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294538 (https://phabricator.wikimedia.org/T137917) (owner: 10Urbanecm) [19:22:03] RECOVERY - Varnishkafka log producer on cp1058 is OK: PROCS OK: 1 process with command name varnishkafka [19:22:21] RECOVERY - Varnish HTTP maps-frontend - port 80 on cp1059 is OK: HTTP OK: HTTP/1.1 200 OK - 298 bytes in 0.001 second response time [19:22:29] audephone: turns out it is already filled as https://phabricator.wikimedia.org/T133801 ! 
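The GET_LOCK behaviour discussed above (wait up to 10 seconds for a named lock, so a contended call looks like a slow query) can be illustrated with a plain in-process lock. This is only an analogy under stated assumptions: `get_lock` is a hypothetical helper, it uses Python's `threading.Lock` rather than MySQL, and the timeout is shortened to 0.2 seconds so the example finishes quickly. The 1/0 return values mirror MySQL's documented GET_LOCK results for success and timeout.

```python
import threading
import time


def get_lock(lock: threading.Lock, timeout: float) -> int:
    """Return 1 if the lock was acquired within `timeout` seconds, else 0."""
    return 1 if lock.acquire(timeout=timeout) else 0


lock = threading.Lock()
lock.acquire()  # simulate another job already holding the lock

start = time.monotonic()
status = get_lock(lock, timeout=0.2)  # blocks for roughly the full timeout
waited = time.monotonic() - start

print(status)  # 0: timed out, the lock is still held elsewhere
print(f"waited about {waited:.2f}s before giving up")

lock.release()  # the other job finishes
retry_status = get_lock(lock, timeout=0.2)
print(retry_status)  # 1: acquired immediately this time
```

The point of the sketch is that the caller spends the whole timeout waiting when the lock is contended, which is why a query with a 10 second timeout shows up in a slow-query report without the database itself being slow.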
[19:23:01] RECOVERY - Varnish HTTP misc-frontend - port 80 on cp1058 is OK: HTTP OK: HTTP/1.1 200 OK - 303 bytes in 0.025 second response time [19:23:12] RECOVERY - Varnishkafka log producer on cp1059 is OK: PROCS OK: 1 process with command name varnishkafka [19:23:36] (03PS2) 10Urbanecm: Temporary IP Cap Lift on es.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294538 (https://phabricator.wikimedia.org/T137917) [19:23:37] ok [19:24:13] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [19:24:15] (03CR) 10jenkins-bot: [V: 04-1] Temporary IP Cap Lift on es.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294538 (https://phabricator.wikimedia.org/T137917) (owner: 10Urbanecm) [19:25:13] !log rolling restart of global varnish frontends (salt -b 1: depool -> sleep 15 -> restart -> repool) - estimated ~35 mins to completion - T107236 (...._ [19:25:14] T107236: Switch port 80 to nginx on primary clusters - https://phabricator.wikimedia.org/T107236 [19:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:25:51] (03PS3) 10Urbanecm: Temporary IP Cap Lift on es.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294538 (https://phabricator.wikimedia.org/T137917) [19:30:51] Hi, could anybody take care about https://gerrit.wikimedia.org/r/#/c/294538/ ? This must be deployed till this Friday and I can't be available during Evening SWATs and tomorrow SWATs. [19:35:38] Urbanecm: can potentially add it to the deploy calendar on https://wikitech.wikimedia.org/wiki/Deployments#Wednesday.2C.C2.A0June.C2.A015 ? 
[19:35:58] Urbanecm: eventually with a note you are not available at that time (evening swat is late for Europe for sure) [19:36:36] Urbanecm: $wgThrottlingExceptions are usually straightforward and I am sure they can just get merged [19:39:22] Hasar: If you can take care of it in today's Evening SWAT, I have no problem with it. Should I add it to your IRC nick? [19:39:24] And I don't know if throttle rules can just get merged, I add it to the SWAT calendar and I don't know what SWATters do with them :). But if you're sure, maybe you can +2 it if you can. [19:39:49] Urbanecm: I am in Europe myself [19:40:04] * aude grumbles [19:40:14] but yeah I can just do it :D [19:40:16] looks like Special:Nearby is broken on wikidata [19:40:42] not terrible enough to revert though, but should fix before tomorrow or asap [19:40:44] aude: can it be related to 1.28.0-wmf.6 [19:40:48] could be [19:41:15] * aude gets a js error [19:41:39] copy paste to task and let's poke #wikimedia-mobile [19:41:45] ok [19:41:46] I guess they would know [19:41:56] like something not defined in resource loader [19:41:59] and either backport a patch or craft one on the spot [19:42:02] Hasar: Oh. I thought that you knew that I'm in Europe :). [19:42:52] Hashar: Correct ping [19:43:11] https://phabricator.wikimedia.org/T137919 [19:43:14] * aude goes to poke them [19:45:00] Urbanecm: a nitpick: can you fix indentation to use tabs and drop the trailing whitespace in https://gerrit.wikimedia.org/r/#/c/294538/3/wmf-config/throttle.php,cm ? [19:47:51] Urbanecm: or I can just do it [19:52:04] Yes, I'm going to do it. [19:52:40] 06Operations: ffmpeg/libav on jessie video scalers - https://phabricator.wikimedia.org/T137886#2383719 (10brion) Agreed; the packages on Jessie are likely also out of date both for ffmpeg2theora and libtheora. libtheora 1.2.0 is pretty stable with a hugely improved encoder, but never got an official blessed tar... 
[19:54:22] (03PS4) 10Urbanecm: Temporary IP Cap Lift on es.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294538 (https://phabricator.wikimedia.org/T137917) [19:55:53] Urbanecm: thanks I will deploy it in a fe [19:55:53] w [19:56:10] fe is what? [19:56:38] (03PS1) 10Papaul: DNS: Add mgm tDNS entries for ms-be2022 to ms-be2027 Bug:T136630 [dns] - 10https://gerrit.wikimedia.org/r/294543 (https://phabricator.wikimedia.org/T136630) [20:00:04] gwicke, cscott, arlolra, subbu, bearND, and mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160615T2000). Please do the needful. [20:00:36] Urbanecm: in a few. Sorry [20:00:56] Hashar: in a few what? Moments? [20:01:12] gwicke: cscott subbu bearND mdholloway: deployment of MW to group1 is done. So you can proceed with whatever service deploy if any [20:01:37] Urbanecm: yeah. In a few minutes. I am finishing a few other things then I will push your change to prod sites [20:01:39] ok .. i'll start parsoid deploy soonish. [20:01:56] Hashar: Ok. 
Thanks [20:02:11] !log starting parsoid deploy [20:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:02:24] (03PS5) 10Hashar: Temporary IP Cap Lift on es.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294538 (https://phabricator.wikimedia.org/T137917) (owner: 10Urbanecm) [20:03:24] (03PS2) 10Papaul: DNS: Add mgm tDNS entries for ms-be2022 to ms-be2027 Bug:T136630 [dns] - 10https://gerrit.wikimedia.org/r/294543 (https://phabricator.wikimedia.org/T136630) [20:04:25] No mobileapps deploy today [20:04:37] !log synced new code; restarted parsoid on wtp1001 as a canary [20:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:05:47] (03PS6) 10Hashar: Temporary IP Cap Lift on es.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294538 (https://phabricator.wikimedia.org/T137917) (owner: 10Urbanecm) [20:05:49] !log cache frontend restarts complete [20:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:06:35] (03CR) 10Hashar: "I have rebased the change and removed some trailing white spaces." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294538 (https://phabricator.wikimedia.org/T137917) (owner: 10Urbanecm) [20:09:03] !log finished deploying parsoid sha 3445eceb [20:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:14:14] subbu: :) [20:14:17] (03CR) 10Hashar: [C: 032] Temporary IP Cap Lift on es.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294538 (https://phabricator.wikimedia.org/T137917) (owner: 10Urbanecm) [20:14:28] hashar, ? 
:) [20:15:04] (03Merged) 10jenkins-bot: Temporary IP Cap Lift on es.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294538 (https://phabricator.wikimedia.org/T137917) (owner: 10Urbanecm) [20:15:52] subbu: let me give you the long translation : "congratulations subbu on having bumped Parsoid using the state of the art deployment process currently in use at WMF. I am quite happy to see Parsoid being bumped and apparently being a quality release with no side effect" [20:15:59] subbu: or in short: :) [20:16:37] ha ha .. ok. :) although we aren't yet on scap3 ;) [20:17:12] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [20:17:22] subbu: oh scap3 is not yet the state of the art. It is still at the step " bright new trendy future" :D [20:17:24] !log hashar@tin Synchronized wmf-config/throttle.php: Temporary IP Cap Lift on es.wiki T137917 (duration: 00m 30s) [20:17:25] T137917: Temporary IP Cap Lift on es.wiki - https://phabricator.wikimedia.org/T137917 [20:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:17:50] subbu: it will come for sure. There are dozens of repos to be migrated and it is not entirely trivial to migrate to [20:18:02] Urbanecm: I have deployed the change kudos [20:18:09] hashar, got it. :) [20:18:56] 06Operations, 06Discovery, 06Maps, 03Maps-Sprint, 13Patch-For-Review: Review alerting scheme for Maps - https://phabricator.wikimedia.org/T137869#2383811 (10Gehel) @Yurik `admins,sms,admins` are the groups to which the alerts are sent (I have no idea why `admins` is there twice). `admins` is the ops grou... 
[20:25:38] 06Operations, 10ops-codfw, 10Phabricator: setup phab2001.codfw.wmnet (WMF6405) - https://phabricator.wikimedia.org/T137838#2383822 (10RobH) [20:30:22] PROBLEM - puppet last run on mw1134 is CRITICAL: CRITICAL: Puppet has 7 failures [20:32:32] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [20:32:53] (03CR) 10Gehel: "interactive team members have been added to private repo" [puppet] - 10https://gerrit.wikimedia.org/r/294498 (https://phabricator.wikimedia.org/T137869) (owner: 10Gehel) [20:33:48] 06Operations, 10ops-codfw, 10Phabricator: setup phab2001.codfw.wmnet (WMF6405) - https://phabricator.wikimedia.org/T137838#2383834 (10RobH) a:05RobH>03mmodell I'm assigning this to Mukunda for his service implementation. He may want to resolve this task outright, since I'm under the impression this may... [20:34:04] 06Operations, 10ops-codfw, 10Phabricator: setup phab2001.codfw.wmnet (WMF6405) - https://phabricator.wikimedia.org/T137838#2383837 (10RobH) [20:54:32] PROBLEM - HHVM rendering on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:55:12] PROBLEM - Apache HTTP on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:56:31] PROBLEM - configured eth on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:56:41] PROBLEM - HHVM processes on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:56:42] PROBLEM - dhclient process on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:57:02] PROBLEM - nutcracker port on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:57:12] PROBLEM - salt-minion processes on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:57:13] PROBLEM - SSH on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:57:22] PROBLEM - nutcracker process on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:57:52] PROBLEM - Check size of conntrack table on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:57:52] PROBLEM - Disk space on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:58:11] PROBLEM - DPKG on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:01:04] mw1134.eqiad.wmnet had some puppet failures half an hour ago and now it is gone :/ [21:04:25] (03CR) 10MaxSem: [C: 031] Adding a "interactive-team" icinga group for alerting. [puppet] - 10https://gerrit.wikimedia.org/r/294498 (https://phabricator.wikimedia.org/T137869) (owner: 10Gehel) [21:06:43] (03PS2) 10Gehel: Adding a "interactive-team" icinga group for alerting. [puppet] - 10https://gerrit.wikimedia.org/r/294498 (https://phabricator.wikimedia.org/T137869) [21:09:50] (03CR) 10Gehel: [C: 032] Adding a "interactive-team" icinga group for alerting. [puppet] - 10https://gerrit.wikimedia.org/r/294498 (https://phabricator.wikimedia.org/T137869) (owner: 10Gehel) [21:15:58] !log Deployed patch for T137264 to wmf.5 and wmf.6 [21:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:19:22] RECOVERY - Disk space on mw1134 is OK: DISK OK [21:19:32] RECOVERY - DPKG on mw1134 is OK: All packages OK [21:20:01] RECOVERY - configured eth on mw1134 is OK: OK - interfaces up [21:20:11] RECOVERY - HHVM processes on mw1134 is OK: PROCS OK: 6 processes with command name hhvm [21:20:11] RECOVERY - dhclient process on mw1134 is OK: PROCS OK: 0 processes with command name dhclient [21:20:31] RECOVERY - nutcracker port on mw1134 is OK: TCP OK - 0.000 second response time on port 11212 [21:20:51] RECOVERY - salt-minion processes on mw1134 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:21:11] (03CR) 10Gehel: "LGTM and trivial enough." 
[puppet] - 10https://gerrit.wikimedia.org/r/294483 (https://phabricator.wikimedia.org/T121756) (owner: 10Muehlenhoff) [21:21:31] RECOVERY - Check size of conntrack table on mw1134 is OK: OK: nf_conntrack is 0 % full [21:22:51] RECOVERY - SSH on mw1134 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [21:22:52] RECOVERY - nutcracker process on mw1134 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [21:24:22] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [21:26:32] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5213389 keys - replication_delay is 0 [21:27:08] !log aaron@tin Synchronized php-1.28.0-wmf.6/includes/libs/objectcache/WANObjectCache.php: faff8f1ef1bfefd1804a3f46e58566711faa3224 (duration: 00m 27s) [21:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:27:31] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:30:27] (03PS2) 10Aaron Schulz: Set "sync" filebackend replication to measure latency effect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293272 [21:30:31] (03CR) 10Aaron Schulz: [C: 032] Set "sync" filebackend replication to measure latency effect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293272 (owner: 10Aaron Schulz) [21:31:13] (03Merged) 10jenkins-bot: Set "sync" filebackend replication to measure latency effect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293272 (owner: 10Aaron Schulz) [21:32:04] !log aaron@tin Synchronized wmf-config/filebackend-production.php: Set "sync" filebackend replication to measure latency effect (duration: 00m 25s) [21:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:40:01] RECOVERY - Check correctness of the icinga configuration on neon is OK: 
Icinga configuration is correct [21:46:23] 06Operations, 10Phabricator, 10Traffic, 10hardware-requests: codfw: (1) phabricator host (backup node) - https://phabricator.wikimedia.org/T131775#2383998 (10mmodell) [21:46:26] 06Operations, 10ops-codfw, 10Phabricator: setup phab2001.codfw.wmnet (WMF6405) - https://phabricator.wikimedia.org/T137838#2383995 (10mmodell) 05Open>03Resolved a:05mmodell>03RobH Thanks @robh. I'm marking this resolved and creating a separate task for deploying phabricator to phab2002 [21:52:42] 06Operations, 10ops-codfw, 10Phabricator: setup phab2001.codfw.wmnet (WMF6405) - https://phabricator.wikimedia.org/T137838#2380670 (10mmodell) [21:59:12] 06Operations, 10ops-codfw, 10Phabricator: setup phab2001.codfw.wmnet (WMF6405) - https://phabricator.wikimedia.org/T137838#2384033 (10mmodell) [22:05:31] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: puppet fail [22:16:42] 06Operations, 10ops-codfw, 10media-storage: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2384083 (10Papaul) @fgiunchedi what partman recipe do you want to use with the new systems ? The systems have 12x3TB SATA and 2x200GB SAS [22:25:27] (03PS10) 10MaxSem: Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 (owner: 10Gehel) [22:31:03] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [22:31:42] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [22:36:26] 06Operations, 10ops-codfw, 10media-storage: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2384126 (10Papaul) [23:00:04] RoanKattouw, ostriches, Krenair, MaxSem, and Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160615T2300). Please do the needful. [23:01:14] nothing to swat [23:04:27] not even a fly? 
[23:11:44] I have something to add, but I can do it myself. [23:25:24] 06Operations, 06Performance-Team, 10Traffic, 10Wikimedia-Stream, 13Patch-For-Review: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2384236 (10ori) >>! In T134871#2382462, @BBlack wrote: > @ori @Krinkle - any thoughts or pointers on getting this tested... [23:27:11] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:31:21] Krenair, matt_flaschen - are you done swatting flies? https://gerrit.wikimedia.org/r/#/c/294647/ [23:31:47] I never even logged in. Matt? [23:32:00] its a minor JS fix :) [23:32:17] No, it only finished merging a minute ago. Sorry for adding it late. Will do it now. [23:32:40] matt_flaschen, can you do mine too pls? [23:32:47] its still merging i think :) [23:33:11] merged [23:33:17] yurik, yeah, you need to backport it though. [23:34:12] matt_flaschen, https://gerrit.wikimedia.org/r/#/c/294648/ [23:34:23] i will add it to the depl window [23:34:27] thanks! [23:35:41] added [23:37:58] !log mattflaschen@tin Synchronized php-1.28.0-wmf.6/extensions/Echo: Sync Echo fix for cross-wiki notifications: 62324e3 (duration: 00m 33s) [23:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:38:21] matt_flaschen, do you want me to +2 it? [23:39:04] yurik, no, I'll +2. You only want it for wmf.6? WP is on wmf.5, so just confirming. [23:39:19] matt_flaschen, i just checked - wv is on 6 [23:39:59] * yurik waits patiently for it to break everything :) [23:40:47] Echo fix confirmed in production. [23:45:14] yurik, +2'ed, but you may need to revisit that. I don't see anything stopping it from getting double-registered. wikipage.content can be fired more than once per page load, e.g. from VisualEditor. 
[23:45:39] jgirault, ^ [23:45:47] matt_flaschen, thx, will revisit [23:46:24] matt_flaschen, will probably deal with it in the next train, unless something ugly surfaces earlier [23:47:49] matt_flaschen: I see… I’ll fix that, though it shouldn’t be a big issue for now. [23:51:51] RECOVERY - Apache HTTP on mw1134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.560 second response time [23:53:31] RECOVERY - HHVM rendering on mw1134 is OK: HTTP OK: HTTP/1.1 200 OK - 66303 bytes in 0.210 second response time [23:54:51] RECOVERY - puppet last run on mw1134 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures