[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161108T0000). [00:00:04] RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:22] RoanKattouw: I'm updating the interwiki map, then I'm done with deployment [00:00:41] There are also patches from AndyRussG [00:01:26] Dereckson: just one!! thx :) yeah just added it on [00:02:10] (just a cherry-pick, we'll put the rest of what's in master on the train) [00:03:10] Well no need for the interwiki map apparently, ieg, grants, etc. don't have one [00:03:37] !log projectcom.wikimedia.org wiki creation done [00:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:18] Now, SWAT. I'll do AndyRussG first. [00:06:06] Dereckson: cool thanks! [00:07:32] So there is one new commit, Handle banner loader errors on client (T149107), and that's live on mw1099 [00:07:32] T149107: CentralNotice: Relay banner loading issues in beacon/impression - https://phabricator.wikimedia.org/T149107 [00:07:32] * AndyRussG feels lucky [00:11:40] Dereckson: looks fine :) [00:11:53] logs looks good too [00:12:43] !log dereckson@tin Synchronized php-1.29.0-wmf.1/extensions/CentralNotice/: Handle banner loader errors on client (T149107) (duration: 00m 49s) [00:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:50] T149107: CentralNotice: Relay banner loading issues in beacon/impression - https://phabricator.wikimedia.org/T149107 [00:13:48] AndyRussG: live in prod [00:13:49] * AndyRussG breaks champaign bottle on the bow of Gerrit [00:13:58] RoanKattouw: ping? [00:15:11] Dereckson: sorry, here now [00:16:23] Dereckson: Both of my config patches can go together [00:16:40] ok [00:17:00] Dereckson: yeah looking fine in all of prod too... thx much!! [00:17:13] * AndyRussG waves [00:17:23] RoanKattouw: please check https://gerrit.wikimedia.org/r/#/c/319968/ there are some comments [00:17:37] AndyRussG: you're welcome [00:17:51] Dereckson: Ugh, OK, drop that patch for now thne [00:17:57] I don't have time to deal with Volker's comments right now :/ [00:20:17] (03CR) 10Catrope: "Re 1: do you know what to do about that? I just used Commons's SVG->PNG rendering. 
I don't know what you're talking about re compression, " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319968 (https://phabricator.wikimedia.org/T147219) (owner: 10Catrope) [00:22:59] (03PS1) 10Chad: static.php: Remove unused $maxage param from wmfStaticShowError() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320321 [00:23:02] (03PS1) 10Chad: static.php: Consolidate error headers in wmfStaticShowError() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320322 [00:23:30] RECOVERY - puppet last run on ms-be1021 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [00:23:40] PROBLEM - HHVM rendering on mw1221 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [00:24:28] * Dereckson adds a throttle rule change to SWAT [00:24:40] RECOVERY - HHVM rendering on mw1221 is OK: HTTP OK: HTTP/1.1 200 OK - 72298 bytes in 0.097 second response time [00:30:38] (03PS1) 10Dereckson: Nashville Science edit-a-thon (Vanderbilt library) throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320323 (https://phabricator.wikimedia.org/T150207) [00:32:39] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320323 (https://phabricator.wikimedia.org/T150207) (owner: 10Dereckson) [00:33:20] (03Merged) 10jenkins-bot: Nashville Science edit-a-thon (Vanderbilt library) throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320323 (https://phabricator.wikimedia.org/T150207) (owner: 10Dereckson) [00:33:55] Works on mw1099 [00:34:00] PROBLEM - puppet last run on db1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:34:46] !log dereckson@tin Synchronized wmf-config/throttle.php: Nashville Science edit-a-thon (Vanderbilt library) (T150207) (duration: 00m 47s) [00:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:34:52] T150207: Women in Science Edit-A-Thon - Nashville, TN - Lift New Account Limit on 2016-11-15 - https://phabricator.wikimedia.org/T150207 [00:35:09] SWAT done. [00:50:59] !log swift eqiad-prod: set weight for ms-be1021 sd[h-n] to 3000 - T139767 [00:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:06] T139767: ms-be1021.eqiad.wmnet: slot=1I:1:2 dev=sdh failed - https://phabricator.wikimedia.org/T139767 [01:00:52] Holy craaaap, lol. [01:01:16] https://commons.wikimedia.org/wiki/Special:TimedMediaHandler <- harej is killing the video scalers. [01:02:00] RECOVERY - puppet last run on db1065 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [01:03:44] I'm not sure if that's killing them [01:03:49] They're being fully utilised, yes [01:04:27] I meant more the 400 or so queued… didn not mean ‘killing’ them literally. [01:05:38] Ops are aware, and will start the process of finding hardware... Whether repurposing spares, or purchasing new as necessary [01:06:08] Yeah, I saw the ticket, was not nagging, just kind of going ‘oh, wow’ [01:06:59] If there's spare kit to be repurposed, it might not take very long [01:07:08] If stuff has to be bought... It might [01:08:44] Reedy: BTW, did you see https://phabricator.wikimedia.org/T150158 ? I know brion said he had a list of bugs to fix at some point [01:09:16] Is it really 43 being transcoded simultaneously? [01:09:30] It’s ‘trying’ to.... [01:09:35] That doesn't feel very efficient [01:09:58] Yeah, I suspect it’s swapping a lot. 
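A quick way to ground the concurrency numbers discussed above is to compare running encoder processes against available cores directly on a scaler. A minimal sketch, assuming shell access and that the encoders show up as ffmpeg/avconv processes; the hostname is a placeholder, not a specific video scaler:

```
# Hypothetical spot-check on a video scaler host (placeholder name).
ssh mw1259.eqiad.wmnet '
  echo "cores:  $(nproc)"
  echo "ffmpeg: $(pgrep -c ffmpeg || true)"
  echo "avconv: $(pgrep -c avconv || true)"
  uptime    # load average well above the core count suggests oversubscription
'
```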
[01:10:16] 06Operations, 10Traffic, 10Wikimedia-Blog, 07HTTPS: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2778480 (10EdErhart-WMF) @BBlack we're working on it now as part of a larger effort to tweak the blog's theme. [01:10:27] (not swapping to disk, I mean, but trying to ‘multitask’ [01:10:47] I’m just glad it’s not trying to do 400 at once, lol. [01:11:15] I bet they'll benefit from upgrading past 14.04 too [01:12:22] The host is definitely busy. Doesn't look to be particularly struggling though [01:12:51] The issue imo is how many end up erroring out. [01:13:23] The question is why they're erroring out [01:13:34] I dunno if it's errors due to load [01:13:37] Or "bad" files [01:13:43] Or "bad" versions of ffmpeg etc [01:13:44] Revent: have you filed a task about that, so people can investigate and easily leave notes? [01:14:36] I think it’s load… I’ve seen ones I poked back on error if I load it too much, and then work if I just ran one or two at a time. [01:15:09] The simultaneous jobs should just be dropped downt hen [01:15:10] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1228 [01:15:35] p858snake|L2: Not seperately about the ones errored out, but it was mentioned at the task about adding capacity. [01:16:01] https://phabricator.wikimedia.org/T150067 [01:16:09] We do seem to be running an old version of ffmpeg too... I guess, relatd to them being 14.04 [01:17:24] https://github.com/wikimedia/operations-puppet/blob/production/hieradata/role/common/mediawiki/videoscaler.yaml [01:17:28] mediawiki::jobrunner::runners_transcode: 5 [01:17:32] nfi what that means [01:19:51] FWIW, from what I’ve noticed it seems that mostly HD files error out. [01:20:10] RECOVERY - check_mysql on lutetium is OK: Uptime: 26209 Threads: 4 Questions: 1114262 Slow queries: 116 Opens: 153628 Flush tables: 2 Open tables: 64 Queries per second avg: 42.514 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 467 [01:20:39] In particular really ‘long’ (like 30 min or more) ones [01:22:26] You should leave these thoughts/observations on relevant tasks [01:25:33] Not really sure if there’s anything relevant open, tbh… don’t want to just make one for ‘generic comments about brokenness in video scaling, lol. [01:25:53] I mean, I could, I guess. 
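For reference, the hiera key quoted above presumably bounds how many transcode runners the jobrunner service keeps busy per host, which is the knob the "limit concurrency" idea would touch. A sketch only; the current value is the one quoted in the log, and the suggested direction is an assumption, not a recommendation:

```
# hieradata/role/common/mediawiki/videoscaler.yaml (current value as quoted above)
grep runners_transcode hieradata/role/common/mediawiki/videoscaler.yaml
# mediawiki::jobrunner::runners_transcode: 5
#
# If the key does cap concurrent transcode runners as its name suggests,
# keeping it at or below the scalers' physical core count would reduce the
# oversubscription suspected above.
```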
[01:27:58] (03PS1) 10Reedy: Add PageViewInfo to beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320333 [01:29:58] (03PS1) 10Reedy: Remove OATHAuth from CommonSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320334 [01:32:09] (03PS2) 10Reedy: Add PageViewInfo to beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320333 (https://phabricator.wikimedia.org/T129602) [01:34:26] (03PS1) 10Andrew Bogott: Add some error handling to wikistatus, and make more thread-safe [puppet] - 10https://gerrit.wikimedia.org/r/320335 [02:15:05] (03PS5) 10Dzahn: tcpircbot: improve firewall rule setup [puppet] - 10https://gerrit.wikimedia.org/r/316497 [02:21:58] 06Operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#2778584 (10Dzahn) a:05Dzahn>03None [02:27:14] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.1) (duration: 09m 52s) [02:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:27:43] 06Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#2778588 (10Dzahn) Josephine of OIT has created a new Google group for us, we agreed on "ops-maintenance@" (ZenDesk (#11955) ) We can enable the shared inbox ourselves following instructions on https://support.google.co... [02:31:30] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Nov 8 02:31:30 UTC 2016 (duration 4m 16s) [02:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:42:11] (03PS1) 10Dzahn: repeat hostname for AAAA record (aluminium,achernar,multatuli) [dns] - 10https://gerrit.wikimedia.org/r/320342 [02:43:39] 06Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#2778594 (10Dzahn) I have invited Ariel, Jeff and Papaul as members and tested mailing the group from my personal address and replying to it. It worked and shows as a "topic" in the group. Also set group welcome message. I... [02:45:00] (03PS2) 10Dzahn: repeat hostname for AAAA (acamar,aluminium,multatuli) [dns] - 10https://gerrit.wikimedia.org/r/320342 [02:46:06] (03CR) 10Dzahn: [C: 032] repeat hostname for AAAA (acamar,aluminium,multatuli) [dns] - 10https://gerrit.wikimedia.org/r/320342 (owner: 10Dzahn) [02:48:43] (03Abandoned) 10Dzahn: wikimedia.org: repeat hostname on each line for multi records [dns] - 10https://gerrit.wikimedia.org/r/304155 (owner: 10Dzahn) [02:51:22] (03CR) 10Dzahn: Add puppet-lint to Rakefile / Gemfile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/288620 (owner: 10Hashar) [02:52:12] (03CR) 10Dzahn: Move config for git-ssh(phabricator) to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/318662 (https://phabricator.wikimedia.org/T143363) (owner: 1020after4) [02:54:14] 06Operations, 10Monitoring: Huge log files on icinga machines - https://phabricator.wikimedia.org/T150061#2778611 (10Dzahn) Is it possible to re-enable puppet for a single run so that "gallium" gets removed from it (i ran node deactivate earlier today). [03:20:20] PROBLEM - puppet last run on cp1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:22:40] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 819.92 seconds [03:23:40] 06Operations, 10Monitoring: Huge log files on icinga machines - https://phabricator.wikimedia.org/T150061#2778617 (10Peachey88) [03:24:51] anyone around to do a labs-config sync? 
[03:25:11] i wouldn't want to do it if noone else is around :)\ [03:30:57] yurik: Reedy maybe, if you ask nicely [03:31:04] :) [03:31:12] Reedy should be deep asleep by now :) [03:34:09] (03PS1) 10Yurik: LABS: added beta.wmflabs.org to graphs config [puppet] - 10https://gerrit.wikimedia.org/r/320343 [03:35:02] (03PS1) 10Yurik: LABS: added beta.wmflabs.org to graphs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320344 [03:36:40] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 276.89 seconds [03:39:23] yurik: i just relieased it been much longer than I thought when I last saw him say something [03:39:47] heh, tis ok, i can wait... in pain, but wait :) [03:49:20] RECOVERY - puppet last run on cp1073 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [04:10:32] p858snake|L2: video scalers are ‘still’ working on that huge pile of uploads, lol. [04:11:18] Almost through them though, which is not bad at all given how many it was… [04:11:47] I think only one errored out, suprisingly. [04:11:58] (tho they were not the huge files) [04:13:40] PROBLEM - HHVM rendering on mw1197 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [04:14:40] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 72300 bytes in 0.469 second response time [06:29:40] PROBLEM - Disk space on logstash1003 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%) [06:30:20] PROBLEM - Disk space on logstash1002 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%) [06:30:30] PROBLEM - Disk space on logstash1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%) [07:02:51] <_joe_> again? [07:02:56] <_joe_> jesus [07:03:40] PROBLEM - puppet last run on logstash1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:05:30] RECOVERY - Disk space on logstash1001 is OK: DISK OK [07:05:37] <_joe_> so the problem is [07:05:51] <_joe_> the 20 GB logfiles don't get rotated correctly [07:07:20] RECOVERY - Disk space on logstash1002 is OK: DISK OK [07:09:40] RECOVERY - Disk space on logstash1003 is OK: DISK OK [07:10:19] <_joe_> !log stopped logstash, removed large logfiles that were erroneously non-rotated, started logstash across the logstash cluster [07:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:22] <_joe_> copytruncate is the source of all evil ^^ [07:28:08] (03PS1) 10Marostegui: db-eqiad.php: Depool db1059 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320346 (https://phabricator.wikimedia.org/T149079) [07:30:03] !log Deploy schema change s5 dewiki.revision on codfw master (db2023) - T148967 [07:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:09] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967 [07:31:40] RECOVERY - puppet last run on logstash1003 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [07:34:11] 06Operations, 06Discovery, 06Maps, 06WMF-Legal, 03Interactive-Sprint: Define tile usage policy - https://phabricator.wikimedia.org/T141815#2778729 (10grin) Another sidenote: this decision should have a good visibility to the people planning server resources. And I try to ask around MapQuest what traffic... 
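Regarding the copytruncate complaint above: with copytruncate, logrotate copies the (possibly 20 GB) file and then truncates it in place, which doubles disk usage during rotation and can drop lines written between the copy and the truncate. The usual alternative is size-based rotation by rename plus asking the daemon to reopen its log. A minimal sketch, assuming the service reopens its logfile on reload; the path, size and service name are illustrative:

```
# Sketch only: size-based rotation without copytruncate (run as root).
cat > /etc/logrotate.d/logstash-example <<'EOF'
/var/log/logstash/*.log {
    size 100M
    rotate 5
    compress
    delaycompress
    missingok
    notifempty
    postrotate
        systemctl reload logstash >/dev/null 2>&1 || true
    endscript
}
EOF
```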
[07:43:01] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/316497 (owner: 10Dzahn) [08:00:10] 06Operations, 10ops-codfw, 10DBA: db2034 crashes meta ticket - https://phabricator.wikimedia.org/T150233#2778754 (10Marostegui) [08:00:38] 06Operations, 10ops-codfw, 10DBA: db2034 crashes meta ticket - https://phabricator.wikimedia.org/T150233#2778772 (10Marostegui) [08:00:40] 06Operations, 10ops-codfw: db2034 host crashed; mgmt interface unavailable (needs reset and hw check) - https://phabricator.wikimedia.org/T117858#2778773 (10Marostegui) [08:01:09] 06Operations, 10ops-codfw, 10DBA: db2034 crashes meta ticket - https://phabricator.wikimedia.org/T150233#2778754 (10Marostegui) [08:01:12] 06Operations, 10ops-codfw, 10DBA: db2034 crash - https://phabricator.wikimedia.org/T137084#2356666 (10Marostegui) [08:01:34] 06Operations, 10ops-codfw, 10DBA: db2034 crashes meta ticket - https://phabricator.wikimedia.org/T150233#2778754 (10Marostegui) [08:01:53] I am done with the spam :-) [08:04:25] !log rebooting stat1001 for kernel upgrades (will cause a brief unavail for analytics websites) [08:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:59] !log rolling reboot of parsoid in eqiad for kernel update [08:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:58] Out of the earlier video scaler ‘overload’ of 400 + files, it looks like the only one that failed was Moscow_Ring_Railway_full_trip_-_view_from_ES2G_train.webm at 1080P and 720P… it’s 1h22m long, at 1920x1080. I’m wondering the the cause of files erroring out is simply that the scalers can’t complete some tasks ‘in time’ if the scaler gets overloaded and starts trying to split the CPU between tasks. [08:21:34] !log rolling reboot of swift backend servers in esams for kernel update [08:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:18] Since transcoding is a ‘single cpu’ task, limiting the umber at transcodes ‘running’ to the number of CPUs (and adding capacity) would seem to be the answer. [08:22:50] *not 400+ files, 400+ transcodes [08:32:20] PROBLEM - Apache HTTP on mw1233 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [08:32:30] <_joe_> Revent: not really single-cpu, but something like that, yes [08:32:30] PROBLEM - HHVM rendering on mw1233 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [08:33:12] _joe_: Yea, hyperthreading… does not really help with such a task. [08:33:20] RECOVERY - Apache HTTP on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.021 second response time [08:33:30] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 72215 bytes in 0.127 second response time [08:34:48] _joe_: Point being… more power (ofc), less concurrent tasks allowed per CPU, and (maybe) a longer timeout [08:36:10] It seems like ‘long’ transcodes… 1080p, and hour+… seem to be very sensitive to the scalers going over 50%. [08:36:37] <_joe_> Revent: the issue, as p858snake|L2 was saying yesterday, is that the jobrunner thinks jobs are completed before they really are [08:36:50] <_joe_> so whatever config tuning we might do won't work [08:37:21] _joe_: I am not a tech guy, I’m slightly clueful and kibitizing on what I see. [08:37:34] Just… have been actually watching it. 
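To make the "roughly single-CPU" point above concrete: a VP8/WebM encode is largely serial per output file, so wall-clock time scales with the input's length, and running several encodes per core stretches each one proportionally. An illustrative invocation only, not the actual TimedMediaHandler command line:

```
# Illustrative 720p WebM transcode; flags are generic ffmpeg, not TMH's.
ffmpeg -i input.webm \
    -c:v libvpx -b:v 2M -threads 2 \
    -c:a libvorbis \
    -vf scale=-2:720 \
    output.720p.webm
# With N such jobs sharing a core, each takes roughly N times as long, which
# is one plausible way long 1080p jobs could blow past a runner timeout.
```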
[08:37:44] (03PS1) 10Arseny1992: Enable translation memory of Translate for frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320352 (https://phabricator.wikimedia.org/T150146) [08:38:05] _joe_: Also… https://phabricator.wikimedia.org/T150158 [08:38:10] Might be relevant [08:38:14] <_joe_> Revent: yeah I wasn't implying you should be :) I was just stating how it's a software bug and I cannot do much more than ask brion to prioritize looking into it :) [08:40:12] Revent: It would be helpful, if you filed tasks about these issues, we appear to be going around in circles and just repeating the same information to different people [08:40:19] _joe_: But… harej dumed over 400 transcodes of SD files into the queue, it loaded up to 50 or so at a time, and none failed [08:41:28] p858snake|L2: As I said, I’m not clued in enough to file anything other than “generic observations on transcoding issues” [08:41:59] just put exactly what you write in here… [08:44:33] !log Deploy schema change s4 commonswiki.revision table - T147305 [08:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:39] T147305: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305 [08:45:58] p858snake|L2: I was somewhat hoping that one of the people I was commenting to would be able to (having more clue than me) themselves be able to figure out what the various issue are. [08:46:33] Because, frankly, I’ve been quite wrong about several sapects of how it works. [08:46:38] *aspects [08:59:41] p858snake|L2: https://phabricator.wikimedia.org/T150235 <- better? [09:01:20] PROBLEM - HHVM rendering on mw1202 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [09:01:30] PROBLEM - Apache HTTP on mw1202 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [09:02:20] RECOVERY - HHVM rendering on mw1202 is OK: HTTP OK: HTTP/1.1 200 OK - 72215 bytes in 0.131 second response time [09:02:30] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.066 second response time [09:16:11] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Avoid thumbor generating log files > 1GB - https://phabricator.wikimedia.org/T150208#2778863 (10Gilles) [09:18:25] 06Operations, 10ArticlePlaceholder, 10Traffic, 10Wikidata: Performance and caching considerations for article placeholders accesses - https://phabricator.wikimedia.org/T142944#2778869 (10hoo) Heads up: In {T144592} we decided to index exactly 1,000 placeholders on eowiki. All other placeholders will not b... [09:24:10] !log rebooting graphite1002 for kernel update [09:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:30] PROBLEM - puppet last run on rcs1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:42:29] !log rebooting bast2001 for kernel update [09:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:18] 06Operations, 10Prod-Kubernetes, 10Traffic, 05Kubernetes-production-experiment, 13Patch-For-Review: Make our docker registry public - https://phabricator.wikimedia.org/T150168#2778882 (10Joe) 05Open>03Resolved [09:47:53] 06Operations, 10ops-codfw, 10DBA: db2034 crashes meta ticket - https://phabricator.wikimedia.org/T150233#2778883 (10jcrespo) @Marostegui thank you for this work, I know it takes some time [09:48:33] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Install a docker registry for production - https://phabricator.wikimedia.org/T148960#2778884 (10Joe) 05Open>03Resolved [09:48:45] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Install a docker registry for production - https://phabricator.wikimedia.org/T148960#2737849 (10Joe) [09:49:53] !log rebooting install2001 for kernel update [09:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:23] 06Operations, 10Monitoring, 15User-Joe: Huge log files on icinga machines - https://phabricator.wikimedia.org/T150061#2778887 (10Joe) a:03Joe [09:50:58] (03CR) 10Jcrespo: db-eqiad.php: Depool db1059 for maintenance (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320346 (https://phabricator.wikimedia.org/T149079) (owner: 10Marostegui) [09:52:11] (03CR) 10Jcrespo: db-eqiad.php: Depool db1059 for maintenance (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320346 (https://phabricator.wikimedia.org/T149079) (owner: 10Marostegui) [09:53:58] (03PS2) 10Marostegui: db-eqiad.php: Depool db1059 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320346 (https://phabricator.wikimedia.org/T149079) [09:57:10] !log rebooting mw2075 - mw2079 for new kernel [09:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:32] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Depool db1059 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320346 (https://phabricator.wikimedia.org/T149079) (owner: 10Marostegui) [09:59:25] !log rebooting oxygen for kernel update [09:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:40] PROBLEM - Apache HTTP on mw1225 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [10:00:28] (03CR) 10Marostegui: "check" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320346 (https://phabricator.wikimedia.org/T149079) (owner: 10Marostegui) [10:00:40] RECOVERY - Apache HTTP on mw1225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.024 second response time [10:01:31] RECOVERY - puppet last run on rcs1001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [10:05:24] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1059 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320346 (https://phabricator.wikimedia.org/T149079) (owner: 10Marostegui) [10:06:13] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1059 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320346 (https://phabricator.wikimedia.org/T149079) (owner: 10Marostegui) [10:06:47] !log rebooting rhenium for kernel update [10:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:57] 
06Operations, 06Analytics-Kanban, 10Traffic: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2778941 (10ema) The cache_text waitinglist issue should be solved by https://gerrit.wikimedia.org/r/#/c/320259/ [10:08:04] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1059 - T149079 T147305 (duration: 00m 57s) [10:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:10] T147305: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305 [10:08:10] T149079: codfw: Fix S4 commonswiki.templatelinks partitions - https://phabricator.wikimedia.org/T149079 [10:10:42] (03PS1) 10Jcrespo: Pool back db1051 and api servers to high load after hw issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320358 (https://phabricator.wikimedia.org/T149908) [10:10:58] (03PS1) 10Giuseppe Lavagetto: naggen2: order resources in python [puppet] - 10https://gerrit.wikimedia.org/r/320359 (https://phabricator.wikimedia.org/T150061) [10:14:00] PROBLEM - DPKG on restbase2007 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:14:00] PROBLEM - DPKG on restbase2006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:15:00] RECOVERY - DPKG on restbase2007 is OK: All packages OK [10:15:00] RECOVERY - DPKG on restbase2006 is OK: All packages OK [10:20:40] PROBLEM - NTP on mw2075 is CRITICAL: NTP CRITICAL: Offset unknown [10:22:26] !log rebooting hafnium for kernel update [10:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:06] !log restarted ntp on mw2075, stuck in XFAC state [10:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:40] RECOVERY - NTP on mw2075 is OK: NTP OK: Offset 0.1037424505 secs [10:27:20] PROBLEM - HHVM rendering on mw1226 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [10:28:20] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 72361 bytes in 0.089 second response time [10:33:32] !log rebooting tin for kernel update [10:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:30] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:36:38] !log rebooting and upgrading db2012 [10:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:43] !log rebooting mw2086 - mw2089 for new kernel [10:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:41] !log rearmed keyholder on tin [10:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:22] (03PS2) 10Volans: conftool: add --host option [software/conftool] - 10https://gerrit.wikimedia.org/r/318550 (https://phabricator.wikimedia.org/T149213) [10:46:00] (03CR) 10jenkins-bot: [V: 04-1] conftool: add --host option [software/conftool] - 10https://gerrit.wikimedia.org/r/318550 (https://phabricator.wikimedia.org/T149213) (owner: 10Volans) [10:53:43] !log rolling reboot of mw2090 - mw2096 for new kernel [10:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:17] 06Operations, 06Discovery, 06Maps, 10Maps-data, 10hardware-requests: 2 servers for maps-beta cluster - https://phabricator.wikimedia.org/T138600#2779052 (10Gehel) 05Open>03declined We will try first to setup a maps beta cluster on labs VMs, which should not be an issue. I will re-open this task shoul... [10:56:30] 06Operations, 10ArticlePlaceholder, 10Traffic, 10Wikidata: Performance and caching considerations for article placeholders accesses - https://phabricator.wikimedia.org/T142944#2779055 (10BBlack) Nothing was ever resolved here. 30 minutes seems like an arbitrary number with no formal basis or reasoning, an... [10:57:24] (03PS1) 10Elukey: Raise nagios retry_interval to avoid false alarms for HHVM restarts [puppet] - 10https://gerrit.wikimedia.org/r/320361 (https://phabricator.wikimedia.org/T147773) [10:58:42] (03PS3) 10Volans: conftool: add --host option [software/conftool] - 10https://gerrit.wikimedia.org/r/318550 (https://phabricator.wikimedia.org/T149213) [10:59:15] (03CR) 10jenkins-bot: [V: 04-1] conftool: add --host option [software/conftool] - 10https://gerrit.wikimedia.org/r/318550 (https://phabricator.wikimedia.org/T149213) (owner: 10Volans) [11:00:34] (03PS2) 10Elukey: Raise nagios retry_interval to avoid false alarms for HHVM restarts [puppet] - 10https://gerrit.wikimedia.org/r/320361 (https://phabricator.wikimedia.org/T147773) [11:01:13] 06Operations, 10ArticlePlaceholder, 10Traffic, 10Wikidata: Performance and caching considerations for article placeholders accesses - https://phabricator.wikimedia.org/T142944#2779075 (10BBlack) I clicked Submit too soon :) Continuing: We'd expect content to be at minimum a day, if not significantly longe... 
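On the "raise nagios retry_interval" patch above, the mechanism is: once a check goes soft-CRITICAL, Icinga re-checks it every retry_interval minutes up to max_check_attempts before the state hardens and notifies, so a larger retry_interval gives a briefly-restarting HHVM more time to recover silently. The directive names are standard Nagios/Icinga; the values below are illustrative, not the ones in the patch:

```
# Generic Nagios/Icinga service definition excerpt (illustrative values):
#   define service {
#       ...
#       check_interval       1   ; minutes between checks while OK
#       retry_interval       5   ; minutes between re-checks while soft-CRITICAL
#       max_check_attempts   3   ; soft failures before the alert goes hard
#   }
```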
[11:01:22] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Avoid thumbor generating log files > 1GB - https://phabricator.wikimedia.org/T150208#2779076 (10Gilles) [11:01:48] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/320359 (https://phabricator.wikimedia.org/T150061) (owner: 10Giuseppe Lavagetto) [11:02:35] !log Activated cr2-eqiad bgp group IX4 [11:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:30] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [11:05:59] (03PS4) 10Volans: conftool: add --host option [software/conftool] - 10https://gerrit.wikimedia.org/r/318550 (https://phabricator.wikimedia.org/T149213) [11:06:03] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/4563/mw1226.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/320361 (https://phabricator.wikimedia.org/T147773) (owner: 10Elukey) [11:06:19] (03CR) 10Giuseppe Lavagetto: [C: 032] naggen2: order resources in python [puppet] - 10https://gerrit.wikimedia.org/r/320359 (https://phabricator.wikimedia.org/T150061) (owner: 10Giuseppe Lavagetto) [11:06:37] (03CR) 10jenkins-bot: [V: 04-1] conftool: add --host option [software/conftool] - 10https://gerrit.wikimedia.org/r/318550 (https://phabricator.wikimedia.org/T149213) (owner: 10Volans) [11:09:11] 06Operations, 06Analytics-Kanban, 10Traffic: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2722084 (10BBlack) The real problem here is a misunderstanding between varnish shm log's semantics and how we're interpreting that in... [11:09:22] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2779114 (10KartikMistry) Looks good after testing on a different machine(s). [11:10:03] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2779115 (10KartikMistry) [11:10:20] (03PS5) 10Volans: conftool: add --host option [software/conftool] - 10https://gerrit.wikimedia.org/r/318550 (https://phabricator.wikimedia.org/T149213) [11:10:54] (03CR) 10jenkins-bot: [V: 04-1] conftool: add --host option [software/conftool] - 10https://gerrit.wikimedia.org/r/318550 (https://phabricator.wikimedia.org/T149213) (owner: 10Volans) [11:14:30] (03PS6) 10Gehel: Maps - tilerator on all maps servers needs access to postgresql master [puppet] - 10https://gerrit.wikimedia.org/r/319893 (https://phabricator.wikimedia.org/T147223) [11:15:33] !log running schema change on db2070 (pagelinks) T139090 [11:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:39] T139090: Deploy I2b042685 to all databases - https://phabricator.wikimedia.org/T139090 [11:15:50] (03PS6) 10Volans: conftool: add --host option [software/conftool] - 10https://gerrit.wikimedia.org/r/318550 (https://phabricator.wikimedia.org/T149213) [11:17:26] PROBLEM - puppet last run on ms-fe1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:18:11] (03PS7) 10Volans: conftool: add --host option [software/conftool] - 10https://gerrit.wikimedia.org/r/318550 (https://phabricator.wikimedia.org/T149213) [11:19:56] !log rolling restart of mw2080-2085 for new kernel [11:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:55] 06Operations, 10Monitoring, 13Patch-For-Review, 15User-Joe: Huge log files on icinga machines - https://phabricator.wikimedia.org/T150061#2779147 (10Joe) With the current naggen2 update the output files are stabe again. I reenabled puppet on both hosts and it's running without issues. [11:27:00] (03PS4) 10Mark Bergsma: Reflect new FPC3 ports after cr1-/cr2-eqiad FPC5 decommissioning [dns] - 10https://gerrit.wikimedia.org/r/319617 (https://phabricator.wikimedia.org/T149196) [11:27:15] 06Operations, 06Maps, 03Interactive-Sprint: Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939#2779156 (10Gehel) Replication frequency is set to 1 hour on the maps-test cluster. We can see that the server load average and IO peaks every hour and barely has time to go back down... [11:28:23] (03PS8) 10Volans: conftool: add --host option [software/conftool] - 10https://gerrit.wikimedia.org/r/318550 (https://phabricator.wikimedia.org/T149213) [11:29:45] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: add --host option [software/conftool] - 10https://gerrit.wikimedia.org/r/318550 (https://phabricator.wikimedia.org/T149213) (owner: 10Volans) [11:30:16] (03PS1) 10Giuseppe Lavagetto: conftool: add --host option [software/conftool] (0.3.x) - 10https://gerrit.wikimedia.org/r/320367 (https://phabricator.wikimedia.org/T149213) [11:30:54] !log rebooting mira for kernel update [11:30:54] (03CR) 10jenkins-bot: [V: 04-1] Raise nagios retry_interval to avoid false alarms for HHVM restarts [puppet] - 10https://gerrit.wikimedia.org/r/320361 (https://phabricator.wikimedia.org/T147773) (owner: 10Elukey) [11:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:11] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] conftool: add --host option [software/conftool] (0.3.x) - 10https://gerrit.wikimedia.org/r/320367 (https://phabricator.wikimedia.org/T149213) (owner: 10Giuseppe Lavagetto) [11:32:50] (03CR) 10Elukey: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/320361 (https://phabricator.wikimedia.org/T147773) (owner: 10Elukey) [11:35:26] (03CR) 10Giuseppe Lavagetto: [C: 031] Raise nagios retry_interval to avoid false alarms for HHVM restarts [puppet] - 10https://gerrit.wikimedia.org/r/320361 (https://phabricator.wikimedia.org/T147773) (owner: 10Elukey) [11:37:18] !log running schema change on db1045 (pagelinks) T139090 [11:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:25] T139090: Deploy I2b042685 to all databases - https://phabricator.wikimedia.org/T139090 [11:38:07] !log rolling reboot of mw1161, mw1163-1169 for new kernel [11:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:49] !log rearmed keyholder on mira [11:42:50] 06Operations, 10Traffic: Varnish4 is unexpected retrying certain applayer failure cases - https://phabricator.wikimedia.org/T150247#2779196 (10BBlack) [11:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:38] !log rebooting logstash1001 for kernel update [11:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:26] RECOVERY - 
puppet last run on ms-fe1002 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [11:46:25] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2779250 (10mobrovac) The only ones left are the Maps services. @Yurik @Gehel could you test them with Node 6? [11:46:40] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2779251 (10mobrovac) p:05Triage>03Normal [11:46:46] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] [11:49:46] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [11:51:13] 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#2779263 (10elukey) I created https://etherpad.wikimedia.org/p/mc-migration to outline the procedure for the swap. [11:51:45] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2779264 (10Gehel) @mobrovac sure, we'll have a look asap... [11:52:02] (03CR) 10Gehel: [C: 032] Maps - tilerator on all maps servers needs access to postgresql master [puppet] - 10https://gerrit.wikimedia.org/r/319893 (https://phabricator.wikimedia.org/T147223) (owner: 10Gehel) [11:52:16] PROBLEM - puppet last run on analytics1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:52:44] 06Operations, 10Traffic: Varnish4 is unexpectedly retrying certain applayer failure cases - https://phabricator.wikimedia.org/T150247#2779266 (10BBlack) [11:53:48] 06Operations, 10Traffic: Varnish4 is unexpectedly retrying certain applayer failure cases - https://phabricator.wikimedia.org/T150247#2779196 (10BBlack) [11:54:47] !log restart of mw1240, 1253 for new kernel [11:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:51] 06Operations: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#2779277 (10Kelson) @AlexMonk Yes, this is highly probable that we could fix the problem that way. I had created a while ago this request in an attempt to push things this direction #T... 
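For readers following the rolling kernel reboots being logged above, the per-host cycle (spelled out later in this log as "downtime in icinga, depool, reboot, repool") looks roughly like the sketch below. The conftool selector is the simplest form and the hostname a placeholder; the icinga downtime step is omitted:

```
host=mw1240.eqiad.wmnet                        # placeholder
confctl select "name=${host}" set/pooled=no    # depool from the load balancers
ssh "$host" sudo reboot
# wait for the host to come back up and for a clean puppet run, then:
confctl select "name=${host}" set/pooled=yes   # repool
```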
[11:58:46] PROBLEM - NTP on mw1161 is CRITICAL: NTP CRITICAL: Offset unknown [11:59:37] !log rolling restart of mw1170-1216 for new kernel [11:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:32] (03PS1) 10Ema: 4.1.3-1wm3: Add 0005-remove_bad_extrachance_code.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/320371 (https://phabricator.wikimedia.org/T150247) [12:01:46] RECOVERY - NTP on mw1161 is OK: NTP OK: Offset 0.0006526708603 secs [12:02:23] (03PS1) 10Giuseppe Lavagetto: Add coverage reporting to CI [software/conftool] - 10https://gerrit.wikimedia.org/r/320372 [12:05:46] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] [12:06:34] (03CR) 10Marostegui: [C: 031] Pool back db1051 and api servers to high load after hw issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320358 (https://phabricator.wikimedia.org/T149908) (owner: 10Jcrespo) [12:06:37] (03PS2) 10Ema: 4.1.3-1wm3: Add 0005-remove_bad_extrachance_code.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/320371 (https://phabricator.wikimedia.org/T150247) [12:06:41] (03CR) 10Jcrespo: [C: 032] Pool back db1051 and api servers to high load after hw issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320358 (https://phabricator.wikimedia.org/T149908) (owner: 10Jcrespo) [12:07:17] (03PS3) 10Elukey: Raise nagios retry_interval to avoid false alarms for HHVM restarts [puppet] - 10https://gerrit.wikimedia.org/r/320361 (https://phabricator.wikimedia.org/T147773) [12:09:20] (03PS2) 10Giuseppe Lavagetto: Add coverage reporting to CI [software/conftool] - 10https://gerrit.wikimedia.org/r/320372 [12:10:06] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Avoid thumbor generating log files > 1GB - https://phabricator.wikimedia.org/T150208#2779311 (10Gilles) [12:10:32] (03PS1) 10Gilles: Upgrade to 0.1.29 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/320374 (https://phabricator.wikimedia.org/T150208) [12:10:42] (03PS3) 10Ema: 4.1.3-1wm3: Add 0005-remove_bad_extrachance_code.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/320371 (https://phabricator.wikimedia.org/T150247) [12:11:01] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Pool back db1051 and api servers to high load after hw issues (duration: 02m 45s) [12:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:19] ssh: connect to host mw1167.eqiad.wmnet port 22: Connection timed out [12:11:24] reboots? [12:11:35] jynus: yep, see SAL [12:12:31] (03PS2) 10Gilles: Rotate Thumbor 404 log by size, not date [puppet] - 10https://gerrit.wikimedia.org/r/320273 (https://phabricator.wikimedia.org/T150208) [12:12:44] !log depooling/rebooting/repooling scb1001 for kernel update [12:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:12] (03PS3) 10Gilles: Rotate Thumbor 404 log by size, not date [puppet] - 10https://gerrit.wikimedia.org/r/320273 (https://phabricator.wikimedia.org/T150208) [12:13:27] running pull on mw1167 [12:13:30] (03CR) 10Gilles: Rotate Thumbor 404 log by size, not date (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/320273 (https://phabricator.wikimedia.org/T150208) (owner: 10Gilles) [12:15:37] 1167 will be back up momentarily [12:16:01] correction. 
it's already back [12:16:16] difficult to run pull without it being up :-) [12:16:39] well I didn't see your pull comment, I just went to look at my reboots in progress right away :-) [12:17:00] anyways there's only the one series left now, mw1170-1216 [12:17:19] slowly grinding away [12:21:16] RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [12:21:29] (03PS4) 10Ema: 4.1.3-1wm3: Add 0005-remove_bad_extrachance_code.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/320371 (https://phabricator.wikimedia.org/T150247) [12:22:06] PROBLEM - NTP on hafnium is CRITICAL: NTP CRITICAL: Offset unknown [12:22:46] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [12:23:10] !log restarted ntp on hafnium, stuck in XFAC state [12:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:56] PROBLEM - NTP on mw1166 is CRITICAL: NTP CRITICAL: Offset unknown [12:24:48] (03PS3) 10Giuseppe Lavagetto: Add coverage reporting to CI [software/conftool] - 10https://gerrit.wikimedia.org/r/320372 [12:24:50] (03PS1) 10Giuseppe Lavagetto: Add travis build support [software/conftool] - 10https://gerrit.wikimedia.org/r/320375 [12:24:51] !log depooling/rebooting/repooling scb1002 for kernel update [12:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:43] (03CR) 10Giuseppe Lavagetto: [C: 032] Add coverage reporting to CI [software/conftool] - 10https://gerrit.wikimedia.org/r/320372 (owner: 10Giuseppe Lavagetto) [12:28:29] (03CR) 10Ema: [C: 032] 4.1.3-1wm3: Add 0005-remove_bad_extrachance_code.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/320371 (https://phabricator.wikimedia.org/T150247) (owner: 10Ema) [12:28:33] !log restarted ntp on mw1166, stuck in XFAC state [12:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:50] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review, 07Wikimedia-Incident: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927#2779342 (10BBlack) >>! In T93927#2717425, @BBlack wrote: > The downsides to switching to the internal stapling code: > > 1. It still d... [12:31:22] 06Operations, 06Discovery, 06Maps: Investigate how Kartotherian metrics are published and what they mean - https://phabricator.wikimedia.org/T149889#2779343 (10Gehel) Most of the metrics published by kartotherian are "markers": ``` gehel@graphite1001:/var/lib/carbon/whisper$ find kartotherian/marker/ -type... [12:31:46] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] [12:31:51] jouncebot: next [12:31:51] In 1 hour(s) and 28 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161108T1400) [12:31:56] RECOVERY - NTP on mw1166 is OK: NTP OK: Offset 0.0779427886 secs [12:33:24] (03CR) 10Dereckson: "We perhaps need a consistent policy here, because regularly there are changes to clean up and get rid of config repeating default values." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/320352 (https://phabricator.wikimedia.org/T150146) (owner: 10Arseny1992) [12:34:46] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [12:37:36] (03CR) 10jenkins-bot: [V: 04-1] Raise nagios retry_interval to avoid false alarms for HHVM restarts [puppet] - 10https://gerrit.wikimedia.org/r/320361 (https://phabricator.wikimedia.org/T147773) (owner: 10Elukey) [12:37:58] apergos: o/ [12:38:07] hey [12:38:12] I am seeing tons of errors in https://logstash.wikimedia.org/app/kibana#/dashboard/memcached [12:38:42] especially for SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY [12:38:53] can you check if these hosts are the one rebooted? [12:39:10] last one is mw1198 [12:39:42] !log rebooting iron for kernel update [12:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:56] mmmm mw1198 has been up for 6 mins [12:40:13] 1198 was rebooted a bi ago, yes [12:40:56] PROBLEM - NTP on mw1194 is CRITICAL: NTP CRITICAL: Offset unknown [12:41:14] now doing 1201-1216, then 1170-1188 in order [12:41:25] that will be the last of them [12:41:30] !log restarted ntp on mw1194, stuck in XFAC state [12:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:50] moritzm: maybe we should just switch to systemd-timesyncd [12:42:06] RECOVERY - NTP on hafnium is OK: NTP OK: Offset 0.001634895802 secs [12:42:10] (not kidding) [12:42:33] is systemd project basically going to overhaul every single daemon/old unix systems out there? [12:42:34] apergos: any special procedure that you are following? It is the first time that I see these failures :( [12:42:42] (03PS1) 10Gehel: maps - fix resource dependencies for tiles database creation [puppet] - 10https://gerrit.wikimedia.org/r/320381 (https://phabricator.wikimedia.org/T147223) [12:42:50] downtime in icinga, depool, reboot, repool [12:42:57] mw contacts nutcracker in localhost and it seems failing [12:43:01] (03PS1) 10Jcrespo: Depool db1080 to deploy safely a long-running schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320382 (https://phabricator.wikimedia.org/T139090) [12:44:28] From Mw fatals: Warning: unable to connect to unix:///var/run/nutcracker/redis_eqiad.sock [2]: No such file or directory [12:44:46] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] [12:45:40] I've paused things at mw1203.eqiad.wmnet so we can investigate if you like, elukey [12:45:51] mw1201 is the last one alarming afaics [12:46:20] 06Operations, 10ops-eqiad, 06DC-Ops, 10Traffic, and 2 others: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#2779419 (10faidon) [12:46:23] 06Operations, 10ops-eqiad, 06DC-Ops, 10netops: asw-d-eqiad SNMP failures - https://phabricator.wikimedia.org/T112781#2779417 (10faidon) 05stalled>03declined asw-d-eqiad is on its way to being decom'ed (or upgraded and repurposed) and replaced by asw2-d-eqiad, a QFX5100/EX4300 stack. The servers that i... [12:46:33] paravoid: I have been thinking of the same actually, I'm using it on my sid laptop for a while, but would probably need some tests with the jessie version. 
will open a task [12:47:27] yeah, I've been using it for months on my (Debian testing) laptop as well, for whatever that's worth [12:47:46] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] [12:48:06] so net.netfilter.nf_conntrack_tcp_timeout_time_wait is 120 [12:49:17] but count vs max seems good [12:49:36] (03CR) 10Mark Bergsma: [C: 032] Reflect new FPC3 ports after cr1-/cr2-eqiad FPC5 decommissioning [dns] - 10https://gerrit.wikimedia.org/r/319617 (https://phabricator.wikimedia.org/T149196) (owner: 10Mark Bergsma) [12:49:46] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] [12:51:35] and now it is mw1202 [12:51:48] 06Operations, 10Traffic, 10netops: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1005 - https://phabricator.wikimedia.org/T150256#2779434 (10faidon) [12:51:56] 06Operations, 10netops, 13Patch-For-Review: Migrate links from cr1-eqiad/cr2-eqiad fpc 5 to fpc 3 - https://phabricator.wikimedia.org/T149196#2779450 (10mark) 05Open>03Resolved [12:52:28] !log upgrading pinkunicorn to varnish 4.1.3-1wm3 T150247 [12:52:33] apergos: it seems like the host is not able to contact memcached after boot [12:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:34] T150247: Varnish4 is unexpectedly retrying certain applayer failure cases - https://phabricator.wikimedia.org/T150247 [12:52:38] and then it goes away [12:52:50] it does seem to be transient on boot issue all right [12:53:46] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [12:54:22] apergos: let's wait a bit more to double check while investigating [12:54:28] sure [12:56:23] 06Operations, 10netops: cr1-eqiad:ae4 is disabled due to VRRP issue - https://phabricator.wikimedia.org/T149226#2779462 (10faidon) 05Open>03Resolved a:03faidon This VRRP issue was the case before the cr1/2-eqiad upgrade as well, so this was likely due to some asw-d-eqiad snafu (not propagating VRRP multi... [12:56:45] 06Operations, 10MediaWiki-Authentication-and-authorization, 10MediaWiki-User-login-and-signup, 10Traffic, 07HTTPS: After login, user not logged in when "prefershttps" set to false and "wgSecureLogin" set to true - https://phabricator.wikimedia.org/T149977#2779467 (10BBlack) [12:59:51] 06Operations: Evaluate use of systemd-timesyncd on jessie for clock synchronisation - https://phabricator.wikimedia.org/T150257#2779478 (10MoritzMuehlenhoff) [12:59:56] RECOVERY - NTP on mw1194 is OK: NTP OK: Offset 0.0008826851845 secs [13:03:42] (03CR) 10Nikerabbit: [C: 04-1] "In this case it seems better just to remove the line completely. There might be special cases where it makes sense to repeat the defaults," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320352 (https://phabricator.wikimedia.org/T150146) (owner: 10Arseny1992) [13:04:19] (03CR) 10Gehel: [C: 032] maps - fix resource dependencies for tiles database creation [puppet] - 10https://gerrit.wikimedia.org/r/320381 (https://phabricator.wikimedia.org/T147223) (owner: 10Gehel) [13:05:25] hashar: o/ [13:05:32] do you have a minute for https://integration.wikimedia.org/ci/job/pplint-HEAD/25820/console ? 
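On the systemd-timesyncd idea above (now T150257): timesyncd is a simple SNTP client, so it only disciplines the local clock and cannot serve time to other hosts, but it avoids ntpd peer-state problems like the stuck XFAC associations seen in this log. A minimal sketch of what switching a host over might look like; the NTP server names are placeholders:

```
# Sketch only; server names are placeholders for the real internal NTP servers.
cat > /etc/systemd/timesyncd.conf <<'EOF'
[Time]
NTP=ntp1.example.wmnet ntp2.example.wmnet
FallbackNTP=0.debian.pool.ntp.org
EOF
systemctl stop ntp && systemctl disable ntp
systemctl enable systemd-timesyncd && systemctl restart systemd-timesyncd
timedatectl status    # should report the clock as NTP-synchronized once it syncs
```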
[13:05:36] yeah [13:05:47] yeah tis bugged [13:05:54] I am not sure why it checks all the manifests [13:05:54] the job does a shallow clone [13:06:01] ah ok so I am not crazy [13:06:02] and attempts to lint files changed in HEAD^ [13:06:14] which turns out to be the whole repo on a second patchset in a chain [13:06:15] errr [13:06:24] eg you got (production) -> change A -> change B [13:06:34] on change B HEAD^ is somehow the whole repo [13:06:40] !log rolling reboot of restbase-test for kernel update [13:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:22] (03PS2) 10Arseny1992: Enable translation memory of Translate for frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320352 (https://phabricator.wikimedia.org/T150146) [13:08:27] (we are checkign mw1203 now, not repooled.) [13:08:53] <_joe_> elukey: I guess we should make hhvm.service depend on nutcracker.service [13:09:18] yeah [13:09:34] I needed to run puppet manually to populate /var/run/nutcracker [13:09:49] Dereckson , Nikerabbit , ^ ;) [13:10:29] elukey: gonna ninja fix it :D [13:10:40] * elukey hugs hashar [13:12:06] PROBLEM - puppet last run on labstore1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:12:16] (03CR) 10Nikerabbit: [C: 031] "+1 but someone needs to run the script after deploying." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320352 (https://phabricator.wikimedia.org/T150146) (owner: 10Arseny1992) [13:12:46] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24 [13:12:46] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:13:31] 06Operations, 10Traffic, 13Patch-For-Review: Varnish4 is unexpectedly retrying certain applayer failure cases - https://phabricator.wikimedia.org/T150247#2779504 (10BBlack) Pasting from IRC after testing the patch on cp1008/pinkunicorn (TL;DR - patch works as expected, bad retries gone): ``` 13:04 < bblack>... [13:13:46] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [13:14:16] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:14:36] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [13:14:47] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:15:16] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [13:15:26] PROBLEM - puppet last run on elastic1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/profile.d/field.sh] [13:15:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [13:15:36] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [13:15:46] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [13:15:54] things are slow :( [13:16:36] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [13:16:46] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] [13:17:40] ok so mw1203 is back in servie [13:17:43] *service [13:18:46] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [13:19:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [13:19:31] I see some errors from these hosts rebooted earlier [13:19:36] not as many but they add up [13:20:03] can this be puppet running? [13:20:17] mw1269 but puppet ran 9 minutes ago there [13:21:05] at this point I want to check mc* [13:21:06] PROBLEM - puppet last run on mw1295 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:21:13] (03CR) 10Faidon Liambotis: [C: 04-1] "This looks OK, but freeipmi is a metapackage depending on the individual different components of freeipmi and I'd rather depend explicitly" [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [13:21:46] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] [13:22:36] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0] [13:22:49] [2016-11-08 13:22:39.462] nc_connection.c:423 sendv on sd 19 failed: Broken pipe [13:23:08] lots of things like this in the logs [13:23:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [13:23:30] which ones? [13:23:46] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [13:23:59] /var/log/nutcracker/nutcracker.log on mw1269 [13:26:46] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] [13:27:29] elukey: should be fixed now ( https://gerrit.wikimedia.org/r/320389 ) [13:27:55] hashar: thanks! [13:28:16] RECOVERY - puppet last run on elastic1017 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [13:28:46] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:46] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:46] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
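Picking up _joe_'s suggestion above of making hhvm.service depend on nutcracker.service, so that after a reboot HHVM only starts once nutcracker's sockets under /var/run/nutcracker exist: a minimal sketch of a systemd drop-in, not a tested or puppetized change:

```
# Sketch: order hhvm after nutcracker via a drop-in (run as root).
mkdir -p /etc/systemd/system/hhvm.service.d
cat > /etc/systemd/system/hhvm.service.d/nutcracker.conf <<'EOF'
[Unit]
After=nutcracker.service
Wants=nutcracker.service
EOF
systemctl daemon-reload
```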
[13:29:36] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [13:30:36] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [13:30:46] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [13:30:46] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [13:30:46] PROBLEM - puppet last run on analytics1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:30:55] ah snap iptables-dropped on mc* hosts [13:30:57] moritzm: --^ [13:31:16] damn.. [13:31:17] no sorry [13:31:18] old ones [13:31:21] false alarm [13:31:22] :P [13:31:23] ah whew [13:31:30] Moritz don't kill me please :D [13:31:56] it was an event in August [13:31:58] anyhow [13:32:04] so the errors we see now look like from hosts that have not been rebooted today: mw126* mw127* [13:32:10] if I am reading kibana right [13:32:46] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24 [13:34:09] PROBLEM - ElasticSearch health check for shards on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 970 threshold =0.1% breach: status: yellow, number_of_nodes: 30, unassigned_shards: 859, number_of_pending_tasks: 52, number_of_in_flight_fetch: 1436, timed_out: False, active_primary_shards: 3040, task_max_waiting_in_queue_millis: 122407, cluster_name: production-search-eqiad, relocating_shards: 0, ac [13:34:36] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [13:34:51] elukey: which host are you looking at? [13:34:53] * gehel is checking elasticsearch... [13:35:13] moritzm: mc* to see if there was a memcached issue, but I looked into old dmesg entries, sorry for the ping [13:35:40] ok :-) [13:35:40] (03PS1) 10Jcrespo: Revert "Repool db2042 - the maintenance is post poned as db2034 has hardware issues and cannot even receive all the data (T149553#2776069)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320392 [13:35:57] elasticsearch cluster is yellow, but recovering... no immediate danger... [13:36:01] good [13:36:16] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:36:26] (03CR) 10Jcrespo: "Let's doing anyway for reimage + upgrade." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320392 (owner: 10Jcrespo) [13:36:35] (03PS2) 10Jcrespo: Revert "Repool db2042 - the maintenance is post poned as db2034 has hardware issues and cannot even receive all the data (T149553#2776069)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320392 [13:37:35] apergos: can you confirm that this issue is happening also on mw servers that you didn't reboot? [13:37:46] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:37:46] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:37:46] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:37:46] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
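On the hhvm/nutcracker ordering idea _joe_ raised at 13:08 (hhvm coming up before nutcracker has created /var/run/nutcracker, hence the "Broken pipe" entries in nutcracker.log on mw1269): one way to express that dependency is a systemd drop-in. This is a sketch only, using the unit names from the conversation; the change that was actually deployed may look different:

    # Hypothetical drop-in: order hhvm after nutcracker and pull nutcracker in,
    # so its runtime directory and sockets exist before hhvm starts connecting.
    cat <<'EOF' | sudo tee /etc/systemd/system/hhvm.service.d/nutcracker.conf
    [Unit]
    After=nutcracker.service
    Wants=nutcracker.service
    EOF
    sudo systemctl daemon-reload

Whether Wants= (start it, but keep hhvm up if nutcracker dies) or Requires= (tie hhvm's fate to nutcracker) is appropriate depends on how hard the dependency should be.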
[13:37:46] to rule out some variables [13:37:56] this one looks flapping like our alarms [13:38:01] (03:32:04 μμ) apergos: so the errors we see now look like from hosts that have not been rebooted today: mw126* mw127* [13:38:01] (03:32:10 μμ) apergos: if I am reading kibana right [13:38:01] elasticsearch1024 and 1025 have left the cluster, checking [13:38:06] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [13:38:14] apergos: sorry didn't read it, thanks :) [13:38:16] PROBLEM - puppet last run on elastic1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:38:24] (03PS3) 10Jcrespo: Depool db2042 for reimage + upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320392 [13:38:24] no worries, too much backread already [13:38:36] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [13:39:06] RECOVERY - puppet last run on labstore1005 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [13:39:46] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [13:39:46] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [13:39:57] the ones in eqiad rebooted by me for eqiad were api servers only, 1189-1203. nothing else in eqiad at all [13:40:36] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [13:41:08] jouncebot: neilpquinn [13:41:12] jouncebot: next [13:41:12] In 0 hour(s) and 18 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161108T1400) [13:41:18] !next [13:42:01] !log deferring reboots of mw1204-1216 and mw1170-1188 for a while [13:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:46] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24 [13:42:59] (03CR) 10Marostegui: [C: 031] Depool db1080 to deploy safely a long-running schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320382 (https://phabricator.wikimedia.org/T139090) (owner: 10Jcrespo) [13:43:16] (03Abandoned) 10Hashar: Allow a full text search button on Commons whenever possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186916 (https://phabricator.wikimedia.org/T19471) (owner: 10Nemo bis) [13:43:32] (03CR) 10Marostegui: [C: 031] Depool db2042 for reimage + upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320392 (owner: 10Jcrespo) [13:43:56] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:44:06] (03Abandoned) 10Hashar: Add $wgMassMessageWikiAliases configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237687 (owner: 10Legoktm) [13:44:26] (03CR) 10Jcrespo: [C: 032] Depool db1080 to deploy safely a long-running schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320382 (https://phabricator.wikimedia.org/T139090) (owner: 10Jcrespo) [13:44:36] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [13:44:46] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [13:44:56] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [13:44:57] (03PS4) 10Jcrespo: Depool db2042 for reimage + upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320392 [13:44:58] 06Operations, 10Mobile-Content-Service, 10Reading Web Trending service, 07Service-deployment-requests, and 2 others: New Service Request for Trending Edits Service - https://phabricator.wikimedia.org/T150043#2779628 (10mobrovac) p:05Triage>03High a:03mobrovac [13:45:06] PROBLEM - configured eth on wmf4750 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:45:16] PROBLEM - salt-minion processes on mc1033 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:45:16] PROBLEM - puppet last run on db1094 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:45:16] PROBLEM - puppet last run on mw1232 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:45:30] ACKNOWLEDGEMENT - ElasticSearch health check for shards on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 1146 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1047, number_of_pending_tasks: 7, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3040, task_max_waiting_in_queue_millis: 28704, cluster_name: production-search-eqiad, relocating_shards: [13:46:06] RECOVERY - salt-minion processes on mc1033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:46:06] RECOVERY - puppet last run on db1094 is OK: OK: Puppet is currently enabled, last run 15 minutes ago with 0 failures [13:46:06] RECOVERY - puppet last run on mw1232 is OK: OK: Puppet is currently enabled, last run 7 minutes ago with 0 failures [13:46:14] elukey, still yellow? [13:46:32] (03Abandoned) 10Hashar: Add pixabay.com to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276069 (owner: 10Rillke) [13:46:44] sorry, wrong person [13:47:26] PROBLEM - Apache HTTP on mw1220 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [13:47:56] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:48:26] RECOVERY - Apache HTTP on mw1220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.030 second response time [13:48:44] strange, response times of elasticsearch eqiad have been increasing slightly before the alert [13:48:56] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [13:50:06] RECOVERY - puppet last run on mw1295 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [13:50:16] hashar: ready for eu swat? 
:) [13:50:34] (03PS2) 10Hashar: LABS: added beta.wmflabs.org to graphs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320344 (owner: 10Yurik) [13:50:36] (03PS2) 10Hashar: Enable RevisionSlider (non beta feature) on beta sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319539 (https://phabricator.wikimedia.org/T149725) (owner: 10Addshore) [13:50:46] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:51:00] gehel: when did it start? [13:51:03] *waves* [13:51:12] zeljkof: yeah I have rebased the patch for wmf-config [13:51:17] and already CR+2 the Kartographer patch [13:51:46] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [13:51:58] hashar and zeljkof, can we hold on swat for a little bit? [13:52:00] hashar: ok, so you are doing swat then today? [13:52:15] sorry to ask this but there's a couple issues we want to resolve first [13:52:26] apergos: sure. what is happening? [13:52:28] elukey: https://grafana-admin.wikimedia.org/dashboard/db/elasticsearch-percentiles?from=now-3h&to=now [13:52:35] elukey: looks like 13:10 UTC [13:52:43] sorry, I have to go... [13:53:07] gehel: memcached issue at the same time [13:53:10] elukey: dcausse is there if you need search expertise... [13:53:26] hashar: seeing some grafana errors for elastic and memcached and not sure of the cause [13:53:54] we definitely had some oddness with hhvm repools after server restarts but we're well past that now [13:54:02] scb was also flapping and I can see CHECK_NRPE errors [13:55:13] * hashar blames the mw background jobs [13:56:54] I have one wmf-config change merged but for obvious reasons not deployed, FYI [13:57:46] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24 [13:58:06] PROBLEM - configured eth on wmf4750 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:58:46] RECOVERY - puppet last run on analytics1045 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [13:58:52] !log European SWAT on hold while some memcached/elasticsearch issues are being figured out [13:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:36] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161108T1400). [14:00:04] Addshore and yurik: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:43] apergos: elukey: dcausse: who is looking at the ElasticSearch response time ? [14:00:50] Who's SWATing today? :) (hi) [14:01:04] seems the 99th response time surged to 2 minutes starting from 13:00 UTC [14:01:07] mafk: on hold for now [14:01:13] mafk -> hashar [14:01:15] mafk: but that would be me :) [14:01:19] possible network issue is being sorted out first, hashar [14:01:24] yep [14:01:33] blame the network first :D [14:01:54] hashar: ok, it's because I've got a patch I want to have merged but if you've not started yet I might even be able to make it :) [14:02:05] addshore around? [14:02:10] yup! 
[14:02:21] mafk: yeah just add your patch to the deployment section on https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161108T1400 :) [14:02:25] here [14:02:39] yurik: swat on hold while some other prod issue is figured out [14:02:52] yep, I'm patching it, just don't finish before I'm done! :D [14:03:07] super easy change btw https://phabricator.wikimedia.org/T150252 [14:03:10] RECOVERY - ElasticSearch health check for shards on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 802, number_of_pending_tasks: 3, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3040, task_max_waiting_in_queue_millis: 146246, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_numb [14:03:16] heh [14:03:50] the save timing surged from 750ms to 2secs starting 13:00UTC but I guess that is all related [14:03:56] (from https://grafana.wikimedia.org/dashboard/db/performance-metrics?from=now-6h&to=now ) [14:04:07] :-) [14:05:10] hashar, is it ok to merge the labs config changes? [14:06:16] RECOVERY - puppet last run on elastic1029 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [14:06:46] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [14:07:32] \o/ [14:08:07] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic: CentralNotice: Review and update Varnish caching for Special:BannerLoader - https://phabricator.wikimedia.org/T149873#2767284 (10Gilles) >>! In T149873#2767335, @aaron wrote: > The first approach might work using Varnish... [14:08:15] so if there is going to be a nutcracker restart, I would say let that go first and wait a couple mins, then we could presumably resume swat [14:08:32] (03PS1) 10MarcoAurelio: Set timezone for bdwikimedia to 'Asia/Dhaka' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320397 (https://phabricator.wikimedia.org/T150252) [14:08:36] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [14:09:33] addshore , fyi i proposed the revisionslider change as non-beta also to cawiki and hewiki. Due to both of them being in group1 and so them being ok for being early, I proposed to schedule to next wednesday (a day after your deploy to group0) . Waiting for some votes, and if local consensus out to be ok I'll prepare a task (or is there one already where wikis sign up for this, like [14:09:33] dewiki did?) and patch [14:09:44] !log rebooting chromium for kernel update [14:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:56] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:09:56] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:09:56] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:09:56] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:09:56] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:09:57] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:09:57] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:09:58] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:09:58] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [14:10:06] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:06] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:06] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:06] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:06] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:07] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:07] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:14] arseny92: we will deploy to wikis that request at the same time as we deploy to dewiki [14:10:16] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:16] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:10:16] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:16] PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - ores_8081 - Could not depool server scb1003.eqiad.wmnet because of too many down! [14:10:21] there was a bunch of errors trying to connect to the pool counter server. Timed out until 14:00UTC [14:10:36] PROBLEM - PyBal backends health check on lvs1009 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1001.eqiad.wmnet because of too many down!: ores_8081 - Could not depool server scb1001.eqiad.wmnet because of too many down!: aqs_7232 - Could not depool server aqs1004.eqiad.wmnet because of too many down!: api_80 - Could not depool server mw1278.eqiad.wmnet because of too many down!: prometheus_80 - Co [14:10:37] please just make a ticket, and add the RevisionSlider tag arseny92 and we will add it to the patch! [14:10:46] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:46] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:46] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:46] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:46] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:47] PROBLEM - restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:56] PROBLEM - puppet last run on elastic1024 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:11:08] ok icinga failures is getting out of hand here [14:11:16] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [14:11:29] can we stop breaking new things, and what's going on. [14:11:44] bblack, network problem on row D [14:11:46] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [14:11:46] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [14:11:46] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [14:11:47] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:11:49] yurik: nop. Holding everything until prod is all fine [14:11:55] oki [14:11:56] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [14:11:56] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [14:11:56] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [14:11:59] and there is a DB patch that has to be synced [14:12:11] I am holding that for now [14:12:11] yurik: I am not worried about your changes though :D [14:12:29] PYBAL CRITICAL - rendering_80 - Could not depool server mw1293.eqiad.wmnet because of too many down!: ores_8081 - Could not depool server scb1003.eqiad.wmnet because of too many down!: ocg_8000 - Could not depool server ocg1003.eqiad.wmnet because of too many down!: api_80 - Could not depool server mw1206.eqiad.wmnet because of too many down!: prometheus_80 - Could not depool server prometheus [14:12:35] 1001.eqiad.wmnet because of too many down!: kartotherian_6533 - Could not depool server maps1004.eqiad.wmnet because of too many down!: search-https_9243 - Could not depool server elastic1035.eqiad.wmnet because of too many down!: graphoid_19000 - Could not depool server scb1003.eqiad.wmnet because of too many down!: mobileapps_8888 - Could not depool server scb1001.eqiad.wmnet because of too [14:12:36] RECOVERY - PyBal backends health check on lvs1009 is OK: PYBAL OK - All pools are healthy [14:12:41] many down!: wdqs_80 - Could not depool server wdqs1001.eqiad.wmnet because of too many down!: aqs_7232 - Could not depool server aqs1006.eqiad.wmnet because of too many down!: cxserver_8080 - Could not depool server scb1001.eqiad.w [14:12:45] how do we have this many different services with "too many down" ???? 
[14:12:46] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [14:12:46] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [14:12:46] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [14:12:46] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [14:12:46] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [14:12:47] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [14:12:47] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [14:12:48] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [14:12:56] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [14:12:56] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [14:12:56] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [14:12:56] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [14:12:56] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [14:12:57] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [14:12:57] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [14:12:59] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [14:12:59] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [14:12:59] addshore isn't there a ticket already for wiki signups, as I don't want to do duplicates [14:13:00] oh nevermind, ignore that one, that's an inactive LVS [14:13:01] still [14:13:06] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [14:13:06] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [14:13:16] RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy [14:13:18] lvs1003 is in service and has similar [14:13:20] PYBAL CRITICAL - eventbus_8085 - Could not depool server kafka1002.eqiad.wmnet because of too many down!: apaches_80 - Could not depool server mw1253.eqiad.wmnet because of too many down!: search-https_9243 - Could not depool server elastic1030.eqiad.wmnet because of too many down!: prometheus_80 - Could not depool server prometheus1002.eqiad.wmnet because of too many down!: rendering_80 - Cou [14:13:26] ld not depool server mw1296.eqiad.wmnet because of too many down!: wdqs_80 - Could not depool server wdqs1001.eqiad.wmnet because of too many down!: mobileapps_8888 - Could not depool server scb1003.eqiad.wmnet because of too many down!: restbase_7231 - Could not depool server restbase1008.eqiad.wmnet because of too many down!: search_9200 - Could not depool server elastic1018.eqiad.wmnet beca [14:13:32] use of too many down!: thumbor_8800 - Could not depool server thumbor1001.eqiad.wmnet because of too many down!: ores_8081 - Could not depool server scb1004.eqiad.wmnet because of too many down!: aqs_7232 - Could not depool server [14:13:43] eventbus, apaches, search, prometheus, restbase, mobileapps, thumbor, ??? 
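A note on bblack's question just above about so many services reporting "too many down": that wording comes from PyBal's depool threshold, which stops it from depooling a backend once doing so would leave fewer healthy servers than the configured fraction, so a brief burst of failed health checks across a whole row produces exactly this flood of log lines. A sketch of the relevant per-service setting; the key name is believed to be "depool-threshold", but the section name, value, and host are illustrative rather than copied from the production config:

    cat <<'EOF'
    [apaches]
    depool-threshold = .5
    EOF
    # With 20 pooled apaches and a threshold of .5, PyBal keeps at least 10 pooled:
    # once half have failed their checks it stops depooling and instead logs
    # "Could not depool server mwNNNN.eqiad.wmnet because of too many down!"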
[14:14:39] arseny92: no, please just make a ticket asking for it to be moved out of beta (for example https://phabricator.wikimedia.org/T149995) [14:16:33] Maybe it's related to what it's happening, but I'm getting logged-out repeatedly [14:17:05] 06Operations, 10Monitoring, 10netops: Icinga check for VRRP - https://phabricator.wikimedia.org/T150264#2779751 (10faidon) [14:17:09] ok, but need to wait a while, so maybe tomorrow, i like just proposed that. Just informing you [14:17:16] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [14:17:30] bblack: well they are/have recoved [14:17:35] recovered [14:18:44] yeah they are [14:18:54] and speaking of arwiki, did you already prepare the patch? [14:19:44] hey is the 503 issue common knowledge? [14:20:08] not sure [14:20:11] I get seemingly random 503 errors which go away up on refresh [14:20:34] still? [14:20:39] (and since when?) [14:21:18] 14:07-14:13 seems like the outer boundary of the known 503-spike [14:21:34] (that's ~14 minutes ago to ~8 minutes ago) [14:21:43] it happened quite a bit yesterday in #wikimedia-commons some other user was reporting it [14:22:14] looks like for the last 10 minutes we have 5xx reported by icinga here [14:22:26] for eqiad [14:22:52] icinga's anomaly reports are always laggy and strange though [14:23:10] when we have spikes, they report them late, and persist in reporting them long after they're gone heh [14:23:16] yeah [14:23:33] yeah, I got a 503 when tagging an image for deletion at 15:11 (https://commons.wikimedia.org/wiki/MediaWiki_talk:Gadget-AjaxQuickDelete.js/auto-errors#Autoreport_by_AjaxQuickDelete_468995897421) [14:23:35] that is due to how icinga scheduled the checks and retries them X times before actually sending the notification [14:24:47] eg it might check every 5 minutes, retry up to 3 times every 1 minute before triggering notifications [14:25:12] ToAruShiroiNeko: if you're having longer term problems that started yesterday, should probably raise that separately with a ticket and details [14:25:16] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:25:29] I did not pay attention to it [14:25:36] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:25:52] (03PS2) 10Giuseppe Lavagetto: Add travis build support [software/conftool] - 10https://gerrit.wikimedia.org/r/320375 [14:27:28] 06Operations, 10ops-codfw, 10DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2779776 (10Marostegui) @Papaul can we do this on Thursday? On Wednesday night I will take a snapshot of dbstore2001 so by Thursday we should be good to go on Thursday. I have been talking to @Vo... [14:27:42] should I process with the few patches we have for SWAT ? 
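To make hashar's point above about check scheduling concrete (checks every few minutes, several retries before a notification): in an Icinga/Nagios-style service definition the lag comes from the interplay of the interval and retry settings, which is why a short 503 spike can be reported only after it has already passed. An illustrative fragment with made-up values, not the production definition:

    cat <<'EOF'
    define service {
        service_description  Esams HTTP 5xx reqs/min
        check_interval       5   ; minutes between checks while the service is OK
        retry_interval       1   ; minutes between rechecks while SOFT-critical
        max_check_attempts   3   ; failed rechecks before the state goes HARD and notifies
        # host_name, check_command and the other required directives omitted
    }
    EOF
    # Worst case with these values: up to 5 + (3 - 1) * 1 = 7 minutes between a spike
    # starting and the first CRITICAL notification, and recoveries lag the same way.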
[14:29:13] asking the folsk who worked on the issue [14:29:40] (03PS3) 10Giuseppe Lavagetto: Add travis build support [software/conftool] - 10https://gerrit.wikimedia.org/r/320375 [14:30:46] (03CR) 10Giuseppe Lavagetto: [C: 032] Add travis build support [software/conftool] - 10https://gerrit.wikimedia.org/r/320375 (owner: 10Giuseppe Lavagetto) [14:31:52] still getting issues [14:32:03] Failed to load resource: the server responded with a status of 503 () [14:32:05] yeah [14:32:26] I am not even doing anything Josve05a :p [14:32:46] PROBLEM - HHVM rendering on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:32:46] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:32:56] PROBLEM - Apache HTTP on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:32:56] PROBLEM - Apache HTTP on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:32:56] PROBLEM - Apache HTTP on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:32:56] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:32:56] PROBLEM - Apache HTTP on mw1222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:06] PROBLEM - Apache HTTP on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:06] PROBLEM - Apache HTTP on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:06] PROBLEM - HHVM rendering on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:06] PROBLEM - Apache HTTP on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:06] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:07] PROBLEM - Apache HTTP on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:07] PROBLEM - HHVM rendering on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:16] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:16] PROBLEM - HHVM rendering on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:16] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [14:33:26] PROBLEM - graphoid endpoints health on scb1004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [14:33:26] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - eventbus_8085 - Could not depool server kafka1003.eqiad.wmnet because of too many down!: thumbor_8800 - Could not depool server thumbor1001.eqiad.wmnet because of too many down!: api_80 - Could not depool server mw1234.eqiad.wmnet because of too many down! 
[14:33:26] PROBLEM - HHVM rendering on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [14:33:26] PROBLEM - HHVM rendering on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:27] PROBLEM - HHVM rendering on mw1222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:27] PROBLEM - Apache HTTP on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:36] PROBLEM - Apache HTTP on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:36] PROBLEM - HHVM rendering on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:36] PROBLEM - Apache HTTP on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:36] PROBLEM - PyBal backends health check on lvs1009 is CRITICAL: PYBAL CRITICAL - eventbus_8085 - Could not depool server kafka1003.eqiad.wmnet because of too many down!: api_80 - Could not depool server mw1281.eqiad.wmnet because of too many down! [14:33:36] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - eventbus_8085 - Could not depool server kafka1003.eqiad.wmnet because of too many down!: api_80 - Could not depool server mw1234.eqiad.wmnet because of too many down! [14:33:37] PROBLEM - graphoid endpoints health on scb1003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [14:33:37] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:38] PROBLEM - HHVM rendering on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:38] PROBLEM - HHVM rendering on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:39] PROBLEM - HHVM rendering on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:40] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:40] PROBLEM - Apache HTTP on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:40] PROBLEM - Apache HTTP on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:41] PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:46] PROBLEM - graphoid endpoints health on scb2002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [14:33:46] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:33:46] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [14:33:46] PROBLEM - restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:33:46] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:33:47] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:33:58] PROBLEM - HHVM rendering on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:58] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:33:59] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:33:59] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:34:00] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:34:00] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:34:01] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:34:01] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:34:06] PROBLEM - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [14:34:06] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:34:06] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:34:06] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:34:06] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:34:07] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:34:07] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:34:08] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:34:08] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:34:16] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:34:16] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 72211 bytes in 7.590 second response time [14:34:16] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:34:16] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:34:16] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [14:34:17] PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - api_80 - Could not depool server mw1288.eqiad.wmnet because of too many down! [14:34:20] hashar, to state the obvious, no swat for now [14:34:26] PROBLEM - puppet last run on ocg1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:34:26] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 7.622 second response time [14:34:26] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 72211 bytes in 9.389 second response time [14:34:26] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 7.516 second response time [14:34:26] RECOVERY - HHVM rendering on mw1202 is OK: HTTP OK: HTTP/1.1 200 OK - 72211 bytes in 7.603 second response time [14:34:36] apergos: yeah :-} [14:34:36] RECOVERY - Apache HTTP on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 9.302 second response time [14:34:36] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [14:34:36] RECOVERY - HHVM rendering on mw1190 is OK: HTTP OK: HTTP/1.1 200 OK - 72211 bytes in 7.216 second response time [14:34:36] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 6.825 second response time [14:34:36] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 72211 bytes in 6.985 second response time [14:34:37] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 9.018 second response time [14:34:37] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 72211 bytes in 9.256 second response time [14:34:38] RECOVERY - Apache HTTP on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 7.522 second response time [14:34:38] PROBLEM - puppet last run on ms-be1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:34:39] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [14:34:46] RECOVERY - HHVM rendering on mw1206 is OK: HTTP OK: HTTP/1.1 200 OK - 72211 bytes in 4.988 second response time [14:34:54] (03PS1) 10Jcrespo: Depool db servers on row D except es1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320401 (https://phabricator.wikimedia.org/T148506) [14:34:56] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 5.332 second response time [14:34:56] RECOVERY - HHVM rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 72211 bytes in 4.644 second response time [14:35:06] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:35:06] RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 7.843 second response time [14:35:06] RECOVERY - Apache HTTP on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 6.974 second response time [14:35:06] RECOVERY - eventlogging-service-eventbus endpoints health on kafka1001 is OK: All endpoints are healthy [14:35:06] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 72211 bytes in 6.264 second response time [14:35:16] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [14:35:27] HmMm [14:35:28] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [14:35:28] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 72211 bytes in 9.520 second response time [14:35:28] RECOVERY - HHVM rendering on mw1223 is OK: HTTP OK: HTTP/1.1 200 OK - 72211 bytes in 1.027 second response time [14:35:33] (03CR) 10jenkins-bot: [V: 04-1] Depool db servers on row D except es1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320401 (https://phabricator.wikimedia.org/T148506) (owner: 10Jcrespo) [14:35:36] RECOVERY - PyBal backends health check on lvs1009 is OK: PYBAL OK - All pools are healthy [14:35:40] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 6.406 second response time [14:35:40] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [14:35:46] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.980 second response time [14:35:46] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [14:35:47] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [14:35:47] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 72211 bytes in 2.345 second response time [14:35:47] RECOVERY - Apache HTTP on mw1222 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 2.140 second response time [14:35:47] RECOVERY - Apache HTTP on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 2.687 second response time [14:35:56] RECOVERY - Apache HTTP on mw1281 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 2.987 second response time [14:35:56] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [14:35:56] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.995 second response time [14:35:56] RECOVERY - Apache HTTP on mw1232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 1.984 second response time [14:35:56] RECOVERY - HHVM rendering on mw1281 is OK: HTTP OK: HTTP/1.1 200 OK - 72211 bytes in 2.398 second response time [14:35:57] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [14:35:57] RECOVERY - HHVM rendering on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 72211 bytes in 1.945 second response time [14:36:06] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [14:36:06] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [14:36:06] RECOVERY - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is OK: All endpoints are healthy [14:36:06] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [14:36:07] RECOVERY - mobileapps endpoints health 
on scb2001 is OK: All endpoints are healthy [14:36:16] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [14:36:16] RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy [14:36:16] RECOVERY - HHVM rendering on mw1222 is OK: HTTP OK: HTTP/1.1 200 OK - 72209 bytes in 0.089 second response time [14:36:16] RECOVERY - graphoid endpoints health on scb1004 is OK: All endpoints are healthy [14:36:16] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy [14:36:36] RECOVERY - graphoid endpoints health on scb2002 is OK: All endpoints are healthy [14:36:36] RECOVERY - graphoid endpoints health on scb1003 is OK: All endpoints are healthy [14:36:46] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [14:36:46] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [14:36:46] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [14:36:46] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [14:36:46] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [14:36:47] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [14:36:47] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy [14:36:48] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [14:36:48] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [14:36:49] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [14:36:49] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [14:36:50] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [14:36:56] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [14:36:56] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [14:36:56] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [14:36:56] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [14:36:56] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [14:36:57] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [14:36:57] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [14:36:58] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [14:36:58] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [14:36:59] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [14:38:08] guessing we're coming back online? 
[14:38:16] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [14:38:16] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [14:38:36] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [1000.0] [14:38:46] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1725 bytes in 0.117 second response time [14:38:56] RECOVERY - puppet last run on elastic1024 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [14:44:58] (03PS2) 10Arseny1992: Enable RevisionSlider (non BetaFeature) on dewiki and arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319541 (https://phabricator.wikimedia.org/T148646) (owner: 10Addshore) [14:46:33] (03PS2) 10Jcrespo: Depool db servers on row D except es1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320401 (https://phabricator.wikimedia.org/T148506) [14:49:16] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:50:36] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:51:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0] [14:53:20] (03CR) 10Marostegui: Depool db servers on row D except es1019 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320401 (https://phabricator.wikimedia.org/T148506) (owner: 10Jcrespo) [14:58:40] hiii gehel, yt? [15:03:26] RECOVERY - puppet last run on ocg1003 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:03:36] RECOVERY - puppet last run on ms-be1011 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:03:54] ottomata: gehel said he had to go roughly one hour ago (13:53 UTC) [15:04:35] ah k [15:05:46] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:06:29] what happened? [15:09:13] we're not sure yet, still investigating. ignoring all the icinga spam above, the primary public fallout seems to be 2x spikes of 503 responses around 14:07->14:13 and 14:30->14:37 [15:09:43] ok [15:10:47] without a definitive cause, it's hard to say whether whatever it is will eventually happen again or not. but so far things seem stable since. [15:12:47] (03PS3) 10Jcrespo: Depool db servers on row D except es1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320401 (https://phabricator.wikimedia.org/T148506) [15:14:02] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#2779859 (10yuvipanda) I wonder if it'll be better to do this next quarter. We've already done a few bits of pretty disruptive maintenance, and have on... 
[15:15:12] ottomata: I'm back [15:16:24] yurik: I am pushing your Kartographer change https://gerrit.wikimedia.org/r/#/c/320345/1 [15:16:34] ok [15:16:37] heya, gehel, jmxtrans q [15:16:46] was wondering if there was a config to join together multiple typeNames in a key [15:16:51] using something other than _ [15:16:58] (03CR) 10Marostegui: [C: 032] Depool db servers on row D except es1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320401 (https://phabricator.wikimedia.org/T148506) (owner: 10Jcrespo) [15:17:09] _ makes the keys hard to query in grafana, e.g. [15:17:20] kafka701_analytics_eqiad_wmflabs_9997.kafka.consumer.ConsumerTopicMetrics.MessagesPerSec_kafka-mirror-main-analytics_to_analytics-0_test_otto2.FifteenMinuteRate [15:17:24] MessagesPerSec_kafka-mirror-main-analytics_to_analytics-0_test_otto2 [15:17:29] is 3 type names [15:17:29] (03Merged) 10jenkins-bot: Depool db servers on row D except es1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320401 (https://phabricator.wikimedia.org/T148506) (owner: 10Jcrespo) [15:17:35] ['name', 'clientId', 'topic'], [15:17:48] would be much nicer if this came out as [15:17:55] MessagesPerSec.kafka-mirror-main-analytics_to_analytics-0.test_otto2 [15:18:15] yurik: wanna test it on mw1099 ? [15:18:31] checking... [15:18:45] ottomata: I'll have to check... There is a long standing idea to implement a better templating mechanism [15:19:18] ottomata: but this part of jmxtrans is half black magic... [15:21:17] yeah, was just looking around in code gehel [15:21:37] for a sec i thought allowDottedKeys would do it, but that just keeps . from being removed from keys [15:21:51] hashar, all good [15:22:08] gehel: , although [15:22:09] https://github.com/jmxtrans/jmxtrans/blob/b405303de339f51bfd373d113709a8ee939692de/jmxtrans-core/src/main/java/com/googlecode/jmxtrans/model/naming/KeyUtils.java#L118-L124 [15:22:18] looks like the typeName seperator should be '.' [15:22:18] hm [15:22:21] maybe my version is too old.. [15:22:55] 242 [15:23:11] yurik: syncing [15:23:24] ottomata: yes, a lot has happens since 242 [15:23:53] !log hashar@tin Synchronized php-1.29.0-wmf.1/extensions/Kartographer/modules/maplink/maplink.js: Search .mw-body instead of #content to support all the skins - T150148 (duration: 00m 47s) [15:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:59] T150148: does not work on "Modern" skin - https://phabricator.wikimedia.org/T150148 [15:24:59] hashar: don't deploy [15:25:04] ahh, yeah, gehelhttps://github.com/jmxtrans/jmxtrans/blob/v242/src/com/googlecode/jmxtrans/util/JmxUtils.java#L744-L757 [15:25:08] i gues sin new version it just uses . [15:25:10] instead of _ [15:25:12] which is what I want... [15:25:15] HmM! [15:25:24] yurik: then the rest of swat changes I am claiming them as postponed/cancelled [15:25:25] anyone: don't deploy, we're still investigating issues from earlier [15:25:32] elukey: can you help remind me what the obstacle to upgrading jmxtrans was? [15:25:41] i used to know, but have forgotten, do you remember? 
[15:25:47] i think it was a logging bug, but i betcha its been fixed [15:26:39] ottomata: we needed to package the last upstream [15:26:53] and the test it on kafka/hadoop nodes [15:27:15] (03PS1) 10Marostegui: This was +2'ed by mistake [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320408 [15:27:30] hm [15:28:11] (03PS2) 10Marostegui: Revert "Depool db servers on row D except es1019" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320408 [15:28:27] (03CR) 10Hashar: [C: 031] "This change only impacts beta cluster. please deploy at anytime if production is all clear (which is not right now)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319539 (https://phabricator.wikimedia.org/T149725) (owner: 10Addshore) [15:28:51] (03CR) 10Hashar: "This change only impacts beta cluster. please deploy at anytime if production is all clear (which is not right now)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320344 (owner: 10Yurik) [15:28:56] (03CR) 10Hashar: [C: 031] LABS: added beta.wmflabs.org to graphs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320344 (owner: 10Yurik) [15:29:06] marostegui: more clear to put the revert line first, the comment afterwards, when you check the log with a oneline formatter [15:29:29] Dereckson: Ah, thanks for the correction :) [15:29:43] Dereckson: shall I +2 that then so it gets merged too? [15:30:33] marostegui: if it has been merged but not deployed, yes, so master state = what is in prod [15:30:41] elukey: yargh, yeah, and that is an undertaking... [15:30:41] hm [15:31:07] (03CR) 10Muehlenhoff: "That patch is not correct, all of those dictionaries do exist in jessie or jessie-wikimedia, I just doublechecked. The missing African lan" [puppet] - 10https://gerrit.wikimedia.org/r/319898 (owner: 10Yuvipanda) [15:31:12] Dereckson: yep, nothing was deployed, so I will +2 it then. Thanks [15:31:14] (03CR) 10Marostegui: [C: 032] Revert "Depool db servers on row D except es1019" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320408 (owner: 10Marostegui) [15:31:58] (03Merged) 10jenkins-bot: Revert "Depool db servers on row D except es1019" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320408 (owner: 10Marostegui) [15:34:32] (03CR) 10Hashar: [C: 031] "This change has not been deployed since production had issue during the European SWAT window. Please add it to the next window :}" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320397 (https://phabricator.wikimedia.org/T150252) (owner: 10MarcoAurelio) [15:35:11] (03PS1) 10Muehlenhoff: Only install python-pygeoip on Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/320410 (https://phabricator.wikimedia.org/T150003) [15:36:23] (03CR) 10Dereckson: [C: 031] "Asia/Dhaka is a valid timezone" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320397 (https://phabricator.wikimedia.org/T150252) (owner: 10MarcoAurelio) [15:36:58] * Dereckson will read full gerrit comments and not only irc first line to avoid to be an echo chamber [15:37:31] (03CR) 10Ottomata: [C: 031] Only install python-pygeoip on Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/320410 (https://phabricator.wikimedia.org/T150003) (owner: 10Muehlenhoff) [15:37:56] well so we checked two sources, iana and HHVM DateTimeZone::listIdentifiers() output [15:39:07] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#2779893 (10jcrespo) > I wonder if it'll be better to do this next quarter. 
I am ok with next quarter- let's set a time. I have workarounded the 5.5 s... [15:39:24] Dereckson: yeah I used hhvm on terbium to confirm [15:41:39] (03PS2) 10Andrew Bogott: Add some error handling to wikistatus, and make more thread-safe [puppet] - 10https://gerrit.wikimedia.org/r/320335 [15:41:50] (03CR) 10Muehlenhoff: [C: 032] Only install python-pygeoip on Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/320410 (https://phabricator.wikimedia.org/T150003) (owner: 10Muehlenhoff) [15:42:22] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/288629 (owner: 10Hashar) [15:42:38] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/288620 (owner: 10Hashar) [15:43:46] RECOVERY - puppet last run on notebook1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:44:36] RECOVERY - puppet last run on notebook1002 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [15:45:07] (03PS3) 10Andrew Bogott: Add some error handling to wikistatus, and make more thread-safe [puppet] - 10https://gerrit.wikimedia.org/r/320335 [15:47:51] (03PS1) 10Jcrespo: mariadb: Pool db1052 to help with the extra api load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320414 [15:47:59] (03PS4) 10Giuseppe Lavagetto: Generalize entities definitions [software/conftool] - 10https://gerrit.wikimedia.org/r/288609 [15:48:09] (03CR) 10Andrew Bogott: [C: 032] Add some error handling to wikistatus, and make more thread-safe [puppet] - 10https://gerrit.wikimedia.org/r/320335 (owner: 10Andrew Bogott) [15:48:11] (03CR) 10Hashar: [C: 031] "I have changed the Jenkins job to one that does a full clone ( https://gerrit.wikimedia.org/r/320407 ) and this way HEAD^ is valid :}" [puppet] - 10https://gerrit.wikimedia.org/r/288629 (owner: 10Hashar) [15:48:24] (03CR) 10Hashar: [C: 031] "I have changed the Jenkins job to one that does a full clone ( https://gerrit.wikimedia.org/r/320407 ) and this way HEAD^ is valid :}" [puppet] - 10https://gerrit.wikimedia.org/r/288620 (owner: 10Hashar) [15:55:04] (03CR) 10Marostegui: [C: 031] mariadb: Pool db1052 to help with the extra api load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320414 (owner: 10Jcrespo) [15:56:20] (03CR) 10Jcrespo: [C: 032] mariadb: Pool db1052 to help with the extra api load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320414 (owner: 10Jcrespo) [15:56:55] ottomata: I'll try to spend my 10% time this Friday on jmxtrans packaging, this has been stalled for far too long! [15:57:12] (03PS1) 10Faidon Liambotis: smokeping: monitor more hosts, at least 1 per row [puppet] - 10https://gerrit.wikimedia.org/r/320416 [15:57:30] 06Operations, 13Patch-For-Review: Not all packages from packages::statistics are available on jessie - https://phabricator.wikimedia.org/T150003#2779921 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff Done, puppet runs on notebook* restored. [15:59:53] (03CR) 10Faidon Liambotis: [C: 032] smokeping: monitor more hosts, at least 1 per row [puppet] - 10https://gerrit.wikimedia.org/r/320416 (owner: 10Faidon Liambotis) [16:00:37] gehel: :) [16:01:11] gehel: beware though, its possible prometheus jmx exporter may take over our jmxtrans use case in the future, not sure [16:01:52] ottomata: that would probably not be a bad solution! 
But a better packaging of jmxtrans is needed in any case [16:02:00] aye [16:02:32] ottomata: I still don't know much about prometheus, but a jmxtrans prometheus writer might make sense [16:03:02] yeah it probably would [16:03:12] that would make transitioning either. but it hink prometheus is pull based [16:03:22] not totally sure ^ cc godog [16:03:26] easier** [16:04:34] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Pool db1052; depool db1080; reorganize trafic weight for s1 (duration: 00m 46s) [16:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:47] hashar, there was 4 undeployed changes on tin [16:05:00] jynus: none by me at least [16:05:19] or did I screwed up the rebase I made earlier bah :( [16:05:19] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#2779979 (10yuvipanda) Ok. Early January? [16:05:27] 2 were mine [16:05:57] maybe 4 if it has into account the merge and revert, I think it doesn't [16:06:27] I have commented on all the wmf-config patch that they should be deployed later [16:06:46] at least couple are for beta cluster so I guess they will land at some point out of the swat slots [16:06:49] I think, if this patch is ok [16:06:49] ottomata: damn, I forgot how ugly that part of jmxtrans is... It does look like "allowDottedKeys" should do what you need in the latest jmxtrans... [16:07:02] we can go back to normal [16:07:07] naw, gehel, in latest, it looks like typeNames are already delimited by '.' [16:07:08] by default [16:07:13] but we'll see [16:07:29] allowDottedKeys would just keep '.' in key parts from jmx to not be replaced [16:07:43] afaict [16:08:54] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#2779984 (10jcrespo) January ok, but after the 15th. 
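A minimal Python sketch (not jmxtrans code) of the key-naming difference discussed in the jmxtrans thread above: the typeName values are taken from the example key pasted earlier, and the join logic is only an illustration of why '.'-joined values are easier to query in Graphite/Grafana than '_'-joined ones.

```python
# Illustration only (not jmxtrans itself): '.'-joined typeName values become
# separate Graphite path segments, while '_'-joined ones collapse into one.
type_names = {
    "name": "MessagesPerSec",
    "clientId": "kafka-mirror-main-analytics_to_analytics-0",
    "topic": "test_otto2",
}
order = ["name", "clientId", "topic"]

old_style = "_".join(type_names[k] for k in order)   # one giant segment
new_style = ".".join(type_names[k] for k in order)   # one segment per value

prefix = "kafka701_analytics_eqiad_wmflabs_9997.kafka.consumer.ConsumerTopicMetrics"
print(f"{prefix}.{old_style}.FifteenMinuteRate")
print(f"{prefix}.{new_style}.FifteenMinuteRate")
# With the '.' form, a Grafana/Graphite query can wildcard a single piece,
# e.g. ...ConsumerTopicMetrics.MessagesPerSec.*.test_otto2.FifteenMinuteRate
```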
[16:09:07] ottomata: only if "allowDottedKeys": https://github.com/jmxtrans/jmxtrans/blob/master/jmxtrans-core/src/main/java/com/googlecode/jmxtrans/model/Query.java#L244-L254 [16:11:57] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Pool db1052; depool db1080; reorganize trafic weight for s1 -second try (duration: 00m 46s) [16:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:44] OHHH huh [16:13:21] ok thanks gehel [16:18:04] !log upgrading cache_text codfw to varnish 4.1.3-1wm3 T150247 [16:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:10] T150247: Varnish4 is unexpectedly retrying certain applayer failure cases - https://phabricator.wikimedia.org/T150247 [16:21:08] !log mwscript --deleteEqualMessages.php --wiki kkwiki (T45917) [16:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:14] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [16:24:21] !log performing schema change templatelinks on db1080 T139090 [16:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:29] T139090: Deploy I2b042685 to all databases - https://phabricator.wikimedia.org/T139090 [16:26:23] so it seems there's no EuroSWAT today [16:26:32] we are serving >23K queries per second on 2 servers without appreciable latency impact [16:28:16] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3124: Connection refused [16:28:16] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3121: Connection refused [16:28:16] PROBLEM - Varnish HTTP text-backend - port 3128 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3128: Connection refused [16:28:16] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3127: Connection refused [16:28:26] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3123: Connection refused [16:28:26] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3125: Connection refused [16:28:26] that's me, host depooled ^ [16:28:56] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3122: Connection refused [16:28:56] PROBLEM - Varnish HTTP text-frontend - port 80 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 80: Connection refused [16:29:06] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 49 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[generate varnish.pyconf] [16:29:06] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3120: Connection refused [16:29:06] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp2013 is CRITICAL: connect to address 10.192.32.112 and port 3126: Connection refused [16:30:44] jouncebot: refresh [16:30:47] I refreshed my knowledge about deployments. [16:30:55] jouncebot: next [16:30:55] In 0 hour(s) and 29 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161108T1700) [16:31:10] so, do I reboot a few more hosts? 
yeah wth [16:31:16] might as well sneak them in now [16:31:55] !log rolling restart of mw1204-1208 for new kernel [16:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:56] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 320 bytes in 0.074 second response time [16:32:56] RECOVERY - Varnish HTTP text-frontend - port 80 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 320 bytes in 0.072 second response time [16:32:56] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [16:33:06] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.073 second response time [16:33:06] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.072 second response time [16:33:16] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.072 second response time [16:33:16] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.073 second response time [16:33:16] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.072 second response time [16:33:16] RECOVERY - Varnish HTTP text-backend - port 3128 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 178 bytes in 0.072 second response time [16:33:27] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.072 second response time [16:33:27] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.074 second response time [16:35:54] 06Operations, 10ops-eqiad, 10DBA: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2780028 (10Cmjohnson) Replaced the disk at slot 4 [16:39:32] (03PS6) 10Elukey: First Docker prototype [software/varnish/varnishkafka/testing] - 10https://gerrit.wikimedia.org/r/319548 (https://phabricator.wikimedia.org/T147442) [16:54:29] (03PS3) 10Dzahn: base/ipmi: install freeipmi globally, move to ipmi module [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) [16:57:49] 06Operations, 13Patch-For-Review: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2780080 (10faidon) [16:57:56] (03CR) 10jenkins-bot: [V: 04-1] Enable cluster-wide import setup in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258943 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO) [16:59:38] (03PS4) 10Dzahn: base/ipmi: install freeipmi globally, move to ipmi module [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) [17:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161108T1700). [17:00:04] yurik: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[17:00:14] yep [17:00:21] (03CR) 10jenkins-bot: [V: 04-1] Enable VisualEditor by default for all users of the Russian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292748 (https://phabricator.wikimedia.org/T136995) (owner: 10Jforrester) [17:00:29] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Let Wikidata editors edit at a higher rate than on other wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280003 (owner: 10Jforrester) [17:00:30] is everything back to normal in prod? [17:00:33] (03CR) 10jenkins-bot: [V: 04-1] Disable wgIncludeLegacyJavaScript on all sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277823 (owner: 10Jforrester) [17:00:37] (03CR) 10jenkins-bot: [V: 04-1] Increase default thumbnail display size from 220px to 300px [mediawiki-config] - 10https://gerrit.wikimedia.org/r/154408 (https://bugzilla.wikimedia.org/67709) (owner: 10Jforrester) [17:00:40] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Make VisualEditor access RESTbase directly on private wikis too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200107 (owner: 10Jforrester) [17:00:44] (03CR) 10jenkins-bot: [V: 04-1] Set wgSemiprotectedRestrictionLevels for de.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282471 (https://phabricator.wikimedia.org/T132249) (owner: 10Dereckson) [17:00:47] (03CR) 10jenkins-bot: [V: 04-1] VisualEditor: Enabled for logged-out users on the English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242042 (https://phabricator.wikimedia.org/T90662) (owner: 10Jforrester) [17:01:00] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Add composer test for coding standards and try to pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271936 (owner: 10Jforrester) [17:01:04] (03CR) 10jenkins-bot: [V: 04-1] Revert "Revert "Move sourceswiki special.dblist->wikisource.dblist"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227738 (owner: 10Alex Monk) [17:01:08] (03CR) 10jenkins-bot: [V: 04-1] Make MediaWiki treat $lang of be_x_oldwiki as be-tarask, just don't change the real DB name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236966 (https://phabricator.wikimedia.org/T111853) (owner: 10Alex Monk) [17:01:11] (03CR) 10jenkins-bot: [V: 04-1] Add "composer test" command to lint files and run tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/189148 (https://phabricator.wikimedia.org/T85947) (owner: 10Legoktm) [17:01:52] yurik: normal enough [17:02:28] hm, ok, i will push labs config change [17:02:59] (03CR) 10Yurik: [C: 032] LABS: added beta.wmflabs.org to graphs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320344 (owner: 10Yurik) [17:03:46] PROBLEM - NTP on mw1205 is CRITICAL: NTP CRITICAL: Offset unknown [17:03:47] 06Operations, 10ops-codfw, 10fundraising-tech-ops: payments2002 disk failure - https://phabricator.wikimedia.org/T149646#2780082 (10Papaul) Talked with HP again today . I will be receiving a 1TB 7.2k 3.5" disk by tomorrow. According to them they no longer carry 500GB 7.2K 3.5" disks [17:04:16] ntp is me, I guess I will kick it in a few minutes if it doesn't recover [17:04:45] 319539 can also be pushed as it also only affects beta [17:05:17] I was gonna push a couple of beta only patches too... [17:05:36] RECOVERY - MariaDB Slave Lag: s4 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89930.90 seconds [17:05:36] does this mean a sync around the cluster? 
[17:05:44] -labs.php files [17:05:46] RECOVERY - NTP on mw1205 is OK: NTP OK: Offset 0.0002099573612 secs [17:06:15] and there's the ntp recovery, thank you [17:06:36] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [17:06:40] oh come on [17:06:52] Reedy: how much work would it be to rename '-labs' to -beta? [17:07:10] yuvipanda: probably not much [17:07:24] Maybe a bit of downtime in beta, but minimal [17:07:26] Reedy: can I bribe you into it? [17:07:34] What're you bribing me with? [17:07:45] * Reedy grins [17:08:12] beer [17:08:31] (03PS2) 10Reedy: Remove OATHAuth from CommonSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320334 [17:08:33] (03CR) 10Reedy: [C: 032] Remove OATHAuth from CommonSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320334 (owner: 10Reedy) [17:08:36] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [17:09:11] (03Merged) 10jenkins-bot: Remove OATHAuth from CommonSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320334 (owner: 10Reedy) [17:09:44] (03PS3) 10Reedy: Add PageViewInfo to beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320333 (https://phabricator.wikimedia.org/T129602) [17:09:48] (03CR) 10Reedy: [C: 032] Add PageViewInfo to beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320333 (https://phabricator.wikimedia.org/T129602) (owner: 10Reedy) [17:09:48] so no oath will be enabled for evveryone? [17:09:52] No [17:10:03] The config for beta was a duplicate of productions [17:10:14] So it was pointless [17:10:21] (03Merged) 10jenkins-bot: Add PageViewInfo to beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320333 (https://phabricator.wikimedia.org/T129602) (owner: 10Reedy) [17:10:32] so its inherited from prod? [17:10:35] No [17:10:40] It uses prods config [17:10:42] always has [17:10:55] thats what i meant tho [17:12:10] !log reedy@tin Synchronized wmf-config/extension-list-labs: Add PageViewInfo (duration: 00m 46s) [17:12:12] next step would be testing oath in labs for normal wikis? [17:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:06] Reedy, oh, are you syncing? [17:13:09] !log reedy@tin Synchronized wmf-config/InitialiseSettings-labs.php: Add PageViewInfo (duration: 00m 46s) [17:13:10] yuvipanda: yeah [17:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:27] commonsettings-labs now [17:13:39] Reedy, could you also do my patch plz? i just+2ed [17:13:53] https://gerrit.wikimedia.org/r/#/c/320344/ [17:13:59] 06Operations, 10ops-codfw, 10DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2780119 (10Papaul) @Marostegui yes Thursday 10:00 am works for me. 
[17:14:08] yurik: gerrit says cannot merge [17:15:00] and https://gerrit.wikimedia.org/r/#/c/319539/ [17:15:07] oh, it's ontop of addshore [17:15:16] !log reedy@tin Synchronized wmf-config/CommonSettings-labs.php: Add PageViewInfo, Remove dupe OATHAuth config (duration: 00m 47s) [17:15:17] O_o [17:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:27] addshore: https://gerrit.wikimedia.org/r/#/c/319539/2 [17:15:29] You're the worst [17:15:32] (03PS3) 10Reedy: Enable RevisionSlider (non beta feature) on beta sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319539 (https://phabricator.wikimedia.org/T149725) (owner: 10Addshore) [17:15:34] blame hashar ! [17:15:38] (03CR) 10Reedy: [C: 032] Enable RevisionSlider (non beta feature) on beta sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319539 (https://phabricator.wikimedia.org/T149725) (owner: 10Addshore) [17:16:09] (03Merged) 10jenkins-bot: Enable RevisionSlider (non beta feature) on beta sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319539 (https://phabricator.wikimedia.org/T149725) (owner: 10Addshore) [17:16:24] (03PS3) 10Reedy: LABS: added beta.wmflabs.org to graphs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320344 (owner: 10Yurik) [17:16:28] (03PS4) 10Yurik: LABS: added beta.wmflabs.org to graphs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320344 [17:16:34] (03CR) 10Reedy: [C: 032] LABS: added beta.wmflabs.org to graphs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320344 (owner: 10Yurik) [17:16:58] Reedy, not sure what that was, but fixed [17:17:06] dependent patch it seems [17:17:18] (03PS5) 10Reedy: LABS: added beta.wmflabs.org to graphs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320344 (owner: 10Yurik) [17:17:26] (03CR) 10Reedy: [C: 032] LABS: added beta.wmflabs.org to graphs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320344 (owner: 10Yurik) [17:17:32] DAMN IT JENKINS [17:17:42] uh [17:17:44] GERRIT [17:18:06] (03Merged) 10jenkins-bot: LABS: added beta.wmflabs.org to graphs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320344 (owner: 10Yurik) [17:19:19] !log upgrade finished -> cache_text codfw to varnish 4.1.3-1wm3 T150247 [17:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:25] T150247: Varnish4 is unexpectedly retrying certain applayer failure cases - https://phabricator.wikimedia.org/T150247 [17:19:36] !log reedy@tin Synchronized wmf-config/InitialiseSettings-labs.php: Enable Revision Slider (duration: 00m 47s) [17:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:50] addshore care for your workboards of T149725 ;) [17:19:50] T149725: Enable RevisionSlider (non betafeature) on beta sites - https://phabricator.wikimedia.org/T149725 [17:20:17] (03CR) 10jenkins-bot: [V: 04-1] Remove MWVersion, fold its two functions into MWMultiVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309363 (owner: 10Chad) [17:20:25] hah arseny92 I will in due time! 
[17:20:58] !log reedy@tin Synchronized wmf-config/CommonSettings-labs.php: Graphs config (duration: 00m 47s) [17:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:22] (03PS1) 10Jcrespo: mariadb: repool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320422 (https://phabricator.wikimedia.org/T139090) [17:24:34] (03PS1) 10Jcrespo: Depool db1089 to safely apply pending schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320423 (https://phabricator.wikimedia.org/T139090) [17:25:38] (03PS1) 10Ottomata: Update jmxtrans for confluent kafka mirror maker [puppet] - 10https://gerrit.wikimedia.org/r/320424 (https://phabricator.wikimedia.org/T143320) [17:27:15] (03CR) 10jenkins-bot: [V: 04-1] Update jmxtrans for confluent kafka mirror maker [puppet] - 10https://gerrit.wikimedia.org/r/320424 (https://phabricator.wikimedia.org/T143320) (owner: 10Ottomata) [17:27:20] !log rolling restarts of mw1209 - mw1216 for new kernel [17:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:47] I should deploy 320343, yurik ? [17:27:49] jynus: ^^ fyi [17:28:13] (03PS2) 10Jcrespo: LABS: added beta.wmflabs.org to graphs config [puppet] - 10https://gerrit.wikimedia.org/r/320343 (owner: 10Yurik) [17:28:48] (03PS2) 10Ottomata: Update jmxtrans for confluent kafka mirror maker [puppet] - 10https://gerrit.wikimedia.org/r/320424 (https://phabricator.wikimedia.org/T143320) [17:29:55] (03PS1) 10Reedy: Rename -labs to -beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320425 (https://phabricator.wikimedia.org/T150268) [17:30:20] Reedy, it says that labs config should match that, I will deploy that [17:30:26] Thanks [17:30:27] (03CR) 10jenkins-bot: [V: 04-1] Rename -labs to -beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320425 (https://phabricator.wikimedia.org/T150268) (owner: 10Reedy) [17:30:56] I was more like asking a querstion (?) 
[17:31:12] I assume that is a yes [17:31:32] (03CR) 10Jcrespo: [C: 032] LABS: added beta.wmflabs.org to graphs config [puppet] - 10https://gerrit.wikimedia.org/r/320343 (owner: 10Yurik) [17:32:16] (03CR) 10Ottomata: [C: 032] Update jmxtrans for confluent kafka mirror maker [puppet] - 10https://gerrit.wikimedia.org/r/320424 (https://phabricator.wikimedia.org/T143320) (owner: 10Ottomata) [17:32:21] (03PS3) 10Ottomata: Update jmxtrans for confluent kafka mirror maker [puppet] - 10https://gerrit.wikimedia.org/r/320424 (https://phabricator.wikimedia.org/T143320) [17:32:23] (03CR) 10Ottomata: [V: 032] Update jmxtrans for confluent kafka mirror maker [puppet] - 10https://gerrit.wikimedia.org/r/320424 (https://phabricator.wikimedia.org/T143320) (owner: 10Ottomata) [17:32:38] (03PS2) 10Reedy: Rename -labs to -beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320425 (https://phabricator.wikimedia.org/T150268) [17:32:57] reedy, the symlinks in docroot need to also be updated as part of 320425 [17:33:04] addshore: I already have [17:33:08] ugh [17:33:11] arseny92: I already have [17:33:14] I just didn't run the script [17:35:46] (03PS1) 10Reedy: Fix Enable -> Use for PageViewInfo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320426 [17:35:48] * Reedy grumbles [17:36:03] (03CR) 10Reedy: [C: 032] Fix Enable -> Use for PageViewInfo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320426 (owner: 10Reedy) [17:36:20] (03CR) 10Jcrespo: [C: 032] mariadb: repool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320422 (https://phabricator.wikimedia.org/T139090) (owner: 10Jcrespo) [17:37:33] (03PS2) 10Reedy: Fix Enable -> Use for PageViewInfo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320426 [17:37:39] (03CR) 10Reedy: [C: 032] Fix Enable -> Use for PageViewInfo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320426 (owner: 10Reedy) [17:38:09] (03Merged) 10jenkins-bot: Fix Enable -> Use for PageViewInfo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320426 (owner: 10Reedy) [17:39:32] !log reedy@tin Synchronized wmf-config/InitialiseSettings-labs.php: Fix variable typo (duration: 00m 59s) [17:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:11] (03PS3) 10Reedy: Rename -labs to -beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320425 (https://phabricator.wikimedia.org/T150268) [17:40:38] yuvipanda: ^ Should be good to go... I'll just make sure I babysit that one to fix any other weird and wonderful [17:41:36] !log mwscript --deleteEqualMessages.php --wiki ptwikinews (T45917) [17:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:42] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [17:42:06] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [17:42:07] 06Operations, 10Traffic, 13Patch-For-Review: Varnish4 is unexpectedly retrying certain applayer failure cases - https://phabricator.wikimedia.org/T150247#2780189 (10BBlack) I've confirmed with codfw upgraded (the DC that's currently facing RestBase directly), this is fixed up even from ulsfo clients' perspec... [17:42:50] * mafk wonders @ Krinkle if the deleteEqualMessages could be cronned so he does not have to run it manually :D [17:43:02] and/or foreachwiki [17:43:22] mafk: I'm not running them multiple times for the same wiki. 
(not usually anyway) [17:43:34] foreachwiki == all wikis with just one command? [17:43:56] Reedy: I do it only when performing other clean up on the given wiki, usually related to broken and/or deprecated legacy javascript. [17:43:57] Krinkle, question- do those delete things from the database? e.g. l18n? [17:43:58] Krinkle: no complains! It's just that I feel bad for you having to run it for +700 wikis manually :D [17:44:24] Yeah, it will. Pages in the MediaWiki namespace jynus [17:44:41] how many on a large wiki? [17:44:45] a few or a lot? [17:44:48] Although the vast majority are not indexed by localisation cache [17:44:51] local messages identical to translatewiki.net translations [17:44:56] because they're unused. [17:45:06] I am asking because delete does not save space [17:45:07] sometimes they're active messages that are just upstreamed to translatewiki indeed [17:45:17] jynus: I understand. It's not to save space. [17:45:22] I know [17:45:29] but I want to! :-) [17:45:36] It's to avoid drift. If I upstream a message to translatewiki, and then another update happens there, the local override stays. [17:45:39] jynus: delete eswiki :P [17:45:55] Plus, MW doing stuff like getting contents of the pages [17:45:56] so I am asking to run some commands afterwards to retrieve space [17:46:31] if it is just a few messages, it is not worth it [17:46:34] jynus: It stops things like... "THE MESSAGES AREN'T UPDATING FROM TRANSLATEWIKI" [17:46:34] Where the answer is, they copied it locally [17:46:37] Also, it makes it very difficult to find anything when there are so many overrides. [17:46:37] jynus: Usually between 0 and 10 on most wikis. [17:46:39] so ofc it won't [17:46:42] The larger the wiki the less outdated message overrides. [17:46:48] Because they know how to do it properly [17:46:49] oh, so very low [17:46:55] :-) [17:47:03] I usually find 1 or 2 small wikis a year that have 100+ overrides from 2006 or something. [17:47:07] * mafk <3 deleteEqualMessages.php [17:47:36] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:47:38] 06Operations, 06Performance-Team, 10Thumbor: Investigate why oom_kill mtail program doesn't work properly - https://phabricator.wikimedia.org/T149980#2780195 (10fgiunchedi) mtail barfs to parse/convert a log line like in {P4391} and after that it looks like no more metrics are sent [17:47:46] jynus: retreiving the same is somewhat controversial, as it would erase people's contributions. [17:48:01] would certainly require community consultation. [17:48:12] oh, I wasn't proposing anything logically [17:48:14] deleted contributions are still assigned, and sysops may want to know where it went. [17:48:36] just if a row is deleted, know it so I can actually free space [17:48:46] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on mira is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging [17:48:47] delete here I mean physically deleted [17:48:52] Yeah :/ [17:48:56] not page-deleted [17:49:07] jynus: it saves space in localisation caches in memc and APC though [17:49:25] but if it is so low, it will be taken care during normal maintenance [17:49:26] and it saves slave queries [17:49:31] (03CR) 10Arseny1992: Rename -labs to -beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320425 (https://phabricator.wikimedia.org/T150268) (owner: 10Reedy) [17:49:33] since CDB/JSON is faster. 
[17:49:42] but that's not the primary objective, but nice win [17:49:51] Krinkle: maybe there's a way to query the DB to see the wikis with most MediaWiki messages? [17:50:06] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [17:50:21] or limit it to small.dblist and medium.dblist [17:50:35] mafk: By default it doesnt delete. so you could run it foreachwiki and count the output [17:50:42] anyway, thanks for doing that; and for the JS/CSS updates too [17:50:55] aah [17:51:02] like -dry-run [17:51:02] I'm using a bot lately - https://tools.wmflabs.org/guc/?src=rc&user=Krinkle [17:51:07] (03CR) 10Reedy: Rename -labs to -beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320425 (https://phabricator.wikimedia.org/T150268) (owner: 10Reedy) [17:51:17] https://github.com/Krinkle/mw-tool-tourbot [17:51:46] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on mira is OK: Files ownership is ok. [17:52:37] growr [17:53:06] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [17:54:14] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1080 (duration: 02m 45s) [17:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:25] ssh: connect to host mw1214.eqiad.wmnet port 22: Connection timed out [17:55:47] it's already back jynus [17:55:52] yep [17:55:59] pulling from there manually [17:56:03] k [17:56:32] 1215 and 1216 are happening now, then a break to see what is going on for the next swat slot [17:56:45] I just want to log the errors, not complaining [17:56:49] okey dokey [17:57:27] in fact, I thank you for the reboots you are doing [17:58:36] Do we know anything about caching issues in the last hours? Co-worker says that users report that they have to reload wiki pages to see changes [17:58:48] (sorry for the stupid question but didn't follow the channel closely for the last hours) [17:58:53] * andre__ trying to get more info [17:59:37] doing reboots beats watching the paint grow [17:59:50] andre__: editor-users, logged-in? normal users probably wouldn't know of a change to reload for... [18:00:01] 06Operations, 10Traffic, 13Patch-For-Review: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187#2780213 (10Danielsberger) Here's a graph of the hit ratio for various cache sizes. {F4707605} It seems to me that this tells us: investing in a better admission policy pays of... [18:00:04] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161108T1800). Please do the needful. [18:00:20] no parsoid deploys today [18:00:35] 06Operations, 06Performance-Team, 10Thumbor: Investigate why oom_kill mtail program doesn't work properly - https://phabricator.wikimedia.org/T149980#2780214 (10fgiunchedi) Reported upstream as https://github.com/google/mtail/issues/50 [18:00:54] bblack: logged in. 
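On the foreachwiki idea above: deleteEqualMessages.php itself compares local pages against the translatewiki.net defaults, but a rough per-wiki count of local MediaWiki-namespace overrides can be pulled from the public API. A hedged sketch only — the wiki list is a hypothetical stand-in for a real dblist, and the count is just a proxy for what the maintenance script would report.

```python
# Rough sketch (not deleteEqualMessages.php): count pages in the MediaWiki:
# namespace per wiki via the public API, as a crude proxy for how many local
# overrides a wiki carries.
import requests

WIKIS = ["kk.wikipedia.org", "pt.wikinews.org"]  # hypothetical sample list

def count_mediawiki_ns_pages(host):
    total, cont = 0, {}
    while True:
        params = {
            "action": "query", "list": "allpages", "apnamespace": 8,
            "aplimit": "max", "format": "json", **cont,
        }
        data = requests.get(f"https://{host}/w/api.php", params=params).json()
        total += len(data["query"]["allpages"])
        cont = data.get("continue", {})
        if not cont:
            return total

for host in WIKIS:
    print(host, count_mediawiki_ns_pages(host))
```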
[18:01:19] (03PS2) 10Jcrespo: Depool db1089 to safely apply pending schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320423 (https://phabricator.wikimedia.org/T139090) [18:02:48] For instance, I have some feedback from yannf, about adding removing categories here https://commons.wikimedia.org/w/index.php?title=Commons_talk:Structured_data/Overview&curid=52515098&diff=212573620&oldid=212554957 [18:03:01] andre__: hi [18:03:17] (03CR) 10Jcrespo: [C: 032] Depool db1089 to safely apply pending schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320423 (https://phabricator.wikimedia.org/T139090) (owner: 10Jcrespo) [18:03:21] hi mafk [18:03:54] Trizek: (wondering if timestamps might be any helpful, like 14:37UTC for the fr.wp users on VP) [18:04:04] (03Merged) 10jenkins-bot: Depool db1089 to safely apply pending schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320423 (https://phabricator.wikimedia.org/T139090) (owner: 10Jcrespo) [18:04:10] Also https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Questions_techniques/semaine_45_2016#Echec de la purge, where a user needs to reload pages multiple times after changing content or categories. [18:04:16] 06Operations, 10ops-codfw, 10DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2780233 (10Marostegui) Great thank you! I will wait for you and once you are around I will shutdown the server then Thanks! [18:04:17] ^ bblack [18:05:09] That last example was reported today at 15:37 (CET) [18:06:21] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1089 to safely apply pending schema change (duration: 01m 02s) [18:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:36] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [18:07:06] 06Operations, 10Traffic, 13Patch-For-Review: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187#2780263 (10BBlack) @Danielsberger - Fascinating stuff, thanks so much for running all of this data! We're kind of swamped in various things right now (varnish4 transition and v... [18:07:36] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [18:07:48] Trizek, that would sound more like job queue than cache [18:09:36] !log rolling reboots of mw1170-1179 for new kernel [18:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:43] oh, content, too? [18:12:06] !log performing schema change templatelinks on db1089 T139090 [18:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:11] T139090: Deploy I2b042685 to all databases - https://phabricator.wikimedia.org/T139090 [18:13:00] So that's temporary, jynus? [18:13:43] I was checking the queue, there was an issue some ours ago that could cause temporary problems [18:13:56] if it is happening now, it is not temporary [18:15:02] the jobqueue could be backlogged, yes [18:15:24] as I remember, purging specifically has an output ratelimiter in place, so if inbound purging has surged, it can get queued up there [18:15:30] I just had a feedback from a user: that happens to him 2-3 hours ago. [18:15:53] the same user on https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Questions_techniques/semaine_45_2016#Echec ? 
[18:16:30] 2-3 hours ago is when there were issues on app/db servers [18:16:36] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [18:17:08] job queue could have gotten momentarily overrun: https://grafana.wikimedia.org/dashboard/db/job-queue-health?var-jobType=categoryMembershipChange&from=now-24h&to=now-5m [18:17:13] <_joe_> The jobqueue has some backlog but overall seems healthy [18:17:17] right, 15:37 CET == 14:37 UTC [18:17:31] <_joe_> but purges do not depend on the jobqueue? [18:17:34] at thet time, I mean [18:17:39] not purges [18:17:42] _joe_: they do, they're jobs [18:17:50] also category memebership udates [18:17:59] <_joe_> bblack: oh ok so it's totally possible [18:18:09] I'm asking bblack [18:18:28] and also, from back when we were making more effort to control the ridiculous purge rate, some kind of output limitation was put on jobqueue purge emissions [18:18:46] so that spikes are supposed to backlog in the jobqueue rather than flood the network [18:18:59] <_joe_> https://grafana.wikimedia.org/dashboard/db/job-queue-health tells you we do have some backlog in the jobqueue [18:19:04] (although in practice, our graphs at the caches say we still get crazy spikes all the time) [18:19:12] <_joe_> I can look into what is lagging behind [18:19:27] also it grew negatively between 15 and 18h [18:20:23] https://grafana.wikimedia.org/dashboard/db/job-queue-health?var-jobType=htmlCacheUpdate [18:20:31] I am not worried, I am just saying it could create some disruption [18:20:32] ^ htmlCacheUpdate seems to be the bulk [18:20:49] <_joe_> cdnPurge: 2 queued; 0 claimed (0 active, 0 abandoned); 11 delayed on frwiki [18:20:57] <_joe_> at the moment [18:21:01] oh, the graph is confusing [18:21:10] the top numbers don't reflect the type-selection :P [18:21:13] (at the time) [18:21:27] cdnPurge is different. I think that's later in the pipeline (it can be backlogged as htmlCacheUpdate too, which generates cdnPurge) [18:22:12] 06Operations, 06Performance-Team, 10Thumbor: Investigate why oom_kill mtail program doesn't work properly - https://phabricator.wikimedia.org/T149980#2780319 (10Gilles) So, sorting through the information in that past, the actual line is: ``` Nov 8 04:32:57 Nov 8 04:32:57 mw2213 kernel [1614712.454448]... [18:22:54] <_joe_> bblack: cdnPurge wait times are much more reasonable [18:22:57] <_joe_> https://grafana.wikimedia.org/dashboard/db/job-queue-health?var-jobType=cdnPurge [18:23:40] yeah but I think htmlCacheUpdate generates cdnPurge [18:23:48] meaning if the former is backlogged, it's a virtual backlog for the latter [18:24:05] something happened Nov 3 [18:24:28] avg from 300ms to current 2 seconds [18:25:37] the 50th percentile for the past week is 10 minutes on htmlCacheUpdate, and it's hours at worse pecentiles [18:26:16] oh current, not past week [18:26:21] it seems it is just a weekly pattern [18:28:47] andre__: wrt acl*: yes, but then we'd need someone able to add that group to the "Visible To" field of the relevant task, which requires somebody from Security, acl*phabricator or Policy-Admins I think (although I'm lost with this new policies/spaces/groups) [18:28:55] (03PS1) 10Jcrespo: Revert "Depool db1089 to safely apply pending schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320432 [18:29:15] anyway, seems too much work [18:30:28] mafk: no [18:30:36] (and offtopic here) [18:30:53] Noone would need to edit a "Visible To" field. 
[18:30:57] (03CR) 10Jcrespo: [C: 04-2] "Waiting for the schema change to finish." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320432 (owner: 10Jcrespo) [18:31:47] 06Operations, 05Prometheus-metrics-monitoring: prometheus-node-exporter package should use a systemd override - https://phabricator.wikimedia.org/T149992#2780341 (10fgiunchedi) p:05Triage>03Normal a:03fgiunchedi [18:36:10] (03PS1) 10Filippo Giunchedi: prometheus: use systemd override for node-exporter [puppet] - 10https://gerrit.wikimedia.org/r/320434 (https://phabricator.wikimedia.org/T149992) [18:39:12] 06Operations, 06Performance-Team, 10Thumbor: Investigate why oom_kill mtail program doesn't work properly - https://phabricator.wikimedia.org/T149980#2780371 (10fgiunchedi) That's correct, though it seems to try and convert `"kernel: [1614712.454448] EDAC MC1: 1 CE"` to an int and fail :( [18:40:06] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [18:42:06] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [18:48:38] (03PS3) 10DCausse: [cirrus] Increase the number of shards to 15 for commonswiki_file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316964 (https://phabricator.wikimedia.org/T148736) [18:51:26] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:57:51] (03PS1) 10Yurik: LABS: fixed incorrect $wgGraphAllowedDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320439 [18:58:42] Reedy, still doing depls? I messed up my labs config patch - https://gerrit.wikimedia.org/r/#/c/320439/ [18:59:19] andre__, pretty sure they would need to edit Visible To to add an extra group [18:59:32] !log rolling reboots of mw1180-1188 for new kernel [18:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161108T1900). [19:00:19] I think the same as Krenair but I'm not sure :) [19:00:28] are we SWATting now? [19:00:34] or it's gonna be halted? [19:00:38] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2780484 (10GWicke) Thank you, @KartikMistry! [19:00:53] jouncebot: now [19:00:54] For the next 0 hour(s) and 59 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161108T1900) [19:01:14] no problems on ops end [19:01:29] I have a patch for SWAT from the missed EUSWAT [19:01:29] ah, was just about to ask :) [19:01:32] I can SWAT [19:01:35] let me relist [19:01:41] note that due to the rolling reboots I just logged, you may wind up doing a manual pull on some mw host [19:01:42] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 12 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2780487 (10Gehel) [19:01:45] yurik , jit for swat :d [19:02:20] I am not the swat ops babysitter btw, just happen to still be around [19:02:40] is that really what you call it [19:02:42] babysitting? [19:02:42] Is there an ops babysitter? /me didn't know [19:02:45] addshore: you've got some patches for EUSWAT, want them rolled now? 
[19:03:05] dunno which swats have one and which don't [19:03:21] but I sort of thought there was always someone available just in case [19:03:31] probably, but 'babysitting'? [19:03:32] it's a good idea, indeed. [19:03:32] mafk: mine is already in! [19:03:43] babysitting whatever changes get deployed [19:03:47] abuse :P [19:03:49] yeah that's usually how I call it [19:04:20] thcipriani: should be on Wikitech:Deployments now [19:04:24] Krenair: one day we'll be all grown up with our unix beards and everything [19:04:30] (03PS2) 10Thcipriani: Set timezone for bdwikimedia to 'Asia/Dhaka' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320397 (https://phabricator.wikimedia.org/T150252) (owner: 10MarcoAurelio) [19:04:35] service: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161108T1900 [19:04:45] mafk: cool, thanks :) [19:04:45] no no not the deployers, the patches get babysitted. the patches don't get to grow up and get beards (I hope!) [19:04:53] hahah [19:04:58] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320397 (https://phabricator.wikimedia.org/T150252) (owner: 10MarcoAurelio) [19:05:06] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 12 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2780491 (10GWicke) @MoritzMuehlenhoff, it looks like we'll be ready fairly soon. I know that you are planning to be out of the office soon-ish. *Iff* you find some time to... [19:05:13] well, some patches are going to grow dust and fungi [19:05:14] (03CR) 10Jcrespo: [C: 031] Revert "Depool db1089 to safely apply pending schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320432 (owner: 10Jcrespo) [19:05:34] mmm code rot [19:05:38] (03Merged) 10jenkins-bot: Set timezone for bdwikimedia to 'Asia/Dhaka' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320397 (https://phabricator.wikimedia.org/T150252) (owner: 10MarcoAurelio) [19:06:29] ok :) [19:07:20] mafk: change is live on mw1099, check please [19:07:40] ack, check incoming [19:07:59] (03PS1) 10BryanDavis: logstash: dynamically rename object values [puppet] - 10https://gerrit.wikimedia.org/r/320441 (https://phabricator.wikimedia.org/T150106) [19:09:16] (03CR) 10BryanDavis: [C: 04-1] "Completely untested at this point. Will need to be rolled out in beta cluster ELK install first and some tests done to see if this is a ge" [puppet] - 10https://gerrit.wikimedia.org/r/320441 (https://phabricator.wikimedia.org/T150106) (owner: 10BryanDavis) [19:10:00] thcipriani: I've checked via API action=query meta=siteinfo for the timezone and I see it changed on mw1099 [19:10:35] "timezone": "Asia/Dhaka", [19:10:51] mafk: ok, going live everywhere [19:11:13] :) [19:11:59] thcipriani, i just added 320439 to swat [19:12:03] minor labs config change [19:12:09] yurik: ok [19:12:45] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:320397|Set timezone for bdwikimedia to "Asia/Dhaka" (T150252)]] (duration: 00m 47s) [19:12:51] ^ mafk live everywhere [19:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:52] T150252: Change the timezone of WMBD chapter wiki - https://phabricator.wikimedia.org/T150252 [19:13:22] Thanks! 
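A minimal sketch of the siteinfo check described above (action=query meta=siteinfo): it hits the normal public endpoint rather than pinning the request to mw1099 the way a deployer would, and it assumes the WMBD chapter wiki is served at bd.wikimedia.org.

```python
# Ask the API which timezone the wiki reports after the config change above.
import requests

resp = requests.get(
    "https://bd.wikimedia.org/w/api.php",
    params={"action": "query", "meta": "siteinfo",
            "siprop": "general", "format": "json"},
)
general = resp.json()["query"]["general"]
print(general["timezone"], general.get("timeoffset"))
# Expected after the deploy above: "Asia/Dhaka"
```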
[19:14:12] (03PS2) 10Thcipriani: LABS: fixed incorrect $wgGraphAllowedDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320439 (owner: 10Yurik) [19:14:18] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320439 (owner: 10Yurik) [19:14:20] (03CR) 10Reedy: Rename -labs to -beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320425 (https://phabricator.wikimedia.org/T150268) (owner: 10Reedy) [19:14:58] (03Merged) 10jenkins-bot: LABS: fixed incorrect $wgGraphAllowedDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320439 (owner: 10Yurik) [19:15:16] ^ yurik that should go out on beta with the next beta-code-update-eqiad/beta-scap-eqiad cycle [19:15:26] thcipriani, thx :) [19:15:38] (will also sync now for housekeeping purposes) [19:17:52] Re: that question of users complaining about refreshing - the last user has issues adding categories at the same moment as others. That's a queue issue. [19:19:56] !log thcipriani@tin Synchronized wmf-config/CommonSettings-labs.php: SWAT: [[gerrit:320439|LABS: fixed incorrect $wgGraphAllowedDomains]] (housekeeping sync) (duration: 02m 42s) [19:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:26] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [19:22:36] !log T133395: Converting 25 additional RESTBase tables to TWCS [19:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:42] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [19:25:32] (03PS2) 10Jcrespo: Revert "Depool db1089 to safely apply pending schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320432 [19:26:43] 3 more reboots to go, then done with the mws [19:26:44] (03CR) 10Jcrespo: [C: 032] Revert "Depool db1089 to safely apply pending schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320432 (owner: 10Jcrespo) [19:27:25] (03Merged) 10jenkins-bot: Revert "Depool db1089 to safely apply pending schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320432 (owner: 10Jcrespo) [19:28:45] (03PS3) 10DCausse: [cirrus] Activate BM25 on top 10 wikis: Step 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318356 (https://phabricator.wikimedia.org/T147508) [19:29:16] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1089 (duration: 00m 48s) [19:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:37] did someone recently tinker with dns? [19:37:03] dns changes on a daily basis arseny92 [19:37:09] why do you ask? [19:37:21] several times a day usually [19:37:24] i can't access phab [19:38:25] ok [19:38:39] can you ping it? [19:39:08] ping says phabricator.wikimedia.org at 2620:0:862:ed1a::3:d and 91.198.174.217 [19:39:42] that's right [19:39:53] can you ping those ips? [19:41:10] yes, but the browser refuses to resolve it [19:42:24] hm still 43 mws unaccounted for [19:43:02] falling back to page can't be displayed, and f12 console says the error page for res://dnserror.htm is loaded [19:43:46] didn't change any network stuff tbh, it just suddenly stopped working [19:48:26] sometimes it loads but with a broken ui as if some stuff is missing, and content is stuck on Loading...
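Relevant to the phabricator.wikimedia.org troubleshooting above — a quick sketch of how to see what the OS resolver returns, independent of any browser cache or proxy settings; the expected addresses are the ones from the ping output quoted earlier.

```python
# Check what the system resolver returns for phabricator.wikimedia.org,
# bypassing whatever the browser is doing.
import socket

for family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(
        "phabricator.wikimedia.org", 443, proto=socket.IPPROTO_TCP):
    print("IPv6" if family == socket.AF_INET6 else "IPv4", sockaddr[0])
# Expected (per the ping output above): 91.198.174.217 and 2620:0:862:ed1a::3:d
```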
[19:49:22] guess I'll wait on more reboots til after the train rolls [19:51:31] Krenair [19:52:48] jouncebot: next [19:52:48] In 0 hour(s) and 7 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161108T2000) [19:52:49] windows? [19:53:01] yes [19:53:14] which browser? [19:54:24] now it loaded half of page without css , after a refresh again fell to dnserror [19:54:38] ie [19:54:40] is there by any chance an ori? [19:55:34] and it's not like i'd be having network issues otherwise i'd be timed out from irc [20:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161108T2000). [20:02:46] PROBLEM - puppet last run on rcs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:04:00] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2780700 (10GWicke) > It seems to me like the goal is to simplify what's currently available, with no plan to add new features The design is providing... [20:04:37] works again now but is quite slow [20:06:36] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [20:08:36] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [20:09:27] !log change-prop deploying 0c29003 [20:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:46] twentyafterfour: is the train rolling or do I have liesure time for some app server reboots? [20:11:52] leisure time [20:14:49] choo choo [20:15:16] was just getting ready to queue up another 5 reboots [20:22:37] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2780822 (10Gilles) No examples of what rotation, cropping or effects would look like are provided. No complex examples of current capabilities are pro... [20:30:28] apergos: you have some time [20:30:38] eta, twentyafterfour? [20:30:38] I'm still branching [20:30:46] I'd say 20 minutes? [20:30:49] ok [20:30:55] I'll do a few and check in again [20:30:56] thanks much [20:31:03] no problem! [20:31:23] this branching thing will be much faster in the near future [20:31:32] !log rolling restarts of mw1218-1222 for new kernel [20:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:46] RECOVERY - puppet last run on rcs1002 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [20:32:50] hopefully today, when twentyafterfour runs the script that rolls the new mw out to the wiki's grrrit-wm wont crash [20:32:55] and instead works :) [20:33:27] :-) [20:33:56] in my testing it should all work, the fix was done by twentyafterfour :) [20:38:59] 06Operations, 10Traffic, 13Patch-For-Review: Varnish4 is unexpectedly retrying certain applayer failure cases - https://phabricator.wikimedia.org/T150247#2780894 (10GWicke) FYI, we are also bumping up the HTTP socket timeout in hyperswitch from 2 to 6 minutes: https://github.com/wikimedia/hyperswitch/pull/70... 
[20:43:16] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 619 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3081022 keys, up 8 days 12 hours - replication_delay is 619 [20:51:31] train is going out as usual today? there were some issues earlier? [20:51:36] train is on [20:54:06] those reboots are done, there's about 40 left but they are codfw and can wait til tomorrow [20:55:41] !log upload prometheus-memcached-exporter 0.3.0+ds1-1 to carbon - T147326 [20:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:48] T147326: Port memcached statistics from ganglia to prometheus - https://phabricator.wikimedia.org/T147326 [20:56:24] Does the mw api forbid multiple concurrent sessions with the same user? I'm seeing a lot of Exception Caught: CAS update failed on user_touched for user ID '55' (read from replica); the version of the user to be saved is older than the current version. [20:56:41] (which is discussed here https://phabricator.wikimedia.org/T95839 but I don't understand yet) [20:58:02] no [20:59:16] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3061525 keys, up 8 days 12 hours - replication_delay is 0 [21:01:34] Krenair: ok, so it's a bug in MW then I guess… looks like activity on that issue died out a year ago or so [21:01:50] I guess I can catch that exception, count to 10, and try again :/ [21:03:29] (03PS1) 10Gergő Tisza: Make beta PageViewInfo use the production pageview API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320450 (https://phabricator.wikimedia.org/T129602) [21:05:36] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [21:07:03] allah is doing [21:07:11] sun is not doing allah is doing [21:10:36] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [21:18:51] andrewbogott, exceptions like that are always mw bugs, yes [21:18:54] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2781024 (10GWicke) @gilles, there are several examples illustrating page and media thumbnails, as well as the (orthogonal) selection of size and thumb... [21:21:09] !log RESTBase update to 1d72b8abc - staging [21:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:21] !log RESTBase update to 1d72b8abc - canary on restbase1007 [21:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:18] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2781089 (10Gilles) There is no example URL of the orthogonal case. Bandwidth is mentioned, but no example of its use is provided and even the path spec... 
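A sketch of the "catch it, wait, try again" approach mentioned above for the intermittent CAS-update/login failures; the wikistatus patches that follow take a similar retry-with-delay shape. The function and exception names here are hypothetical — only the retry structure is the point.

```python
# Hypothetical retry wrapper for a transient wikitech API failure.
import time

class TransientApiError(Exception):
    """Stand-in for the CAS-update / login failure being retried."""

def with_retries(action, attempts=3, delay=10):
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except TransientApiError:
            if attempt == attempts:
                raise
            time.sleep(delay)  # give replica / session state time to catch up

# Usage (hypothetical): with_retries(lambda: wikitech_login(user, password))
```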
[21:39:48] !log gallium, ex-CI server, shutdown -h now (the contents of your home dir have been copied to contint1001 in /home/gallium-home/)
[21:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:27] yay
[21:40:37] !log gallium, ex-CI server, shutdown -h now (the contents of your home dir have been copied to contint1001 in /home/gallium-home/) (T95757)
[21:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:42] T95757: Phase out gallium.wikimedia.org - https://phabricator.wikimedia.org/T95757
[21:40:44] forgot the ticket
[21:44:17] (03PS1) 10Andrew Bogott: wikistatus: work around occasional wikitech login failures [puppet] - 10https://gerrit.wikimedia.org/r/320482 (https://phabricator.wikimedia.org/T95839)
[21:44:45] scapping...
[21:44:57] !log twentyafterfour@tin Started scap: testwikis to 1.29.0-wmf.2
[21:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:23] (03PS2) 10Andrew Bogott: wikistatus: work around occasional wikitech login failures [puppet] - 10https://gerrit.wikimedia.org/r/320482 (https://phabricator.wikimedia.org/T95839)
[21:56:18] 06Operations, 06Security-Team: icinga notification if elevated writing to badpass.log - https://phabricator.wikimedia.org/T150300#2781237 (10Reedy)
[21:56:32] !log RESTBase update to 1d72b8abc
[21:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:01:11] lol, I guess it's not fixed
[22:01:32] Reedy you mean the bot?
[22:01:37] yup
[22:01:39] Nope, i restarted it
[22:01:46] Since i am deploying three changes
[22:01:47] https://gerrit.wikimedia.org/r/#/c/320214/ & https://gerrit.wikimedia.org/r/#/c/320419/ and https://gerrit.wikimedia.org/r/320480
[22:01:56] !log rolling reboot of mc2* for kernel update
[22:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:44] (03CR) 10Andrew Bogott: [C: 032] wikistatus: work around occasional wikitech login failures [puppet] - 10https://gerrit.wikimedia.org/r/320482 (https://phabricator.wikimedia.org/T95839) (owner: 10Andrew Bogott)
[22:04:45] yes sorry Reedy i meant to inform you all about the deployment of changes then i got caught up in 3 different conversations in 3 different chans
[22:05:18] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2781285 (10GWicke) > the description mentions the response containing JSON This is about other APIs, such as pageinfo & other references to thumbnails...
[22:05:36] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[22:10:36] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[22:11:37] 06Operations, 10Cassandra, 06Services (doing): Upload cassandra-tools-wmf Debian package to apt.w.o - https://phabricator.wikimedia.org/T150304#2781311 (10Eevans)
[22:22:02] !log rebooting ms1001 for kernel update
[22:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:25:01] (03PS1) 10Andrew Bogott: wikistatus: fewer login tries with a longer delay between [puppet] - 10https://gerrit.wikimedia.org/r/320527
[22:28:18] (03PS1) 10Ppchelko: RESTBase config: Use special project for wikidata domains. [puppet] - 10https://gerrit.wikimedia.org/r/320529
[22:28:21] (03PS2) 10Dzahn: remove gallium.wikimedia.org, keep gallium.mgmt [dns] - 10https://gerrit.wikimedia.org/r/318250 (https://phabricator.wikimedia.org/T95757)
[22:31:36] mutante: are there any gallium-related patches you want me to review? :)
[22:31:51] but I guess that is the last one \o//
[22:32:24] hashar: yes, thanks, that's the last one i have, i think we're good
[22:33:12] \o/
[22:34:30] !log twentyafterfour@tin Finished scap: testwikis to 1.29.0-wmf.2 (duration: 49m 32s)
[22:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:35:05] 49m, i think that is the longest i've seen a scap take so far (okay, resuming my own things)
[22:35:40] you're obviously new around here then
[22:35:41] * Reedy grins
[22:36:37] yeah, no kidding :)
[22:36:58] i also don't tend to look at the durations :P
[22:37:01] 40 was quick, and 60 normal for a long time
[22:37:11] I think we had some 90 minute or more too
[22:37:16] yeah, definitely
[22:37:25] I seem to recall a 120+ one too
[22:37:45] maybe? do we have that number in graphite still? hmmmm
[22:38:00] i've been in here on and off since i first started dev on mw-core, and after a while of mw-core dev and earlier tools-lab dev i stayed in here ~24/7
[22:38:08] Has twentyafterfour pushed the patch yet to update the wikis?
[22:38:17] Want to see if the bot will tell us he did the patch
[22:38:20] i think so?
[22:38:24] paladox: not yet
[22:38:29] Ok thanks
[22:38:30] :)
[22:38:30] (03CR) 10Chad: "Thanks for the useless info Jenkins *facepalm*" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309363 (owner: 10Chad)
[22:38:37] I am doing that in about < 2 minutes
[22:38:42] Oh :)
[22:38:47] Zppix: I'm talking anything back... potentially 5 or 6 years
[22:39:22] i wasn't even on enwiki 5-6 years ago, at least as a user; i occasionally looked through the pages on it, that's all
[22:39:29] (03PS1) 1020after4: group0 wikis to 1.29.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320533
[22:39:31] (03CR) 1020after4: [C: 032] group0 wikis to 1.29.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320533 (owner: 1020after4)
[22:39:39] ^ paladox
[22:39:42] paladox: looks like it worked :)
[22:39:49] Yay
[22:39:51] :)
[22:39:56] thanks
[22:40:06] (03Merged) 10jenkins-bot: group0 wikis to 1.29.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320533 (owner: 1020after4)
[22:40:29] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 wikis to 1.29.0-wmf.2
[22:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:33] (03CR) 1020after4: "chad: It's fairly procedural. It would be good to incorporate it into a multi-step process for managing everything but the process could f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312013 (https://phabricator.wikimedia.org/T118478) (owner: 1020after4)
[22:48:39] (03CR) 1020after4: [C: 031] "by the way, I tested this today for applying patches to wmf.2 and it works flawlessly." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312013 (https://phabricator.wikimedia.org/T118478) (owner: 1020after4)
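
For context on the 22:39-22:40 messages: the group0 change only edits the mapping of wiki dbnames to MediaWiki branches; scap then rebuilds wikiversions.php from that mapping and syncs it to the app servers (the "rebuilt wikiversions.php and synchronized wikiversions files" entry above). A rough sketch of the idea, using illustrative dbnames rather than the real wikiversions file:

    # Illustrative only: the real mapping lives in wikiversions.json in
    # operations/mediawiki-config; the dbnames and grouping here are examples.
    def bump_group(wikiversions, wikis, version):
        """Return a copy of the dbname -> branch map with the given wikis moved."""
        updated = dict(wikiversions)
        for dbname in wikis:
            updated[dbname] = "php-" + version
        return updated

    versions = {"testwiki": "php-1.29.0-wmf.1", "enwiki": "php-1.29.0-wmf.1"}
    versions = bump_group(versions, ["testwiki"], "1.29.0-wmf.2")
    # scap compiles the resulting map into wikiversions.php and pushes it out,
    # which is why only the group0 wikis pick up wmf.2 at this point.

Because only the mapping changes, rolling back is the same operation in reverse: point the affected dbnames back at the previous branch and sync again.
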
[22:57:16] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic: CentralNotice: Review and update Varnish caching for Special:BannerLoader - https://phabricator.wikimedia.org/T149873#2781465 (10ggellerman) p:05Triage>03Normal
[22:57:27] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2781464 (10Anomie) >>! In T66214#2781285, @GWicke wrote: > 2) Validate parameters in the backing service strictly, so that each unique thumb can only b...
[23:06:36] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[23:08:36] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[23:10:10] (03CR) 10Hashar: "recheck" [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/319776 (owner: 10BBlack)
[23:13:37] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2781476 (10GWicke) @anomie, adding parameters without changing the semantics of existing parameters won't break any existing clients, as none of those...
[23:14:26] PROBLEM - puppet last run on elastic1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:36:47] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2781560 (10Anomie) Say we didn't already have a parameter to select the page of a PDF. Then we add the parameter. Either the parameter needs to be opti...
[23:42:26] RECOVERY - puppet last run on elastic1034 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
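
To make the strictness point being argued at 22:57:27 and 23:13:37 concrete: the idea, as I read GWicke's comment, is that the thumbnail service rejects or normalizes any non-canonical parameter combination, so each distinct thumb is reachable through exactly one URL and therefore occupies exactly one cache entry. A toy canonicalizer along those lines, with invented parameter names (this is not the proposed API):

    # Toy sketch with made-up parameters; only the normalization idea matters.
    DEFAULTS = {"page": 1, "lang": None}

    def canonical_thumb_params(width, page=1, lang=None):
        """Strip default-valued parameters so equivalent requests share one key."""
        if width <= 0:
            raise ValueError("width must be positive")
        params = {"width": width, "page": page, "lang": lang}
        # Dropping defaults means "?page=1" and the bare URL cannot address
        # the same thumbnail under two different cache keys.
        return {k: v for k, v in params.items() if DEFAULTS.get(k, object()) != v}

    assert canonical_thumb_params(320) == canonical_thumb_params(320, page=1)

Normalizing defaults away also bears on the later back-and-forth about adding an optional PDF page parameter: if a new optional parameter defaults to the old behaviour and is stripped when at its default, existing URLs stay unchanged.
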