[00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160122T0000). Please do the needful.
[00:00:04] urandom Dereckson jamesofur: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:14] hi
[00:00:35] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: configure cirrus completion suggester recycling (duration: 01m 28s)
[00:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:00:45] legoktm: not yesterday, but he noticed today
[00:00:58] sounds like ebernhardson is running swat today
[00:01:15] Hello.
[00:01:18] i'm just finishing up syncing out a config patch, but i guess i can do the others too
[00:01:21] o7
[00:01:27] I'm reviewing Jamesofur's WMMessages patch
[00:01:52] because he's awesome, obviously
[00:02:05] wmf2020 is complaining about remote host identification has changed from tin
[00:02:08] during sync-file
[00:02:19] i'm guessing that's expected, reimaged or something?
[00:02:24] so... https://gerrit.wikimedia.org/r/#/c/264947/
[00:02:29] * urandom is available
[00:02:40] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-production.php: configure cirrus completion suggester recycling (duration: 01m 29s)
[00:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:03:51] ebernhardson, was known earlier
[00:04:13] Krenair: then why isn't it fixed :P
[00:04:20] (CR) EBernhardson: [C: 2] Santiago Editatón throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/265576 (https://phabricator.wikimedia.org/T124284) (owner: Dereckson)
[00:04:22] ask ops, not me
[00:04:50] (CR) EBernhardson: [C: 2] Don't send messages to autocreated accounts on meta [mediawiki-config] - https://gerrit.wikimedia.org/r/265626 (https://phabricator.wikimedia.org/T122441) (owner: Dereckson)
[00:04:52] when I go to mw2020, I just get a password prompt
[00:05:02] (Merged) jenkins-bot: Santiago Editatón throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/265576 (https://phabricator.wikimedia.org/T124284) (owner: Dereckson)
[00:05:03] I don't have a password set.
[00:05:24] (CR) EBernhardson: [C: 2] Enable SandboxLink on lad.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/265629 (https://phabricator.wikimedia.org/T121524) (owner: Dereckson)
[00:05:32] (Merged) jenkins-bot: Don't send messages to autocreated accounts on meta [mediawiki-config] - https://gerrit.wikimedia.org/r/265626 (https://phabricator.wikimedia.org/T122441) (owner: Dereckson)
[00:05:40] Dereckson: yours are all going out here in a moment
[00:05:47] Ready to test them.
[00:05:50] operations: labtestservices2001.wikimedia.org.crt - https://phabricator.wikimedia.org/T124374#1954144 (Krenair)
[00:06:09] (Merged) jenkins-bot: Enable SandboxLink on lad.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/265629 (https://phabricator.wikimedia.org/T121524) (owner: Dereckson)
[00:08:06] * ebernhardson loves waiting for the timeout on the last server :P
[00:08:08] !log ebernhardson@tin Synchronized wmf-config/throttle.php: Santiago Editatón throttle rule (duration: 01m 27s)
[00:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:08:37] ebernhardson: You're gonna love this more: https://gerrit.wikimedia.org/r/#/c/265658/2 needs a scap :/
[00:08:46] * RoanKattouw will cherry-pick it to wmf* once it merges
[00:09:19] ummm
[00:09:24] sorry, but can we not?
[00:09:29] we're still trying to debug the UBN
[00:09:41] yeah, no scaps right now
[00:10:06] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: enable sandboxlink on ladwiki and dont sent messages to autocreated accounts on metawiki (duration: 01m 27s)
[00:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:10:13] Ugh, OK
[00:10:17] :-/
[00:10:30] Is there a way to deploy an i18n-altering commit without scap?
[00:10:30] Also, which UBN?
[00:10:42] https://phabricator.wikimedia.org/T124356
[00:10:57] o.O
[00:11:13] Dereckson: it's all out now
[00:11:14] 265629 tested.
[00:11:24] urandom: around?
[00:11:29] ebernhardson: i am
[00:12:18] wfDebug is only logged on test and test2 right?
[00:12:20] (CR) EBernhardson: [C: 2] enable EventBus extension on mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/265553 (https://phabricator.wikimedia.org/T116786) (owner: Eevans)
[00:12:30] legoktm: iirc, yes
[00:12:44] i looked into it a few weeks ago for something else and determined it wasn't for most wikis at least
[00:12:54] hmm, would be nice if we could just do that for all mw1017 requests...
[00:13:06] (Merged) jenkins-bot: enable EventBus extension on mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/265553 (https://phabricator.wikimedia.org/T116786) (owner: Eevans)
[00:14:39] operations: labtestservices2001.wikimedia.org.crt - https://phabricator.wikimedia.org/T124374#1954204 (Andrew) a: MoritzMuehlenhoff And by 'help' I mean, either tell me how to make my own, or just make one and drop it in the repo. Thanks!
[00:14:44] ebernhardson: I'm going to live hack some stuff in includes/parser on mw1017
[00:14:53] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: enable EventBus extension on mediawikiwiki (duration: 01m 27s)
[00:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:15:03] legoktm: should be safe enough, no scap's going to change anything atm :)
[00:15:10] ebernhardson: Okay. So 265629 is tested fine. For 265626, I'll monitor how things evolve on meta., there aren't talk pages created right now by NewUserMessage, so it seems okay. The other isn't testable.
[00:15:18] Thank you for the deployment.
[00:15:43] in the future though, perhaps gethostname() could be checked in CommonSettings.php or somewhere, and set that up?
[00:15:56] Dereckson: perfect, thanks
[00:16:02] urandom: yours is deployed, please check
[00:16:38] (CR) EBernhardson: [C: 2] Add ability for OfficeWiki sysops to add and remove flood group rights from themselves. [mediawiki-config] - https://gerrit.wikimedia.org/r/264930 (https://phabricator.wikimedia.org/T86237) (owner: Varnent)
[00:16:53] Jamesofur: yours will be going out in a moment
[00:16:59] Jamesofur: only the officewiki thing
[00:17:01] thx
[00:17:12] ebernhardson: yup, i'm on it
[00:17:19] (Merged) jenkins-bot: Add ability for OfficeWiki sysops to add and remove flood group rights from themselves. [mediawiki-config] - https://gerrit.wikimedia.org/r/264930 (https://phabricator.wikimedia.org/T86237) (owner: Varnent)
[00:18:46] I see it on the special page, having someone test
[00:19:17] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Add ability for OfficeWiki sysops to add and remove flood group rights from themselves. (duration: 01m 27s)
[00:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:19:53] ebernhardson: everything looks good; thanks!
[00:20:10] PROBLEM - puppet last run on mw1046 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:21:58] ebernhardson: that looks good
[00:22:04] ok that should be everyone, swat is completed
[00:24:32] Jamesofur: I lined up a cherry-pick for you for future use: https://gerrit.wikimedia.org/r/#/c/265661/
[00:24:49] once greg-g et al decide it's safe to push out an i18n change
[00:24:59] Thanks much
[00:25:31] yeah, it's just a really touchy/annoying issue we're (lego's) trying to diagnose
[00:25:56] Sure thing
[00:26:14] greg-g: yeah, understand, if at all possible I'd like to try and get it out before Monday (realizing that that was the last SWAT of the week :-/ ) but I don't want to hurt more things in the process obviously
[00:26:15] Jamesofur: Only thing is I have to leave around 6 to catch a flight so you'll have to find someone else to deploy it for you unless this whole thing clears up in the next hour or so
[00:26:24] * Jamesofur nods
[00:26:26] thanks RoanKattouw
[00:26:33] no worries, we'll figure something out
[00:27:48] ah, I see what it is, yeah, once we get this sorted we can push that out, tomorrow if need be
[00:27:52] cc Jamesofur
[00:28:00] Thanks much
[00:38:43] operations, ops-codfw: mw2039 fails to reboot, mgmt interface unreachable - https://phabricator.wikimedia.org/T124282#1954300 (Papaul) I am getting a memory error on this system that is causing the system after reboot to hang on press F1 to continue or F2 to run system setup error:Unsupported configurat...
[00:42:08] (PS1) Dereckson: Enable SandboxLink on nl.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/265666 (https://phabricator.wikimedia.org/T124167)
[00:43:35] anyone know if there is something on grafana that can tell me what the database slave lags were at certain time?
[00:43:40] not sure what to look for
[00:43:52] or graphite
[00:44:28] aude: dunno, relatedly, did you know of #wikimedia-databases ?
[00:44:54] aude: tendril has some historic graphs...
[00:45:09] (PS3) Nuria: Removing code that generates pageviews using legacy definition [puppet] - https://gerrit.wikimedia.org/r/265656 (https://phabricator.wikimedia.org/T124244)
[00:45:16] greg-g: no :/
[00:45:19] legoktm: thanks
[00:45:38] aude: not meaning to say "go there!" just mostly an FYI
[00:46:11] RECOVERY - puppet last run on mw1046 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[00:46:14] ok
[00:46:20] * aude looking at https://phabricator.wikimedia.org/T47839
[00:46:34] and suspect maybe just slave lag was high at the moment
[00:48:56] greg-g: uh, I'm giving up.
[00:49:01] I have no idea what's wrong
[00:49:43] legoktm: can you summarize what you tried/ruled out on the task, please
[00:49:53] thanks for digging :/
[00:50:59] yeah, will do
[00:51:43] legoktm: can james/roan do a scap now, or will that potentially mess up anything else you've tried/hurt future investigation?
[00:51:58] it should be fine
[00:52:17] a scap mainly would have overwritten my local hacks
[00:52:25] * greg-g nods
[00:52:36] RoanKattouw: Jamesofur if y'all wanna do that change now, you can
[00:52:43] \o/
[00:52:48] RoanKattouw: still have time?
[00:53:14] Yeah let's do it
[00:54:44] RoanKattouw: IF you're gonna scap...
[00:55:06] https://gerrit.wikimedia.org/r/#/c/265149/ and https://gerrit.wikimedia.org/r/#/c/265147/
[00:55:11] Want to fix some broken aliases too?
[00:55:30] 149 and https://gerrit.wikimedia.org/r/265150 even
[00:56:55] ah, .10 is disabled
[00:57:05] So only https://gerrit.wikimedia.org/r/#/c/265149/
[00:58:38] Sure
[01:02:43] Thanks :)
[01:05:39] greg-g: heh, while writing it up I think I figured it out
[01:07:10] yay!
[01:07:16] rubber ducky solution!
[01:07:59] https://phabricator.wikimedia.org/T124356#1954372
[01:13:12] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[01:13:22] !log catrope@tin Started scap: Deploying OATHAuth and WikimediaMessages i18n changes
[01:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:13:30] OK, starting scap
[01:16:46] PROBLEM - MariaDB disk space on silver is CRITICAL: DISK CRITICAL - free space: / 526 MB (5% inode=80%)
[01:17:12] \o/
[01:17:41] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:21:42] andrewbogott: ^^
[01:33:00] Ugh, I just got a key verification error for mw2020
[01:33:11] RoanKattouw: good news I currently see the spot for the link on the meta footer though it doesn't have the messages yet (so is showing as < > and without the link) does that just come in later during the scap?
[01:33:12] RoanKattouw: known
[01:33:12] Known issue
[01:33:30] Jamesofur: It's at 61%
[01:33:34] * Jamesofur nods
[01:33:41] * Jamesofur patiently waits
[01:33:45] That said, the failure mode you're seeing usually shouldn't happen, but let's wait till it finishes
[01:34:07] Didn't I just see something else about rolling back to .10?
[01:34:19] [01:21:33] MediaWiki-Interface, MobileFrontend, MW-1.27-release, Regression: Mobile(?) content of the page within Vector UI unexpectedly randomly delivered on 1.27.0-wmf11 - https://phabricator.wikimedia.org/T124356#1954408 (tstarling) Can we roll back to 1.27.0-wmf.10?
[01:35:40] I'm going afk for a bit, I'll start the next set of CentralAuth scripts in a few hours
[01:37:14] !log restbase cassandra: increased compression chunk size from 256 to 512k on wikimedia and wikipedia html and data-parsoid
[01:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:39:34] Also, ssh: connect to host mw2039.codfw.wmnet port 22: Connection timed out
[01:41:14] That one has a memory error
[01:41:17] should be depooled tbh
[01:41:24] https://phabricator.wikimedia.org/T124282#1954300
[01:41:34] RoanKattouw: I do now see it with the correct messages
[01:42:30] operations, ops-codfw: mw2039 fails to reboot, mgmt interface unreachable - https://phabricator.wikimedia.org/T124282#1954457 (greg) Please depool this from the app server pools, it's reporting errors to all deployers.
[01:42:57] operations, ops-codfw: mw2039 fails to reboot, mgmt interface unreachable - https://phabricator.wikimedia.org/T124282#1954458 (greg) (Hit submit too soon) See https://tools.wmflabs.org/sal/log/AVJliDzE_u6lr-TPEinw
[01:44:15] !log catrope@tin Finished scap: Deploying OATHAuth and WikimediaMessages i18n changes (duration: 30m 52s)
[01:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:44:37] (PS1) Reedy: memory error on mw2039 [puppet] - https://gerrit.wikimedia.org/r/265675 (https://phabricator.wikimedia.org/T124282)
[01:44:40] RoanKattouw: as far as I can tell fully verified, THANK YOU!
[01:44:45] ^ that's to remove it from the dsh list
[01:44:48] Let me know if you need specific types of chocolates etc :)
[01:45:31] (CR) jenkins-bot: [V: -1] memory error on mw2039 [puppet] - https://gerrit.wikimedia.org/r/265675 (https://phabricator.wikimedia.org/T124282) (owner: Reedy)
[01:46:01] can you also kick mw2020 out while you're at it Reedy?
[01:46:02] (CR) Reedy: "recheck" [puppet] - https://gerrit.wikimedia.org/r/265675 (https://phabricator.wikimedia.org/T124282) (owner: Reedy)
[01:46:12] it just prompts for password
[01:46:21] Does that have a ticket?
[01:46:54] no, I'll make one
[01:47:14] (PS1) Reedy: mw2020 is prompting for a password [puppet] - https://gerrit.wikimedia.org/r/265677
[01:48:14] (CR) jenkins-bot: [V: -1] mw2020 is prompting for a password [puppet] - https://gerrit.wikimedia.org/r/265677 (owner: Reedy)
[01:48:49] operations, ops-codfw: mw2020 ssh prompts for password - https://phabricator.wikimedia.org/T124380#1954477 (Krenair) NEW
[01:49:19] (PS2) Reedy: mw2020 is prompting for a password [puppet] - https://gerrit.wikimedia.org/r/265677 (https://phabricator.wikimedia.org/T124380)
[01:49:42] operations, ops-codfw: mw2020 ssh prompts for password - https://phabricator.wikimedia.org/T124380#1954486 (Reedy) https://gerrit.wikimedia.org/r/265677
[01:51:01] was just looking at my puppet contribs
[01:51:10] #29 Krenair
[01:51:14] 96 commits / 1,586 ++ / 19,302 --
[01:53:54] Krenair: well done
[01:54:21] How have you net removed nearly 18k lines? lol
[01:54:39] No idea
[01:54:49] But I like that I've removed so many more times than added
[02:00:37] alex@alex-laptop:~/Development/Wikimedia/Operations-Puppet (fix-beta-apaches)$ git log --author="Alex Monk" --stat | grep "\| *[0-9\-]{4,} " -E .../files/havana/nova/virt-libvirt-driver | 4891 ------------------
[02:00:37] .../files/icehouse/nova/virt-libvirt-driver | 5328 --------------------
[02:00:38] .../icehouse/ceilometer/ceilometer.conf.erb | 1015 ----
[02:01:06] heh
[02:01:07] https://github.com/wikimedia/operations-puppet/commit/f99185a65c9bc64cefdb8633ac54b079b3e264ae
[02:01:13] nearly 250 in that
[02:02:06] so that leaves like 8k lines deleted
[02:02:35] still multiple times lines added
[02:09:47] so what's the status now on .11 ?
[02:10:02] is it still everywhere?
[02:10:13] nowhere?
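[Editor's note: Krenair's `git log --stat | grep` one-liner above eyeballs the big per-file deletions; the added/removed totals can also be summed directly with `git log --numstat`. A minimal sketch against a throwaway demo repo — the author name, file, and line counts below are invented for the demo, not the real operations/puppet figures:]

```shell
# Sketch: sum lines added/removed by one author with git log --numstat.
# The repo, author, and file here are invented for illustration only.
set -e
repo="/tmp/demo-numstat-$$"
git init -q "$repo"
cd "$repo"
git config user.name "Alex Monk"
git config user.email "alex@example.org"
printf 'a\nb\nc\n' > file.txt
git add file.txt
git commit -qm "add three lines"
printf 'a\n' > file.txt
git add file.txt
git commit -qm "remove two lines"
# --numstat emits "added<TAB>removed<TAB>path" per changed file;
# an empty --pretty format suppresses the commit headers.
summary=$(git log --author="Alex Monk" --numstat --pretty=tformat: \
  | awk '{added += $1; removed += $2} END {printf "%d ++ / %d --", added, removed}')
echo "$summary"   # prints: 3 ++ / 2 --
```

[Unlike `--stat`, whose column widths vary, `--numstat` is tab-separated and stable, so it is the safer form to feed into awk.]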
[02:10:40] oh, everywhere indeed
[02:10:44] well enwiki has it, so I guess yes it's everywhere still :)
[02:10:59] Currently active MediaWiki versions: 1.27.0-wmf.11
[02:11:01] PROBLEM - puppet last run on wtp2011 is CRITICAL: CRITICAL: puppet fail
[02:30:09] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.11) (duration: 09m 31s)
[02:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:36:51] RECOVERY - puppet last run on wtp2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:40:32] !log tstarling@tin Synchronized php-1.27.0-wmf.11/includes/OutputPage.php: (no message) (duration: 01m 32s)
[02:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:42:31] !log tstarling@tin Synchronized php-1.27.0-wmf.11/includes/parser/ParserCache.php: (no message) (duration: 01m 28s)
[02:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:02:24] RECOVERY - MariaDB disk space on silver is OK: DISK OK
[03:51:01] PROBLEM - Disk space on fluorine is CRITICAL: DISK CRITICAL - free space: /a 136555 MB (3% inode=99%)
[04:05:56] -rw-r--r-- 1 udp2log udp2log 102G Jan 22 04:05 api.log
[04:05:58] -rw-r--r-- 1 udp2log udp2log 101G Jan 22 04:05 CirrusSearchRequests.log
[04:06:03] -rw-r--r-- 1 udp2log udp2log 102G Jan 22 04:05 DBPerformance.log
[04:08:33] DBPerformance is mostly "Expectation (masterConns <= 0) by MediaWiki::main not met:" followed by wfBacktrace output including CentralAuthHooks::onSessionCheckInfo() -> CentralAuthUser->renameInProgress() -> ... -> CentralAuthUtils::getCentralDB()
[04:09:30] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 10.71% of data above the critical threshold [100000000.0]
[04:11:30] operations, Phabricator, Release-Engineering-Team, Traffic, Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1954591 (Volker_E) +1 WFM too! Awesome, thanks all people involved! > time GIT_SSH_COMMAND="ssh -v" git clone ssh://vcs@git-ssh.wikimedia.org/...
[04:15:07] krenair@fluorine:/a/mw-log$ tail -f DBPerformance.log | grep "DBPerformance INFO" | grep MediaWiki::main | pv -lri5 > /dev/null
[04:15:07] [2.15k/s]
[04:52:30] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[05:07:31] !log tstarling@tin Synchronized php-1.27.0-wmf.11/includes/parser/ParserCache.php: (no message) (duration: 01m 28s)
[06:09:21] operations, Analytics-Kanban, HTTPS, Patch-For-Review: EventLogging sees too few distinct client IPs {oryx} [8 pts] - https://phabricator.wikimedia.org/T119144#1954687 (leila) @Ottomata I checked couple of tables that I knew and the diversity of hashed IPs looks healthy. I also looked at two of Tilma...
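[Editor's note: the `pv -lri5` pipeline Krenair ran above meters the stream rather than counting it: `-l` puts pv in line mode, `-r` shows the current transfer rate, and `-i5` refreshes the display every 5 seconds, which is how the `[2.15k/s]` figure was obtained. A small offline sketch of just the filtering step, against an invented sample file (the log lines below are made up to resemble the DBPerformance entries quoted earlier):]

```shell
# Invented sample resembling fluorine's DBPerformance.log entries.
log="/tmp/DBPerformance.sample.$$"
cat > "$log" <<'EOF'
DBPerformance INFO: Expectation (masterConns <= 0) by MediaWiki::main not met
DBPerformance WARNING: unrelated entry
DBPerformance INFO: Expectation (masterConns <= 0) by MediaWiki::main not met
DBPerformance INFO: some other expectation by JobRunner not met
EOF
# Live version (meters instead of counts; requires pv installed):
#   tail -f DBPerformance.log | grep "DBPerformance INFO" | grep MediaWiki::main | pv -lri5 > /dev/null
matches=$(grep "DBPerformance INFO" "$log" | grep -c "MediaWiki::main")
echo "$matches matching lines"   # prints: 2 matching lines
rm -f "$log"
```

[Redirecting pv's stdout to /dev/null, as Krenair did, keeps only the rate display on stderr, which is all you want when sizing a runaway log.]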
[06:15:16] (CR) Subramanya Sastry: "recheck" [puppet] - https://gerrit.wikimedia.org/r/265628 (owner: Subramanya Sastry)
[06:27:30] PROBLEM - puppet last run on mc2004 is CRITICAL: CRITICAL: puppet fail
[06:30:53] (PS1) KartikMistry: WIP: cxserver: Enable new pairs for Yandex MT [puppet] - https://gerrit.wikimedia.org/r/265691 (https://phabricator.wikimedia.org/T121053)
[06:31:30] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:30] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:20] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:21] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[06:34:02] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[06:37:40] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:37:49] (CR) Luke081515: [C: 1] Enable SandboxLink on nl.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/265666 (https://phabricator.wikimedia.org/T124167) (owner: Dereckson)
[06:38:21] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:55:30] RECOVERY - puppet last run on mc2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:01] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:57:01] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:00] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:32:31] PROBLEM - Host mw2020 is DOWN: PING CRITICAL - Packet loss = 100%
[07:34:12] <_joe_> me ^^
[07:36:13] operations, ops-codfw, Patch-For-Review: mw2039 fails to reboot, mgmt interface unreachable - https://phabricator.wikimedia.org/T124282#1954805 (Joe) @greg I was actually waiting for Papaul to take a look before permanently depooling this server from rotation. We usually allow a 1-day period to trouble...
[07:36:20] RECOVERY - Host mw2020 is UP: PING OK - Packet loss = 0%, RTA = 37.15 ms
[07:41:34] operations, ops-codfw: mw2098 non-responsive to mgmt - https://phabricator.wikimedia.org/T85286#1954806 (Joe) @papaul thanks a lot! Should we schedule a general upgrade of the DRACs for those elder appservers maybe?
[07:52:05] (PS1) Giuseppe Lavagetto: dsh: remove mw2039 from the deployment dsh group temporarily [puppet] - https://gerrit.wikimedia.org/r/265693 (https://phabricator.wikimedia.org/T124282)
[07:53:08] (CR) jenkins-bot: [V: -1] dsh: remove mw2039 from the deployment dsh group temporarily [puppet] - https://gerrit.wikimedia.org/r/265693 (https://phabricator.wikimedia.org/T124282) (owner: Giuseppe Lavagetto)
[07:54:00] (CR) Giuseppe Lavagetto: [C: 2 V: 2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/265693 (https://phabricator.wikimedia.org/T124282) (owner: Giuseppe Lavagetto)
[07:56:32] Puppet, Release-Engineering-Team, Jenkins: Jenkins jobs for puppet failing for no good reason - https://phabricator.wikimedia.org/T124395#1954817 (Joe) NEW
[07:57:21] operations, ops-codfw: mw2020 ssh prompts for password - https://phabricator.wikimedia.org/T124380#1954825 (Joe) The problem was that the server was set up to reboot to PXE, so it has effectively been reimaged. I'm finalizing the installation now.
[07:57:27] operations, ops-codfw: mw2020 ssh prompts for password - https://phabricator.wikimedia.org/T124380#1954826 (Joe) Open>Resolved
[07:57:51] (CR) Giuseppe Lavagetto: [C: -2] "Server will be ready in a few" [puppet] - https://gerrit.wikimedia.org/r/265677 (https://phabricator.wikimedia.org/T124380) (owner: Reedy)
[07:58:40] RECOVERY - Apache HTTP on mw2020 is OK: HTTP OK: HTTP/1.1 200 OK - 11783 bytes in 0.075 second response time
[08:05:41] RECOVERY - RAID on mw2020 is OK: OK: no RAID installed
[08:06:01] RECOVERY - Disk space on mw2020 is OK: DISK OK
[08:06:10] RECOVERY - dhclient process on mw2020 is OK: PROCS OK: 0 processes with command name dhclient
[08:06:10] RECOVERY - DPKG on mw2020 is OK: All packages OK
[08:06:20] PROBLEM - puppet last run on mw1192 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:06:41] RECOVERY - nutcracker port on mw2020 is OK: TCP OK - 0.000 second response time on port 11212
[08:06:51] RECOVERY - nutcracker process on mw2020 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker
[08:07:01] RECOVERY - HHVM processes on mw2020 is OK: PROCS OK: 6 processes with command name hhvm
[08:07:41] RECOVERY - Check size of conntrack table on mw2020 is OK: OK: nf_conntrack is 0 % full
[08:07:41] RECOVERY - configured eth on mw2020 is OK: OK - interfaces up
[08:07:56] <_joe_> !log upgrading kernel on all mw hosts in eqiad
[08:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:09:11] PROBLEM - puppet last run on mw2020 is CRITICAL: CRITICAL: Puppet has 8 failures
[08:11:41] RECOVERY - salt-minion processes on mw2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:15:51] PROBLEM - puppet last run on mw1124 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:23:40] PROBLEM - salt-minion processes on mira is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:31:01] RECOVERY - puppet last run on mw2020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:31:02] PROBLEM - HHVM rendering on mw2020 is CRITICAL: Connection refused
[08:32:21] RECOVERY - puppet last run on mw1192 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[08:33:31] PROBLEM - Apache HTTP on mw2020 is CRITICAL: Connection refused
[08:35:41] RECOVERY - Apache HTTP on mw2020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 454 bytes in 0.644 second response time
[08:37:31] RECOVERY - HHVM rendering on mw2020 is OK: HTTP OK: HTTP/1.1 200 OK - 70390 bytes in 0.752 second response time
[08:41:30] PROBLEM - salt-minion processes on tin is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:42:00] RECOVERY - puppet last run on mw1124 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[08:46:40] <_joe_> !log rebooting mw1001 with a new kernel
[08:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:48:10] <_joe_> !log powercycling ms-be1002, blank console, down
[08:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:49:40] RECOVERY - salt-minion processes on mira is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:50:40] RECOVERY - Host ms-be1002 is UP: PING OK - Packet loss = 0%, RTA = 1.51 ms
[08:51:10] RECOVERY - very high load average likely xfs on ms-be1002 is OK: OK - load average: 10.43, 2.72, 0.92
[09:07:20] RECOVERY - salt-minion processes on tin is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:10:12] <_joe_> !log rolling restart of imagescalers in eqiad
[09:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:19:11] <_joe_> hashar: ops/puppet gets constant -1s from jenkins now
[09:19:21] _joe_: ooops
[09:19:21] <_joe_> I opened an UBN! ticket for you
[09:19:26] thank you
[09:19:31] <_joe_> yw :P
[09:19:35] had to take a nap this morning :(
[09:19:44] <_joe_> heh, I'd need one as well
[09:19:50] PROBLEM - Host mw1159 is DOWN: PING CRITICAL - Packet loss = 100%
[09:20:34] Puppet, Continuous-Integration-Config: Jenkins jobs for puppet failing for no good reason - https://phabricator.wikimedia.org/T124395#1954910 (hashar) a: hashar
[09:20:41] I did a sneak deploy yesterday
[09:20:44] must have broken something
[09:20:47] <_joe_> eheh
[09:21:01] RECOVERY - Host mw1159 is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms
[09:21:56] fatal: loose object 0a25119b9c0c2fb8705753a50cef3a576320549d (stored in .git/objects/0a/25119b9c0c2fb8705753a50cef3a576320549d) is corrupt
[09:22:03] error: object file .git/objects/0a/25119b9c0c2fb8705753a50cef3a576320549d is empty
[09:22:05] sounds bad :D
[09:22:19] <_joe_> sounds like a corrupted repo
[09:25:11] Puppet, Continuous-Integration-Config: Jenkins jobs for puppet failing for no good reason - https://phabricator.wikimedia.org/T124395#1954914 (hashar) That is the job https://integration.wikimedia.org/ci/job/operations-puppet-tox-py27/ which fails whenever it runs on the integration-slave-precise1011 . Th...
[09:25:43] _joe_: yeah definitely. The git repo for a specific job is corrupted on the slave-precise1011 slave
[09:25:44] cleaned it up
[09:29:31] Puppet, Continuous-Integration-Config: Jenkins jobs for puppet failing for no good reason - https://phabricator.wikimedia.org/T124395#1954916 (hashar) p: Unbreak!>Normal The object is a single file and is 0 size: ``` -r--r--r-- 1 jenkins-deploy wikidev 0 Jan 21 19:32 .git/objects/0a/25119b9c0c2fb8705...
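[Editor's note: the `loose object ... is corrupt` / `object file ... is empty` pair hashar debugged above is the classic zero-byte loose object, typically left behind when a process dies mid-write. A hedged sketch that reproduces the symptom in a throwaway repo and applies the same kind of cleanup; on the real CI slave the deleted objects would then be restored by a `git fetch` from the origin, which this self-contained demo omits:]

```shell
# Sketch: fabricate, detect, and clean up a zero-byte loose object.
# Everything runs in a throwaway repo; paths are illustrative only.
repo="/tmp/demo-corrupt-$$"
git init -q "$repo"
cd "$repo"
git config user.name "demo"
git config user.email "demo@example.org"
echo hello > f.txt
git add f.txt
git commit -qm "initial commit"
# Truncate one loose object to zero bytes, mimicking the crash damage.
obj=$(find .git/objects -type f | head -n 1)
: > "$obj"
# git fsck now fails, complaining the object file is empty/corrupt.
if git fsck >/dev/null 2>&1; then echo "fsck clean"; else echo "fsck reports corruption"; fi
# The cleanup hashar applied: remove the empty objects, then re-fetch
# from a remote (not shown here) so git can rebuild them.
find .git/objects -type f -size 0 -delete
empty_left=$(find .git/objects -type f -size 0 | wc -l)
echo "empty objects remaining: $empty_left"
```

[Deleting the empty file is safe because a zero-byte object carries no data to lose; git will happily re-download the real object from any remote that still has it.]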
[09:29:44] _joe_: fixed, thanks for the task/notification [09:31:03] <_joe_> yw [09:53:01] PROBLEM - HHVM rendering on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:55:01] RECOVERY - HHVM rendering on mw1158 is OK: HTTP OK: HTTP/1.1 200 OK - 70374 bytes in 0.224 second response time [10:06:41] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 454 bytes in 8.224 second response time [10:07:30] RECOVERY - HHVM rendering on mw1133 is OK: HTTP OK: HTTP/1.1 200 OK - 70429 bytes in 7.801 second response time [10:07:35] <_joe_> !log dropping api logs from 2015 on fluorine [10:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:08:20] PROBLEM - Host mw1259 is DOWN: PING CRITICAL - Packet loss = 100% [10:08:50] RECOVERY - Host mw1259 is UP: PING OK - Packet loss = 0%, RTA = 1.38 ms [10:12:00] PROBLEM - Host mw1260 is DOWN: PING CRITICAL - Packet loss = 100% [10:12:51] RECOVERY - Host mw1260 is UP: PING OK - Packet loss = 0%, RTA = 2.63 ms [10:25:55] (03PS1) 10BBlack: lvs: switch codfw backup LVS to use etcd for pybal config [puppet] - 10https://gerrit.wikimedia.org/r/265704 [10:25:57] (03PS1) 10BBlack: lvs: switch all of codfw to use etcd for pybal config [puppet] - 10https://gerrit.wikimedia.org/r/265705 [10:32:33] (03CR) 10Giuseppe Lavagetto: [C: 031] lvs: switch codfw backup LVS to use etcd for pybal config [puppet] - 10https://gerrit.wikimedia.org/r/265704 (owner: 10BBlack) [10:32:51] (03CR) 10Giuseppe Lavagetto: [C: 031] lvs: switch all of codfw to use etcd for pybal config [puppet] - 10https://gerrit.wikimedia.org/r/265705 (owner: 10BBlack) [10:34:53] <_joe_> !log rolling restart of all api appservers in eqiad [10:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:42:02] PROBLEM - HHVM rendering on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:42:27] (03CR) 10DCausse: [C: 031] "Awesome, I have only one concern. 
Since total_shards_per_node does not seem to work properly, on codfw total_shards_per_node is 1 and we s" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265372 (https://phabricator.wikimedia.org/T124215) (owner: 10EBernhardson) [10:44:00] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 70404 bytes in 1.223 second response time [10:45:00] PROBLEM - Host mw1136 is DOWN: PING CRITICAL - Packet loss = 100% [10:45:19] <_joe_> !log rolling restarting the API cluster in eqiad [10:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:46:21] RECOVERY - Host mw1136 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [10:46:23] (03PS2) 10Jcrespo: Depool pc1001 for maintenance (clone to pc1004) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265504 [10:46:54] (03CR) 10BBlack: [C: 032] lvs: switch codfw backup LVS to use etcd for pybal config [puppet] - 10https://gerrit.wikimedia.org/r/265704 (owner: 10BBlack) [10:48:08] (03PS1) 10Giuseppe Lavagetto: monolog: ignore udp2log flow for sessions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265706 [10:48:16] <_joe_> Reedy: ^^ [10:48:29] (03CR) 10jenkins-bot: [V: 04-1] monolog: ignore udp2log flow for sessions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265706 (owner: 10Giuseppe Lavagetto) [10:48:53] <_joe_> ugh, shit [10:49:02] wtf did you do [10:49:06] <_joe_> I *hate* editors [10:49:07] you commented out most of the file :P [10:49:30] <_joe_> yeah wtf [10:50:06] We could just raise that one line to warning per tgr [10:50:13] it's because he uses emacs :P [10:51:00] <_joe_> bblack: no it's because I use flycheck and thus a linter that can be instructed to auto-lint files [10:51:03] <_joe_> and screwed up [10:51:23] (03CR) 10Jcrespo: [C: 032] Depool pc1001 for maintenance (clone to pc1004) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265504 (owner: 10Jcrespo) [10:52:15] (03PS2) 10Giuseppe Lavagetto: monolog: ignore udp2log flow for 
sessions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265706 [10:53:09] <_joe_> ok this looks saner [10:54:11] :) [10:56:50] (03CR) 10BBlack: [C: 032] lvs: switch all of codfw to use etcd for pybal config [puppet] - 10https://gerrit.wikimedia.org/r/265705 (owner: 10BBlack) [10:57:23] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool pc1001 for maintenance (duration: 02m 48s) [10:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:58:07] <_joe_> jynus: ouch, I'm restarting api appservers now [10:58:13] <_joe_> should i stop? [10:58:16] no [10:58:36] it doesn't affect me [10:58:53] <_joe_> well, syncs might fail [10:58:58] lol [10:59:04] yes, but I can fix that [10:59:27] so another thing going on, apparently starting around 02:00 UTC on Jan 21, our volume of HTCP purges multiplied greatly [10:59:56] like, by a factor of 5 or so [11:00:12] I don't know if that's due to a code change, or someone kicked off some giant automated content-purge job, or? [11:00:43] I guess that's around an hour after https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160121T0000 [11:01:20] PROBLEM - Host mw1139 is DOWN: PING CRITICAL - Packet loss = 100% [11:01:24] to be fair, I do not think this deploy was very smooth [11:01:54] it's not the wmf11 one if that's what you mean [11:02:00] RECOVERY - Host mw1139 is UP: PING OK - Packet loss = 0%, RTA = 1.71 ms [11:02:06] it's a bit before all that [11:02:21] PROBLEM - Apache HTTP on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:02:45] https://logstash.wikimedia.org/#dashboard/temp/AVJo__J5ptxhN1Xa0m-Z [11:04:20] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 454 bytes in 0.279 second response time [11:05:02] <_joe_> Reedy: can I ask you a review of https://gerrit.wikimedia.org/r/#/c/265706 ? [11:05:29] Should I -1 it for style issues?
:P [11:06:25] <_joe_> argh, damn mediawiki style guides :P [11:07:06] (03PS3) 10Giuseppe Lavagetto: monolog: ignore udp2log flow for sessions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265706 [11:07:10] <_joe_> that better? [11:07:28] can you see, now everything is synced back again [11:07:46] it doesn't affect me [11:10:02] PROBLEM - HHVM rendering on mw1200 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.002 second response time [11:12:00] (03PS1) 10Giuseppe Lavagetto: conftool-data: re-introduce citoid service on SCA [puppet] - 10https://gerrit.wikimedia.org/r/265709 [11:12:11] RECOVERY - HHVM rendering on mw1200 is OK: HTTP OK: HTTP/1.1 200 OK - 70405 bytes in 1.390 second response time [11:13:49] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool-data: re-introduce citoid service on SCA [puppet] - 10https://gerrit.wikimedia.org/r/265709 (owner: 10Giuseppe Lavagetto) [11:16:39] (03PS4) 10Giuseppe Lavagetto: monolog: ignore udp2log flow for sessions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265706 [11:17:28] !log codfw LVS under etcd/conftool control now, like ulsfo [11:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:17:37] (03CR) 10Reedy: [C: 031] monolog: ignore udp2log flow for sessions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265706 (owner: 10Giuseppe Lavagetto) [11:18:20] <_joe_> Reedy: thnx! [11:19:03] (03CR) 10Giuseppe Lavagetto: [C: 032] "We need to disable logging to fluorine at least for the weekend, or it will fill up quickly again."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/265706 (owner: 10Giuseppe Lavagetto) [11:21:51] PROBLEM - Disk space on fluorine is CRITICAL: DISK CRITICAL - free space: /a 136107 MB (3% inode=99%) [11:22:20] PROBLEM - Apache HTTP on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:24:21] RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 454 bytes in 0.053 second response time [11:25:09] !log oblivian@tin Synchronized wmf-config/InitialiseSettings.php: Stop writing session logs to fluorine (duration: 01m 25s) [11:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:25:42] <_joe_> uhm I still see logs incoming on fluorine [11:26:21] <_joe_> any idea why? [11:30:04] <_joe_> in theory, with the array having 'udp2log' => false this should /not/ log to fluorine [11:32:41] PROBLEM - Host mw1133 is DOWN: PING CRITICAL - Packet loss = 100% [11:34:11] RECOVERY - Host mw1133 is UP: PING OK - Packet loss = 0%, RTA = 1.23 ms [11:35:21] !log oblivian@tin Synchronized wmf-config/InitialiseSettings.php: Re-synching (duration: 00m 31s) [11:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:38:40] 7Puppet, 10Continuous-Integration-Config, 5Patch-For-Review: Jenkins jobs for puppet failing for no good reason - https://phabricator.wikimedia.org/T124395#1955103 (10hashar) 5Open>3Resolved Solved by clearing out the workspace. I have then migrated the job to run on Nodepool disposable instances, i.e. t...
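The `'udp2log' => false` behaviour _joe_ describes corresponds to the per-channel logging map in wmf-config. The following is a minimal sketch of that shape only; the exact array structure and channel entry here are assumptions reconstructed from the conversation, not a copy of the real InitialiseSettings.php:

```php
// Hypothetical excerpt of the per-channel monolog configuration in
// wmf-config/InitialiseSettings.php. Per the discussion above: turning off
// the udp2log handler should stop the 'session' channel from being relayed
// to fluorine, while an explicit logstash level keeps it visible in Kibana.
'wmgMonologChannels' => [
    'default' => [
        'session' => [
            'udp2log'  => false,    // no udp2log relay -> nothing written on fluorine
            'logstash' => 'debug',  // explicit level (cf. Gerrit change 265718)
        ],
    ],
],
```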
[11:42:11] (03PS1) 10Ema: codfw: add text nodes to mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/265710 (https://phabricator.wikimedia.org/T109286) [11:42:44] 6operations, 10ops-codfw: mw2173 has probably a broken disk, needs substitution and reimaging - https://phabricator.wikimedia.org/T124408#1955110 (10Joe) 3NEW [11:43:45] (03CR) 10BBlack: [C: 031] codfw: add text nodes to mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/265710 (https://phabricator.wikimedia.org/T109286) (owner: 10Ema) [11:44:02] (03PS1) 10Giuseppe Lavagetto: scap/dsh: remove mw2173 from scap list (broken disk) [puppet] - 10https://gerrit.wikimedia.org/r/265711 (https://phabricator.wikimedia.org/T124408) [11:44:13] (03CR) 10Giuseppe Lavagetto: [C: 031] codfw: add text nodes to mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/265710 (https://phabricator.wikimedia.org/T109286) (owner: 10Ema) [11:44:18] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1955124 (10BBlack) [11:44:20] 6operations, 10MediaWiki-General-or-Unknown, 10MobileFrontend-Feature-requests, 10Traffic, and 3 others: Fix mobile purging - https://phabricator.wikimedia.org/T124165#1955121 (10BBlack) 5Open>3Resolved a:3BBlack [11:44:40] (03CR) 10Giuseppe Lavagetto: [C: 032] scap/dsh: remove mw2173 from scap list (broken disk) [puppet] - 10https://gerrit.wikimedia.org/r/265711 (https://phabricator.wikimedia.org/T124408) (owner: 10Giuseppe Lavagetto) [11:44:56] (03CR) 10Giuseppe Lavagetto: [V: 032] scap/dsh: remove mw2173 from scap list (broken disk) [puppet] - 10https://gerrit.wikimedia.org/r/265711 (https://phabricator.wikimedia.org/T124408) (owner: 10Giuseppe Lavagetto) [11:49:43] jynus: for "s7 master is executing 54K transactions/s" , that is definitely a change that got pushed related to mw user session management [11:49:59] jynus: I have cced anomie tgr bd808 to the task [11:50:50] oh we conflicted [11:52:09] it 
really looks like since ~02:00 UTC Jan 21, someone's ripping through a massive purge job [11:52:16] we have scripts capable of that, right? [11:52:26] like, I see purges for articles on a wiki ripping through the alphabet :P [11:52:56] or is this a hook on administrative purge of parser cache, and we're doing that because we screwed up parser cache? [11:53:53] I only changed the parser cache a few hours ago [11:54:47] "changed"? [11:54:58] depooled 1/3 servers [11:55:10] <_joe_> jynus: when did you depool it?? [11:55:12] but that doesn't create "purges" [11:55:13] <_joe_> -? [11:55:17] <_joe_> it shouldn't [11:55:24] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?from=1453148402001&to=1453463631092&var-site=All&var-cache_type=All&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5&theme=dark [11:55:26] it is on the log [11:55:37] you can see the dramatic jump in purge there [11:55:54] 10:57 logmsgbot: jynus@tin Synchronized wmf-config/db-eqiad.php: Depool pc1001 for maintenance (duration: 02m 48s) [11:56:18] there's a later bump at circa 19:00 and later, which is the multiplying of desktop purges by 2x from my MFE change [11:56:22] 1 hour ago [11:56:26] but the big bump predates that by many hours [11:57:03] <_joe_> ok so unrelated [11:57:20] <_joe_> wtf, the jump is jaw-dropping [11:58:04] _joe_, are you talking purge issue or centralauth/session issue? 
[11:58:12] <_joe_> the purge issue [11:58:17] <_joe_> the session issue is now over [11:59:09] not really [11:59:10] (03PS2) 10Ema: codfw: add text nodes to mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/265710 (https://phabricator.wikimedia.org/T109286) [11:59:29] (03CR) 10Ema: [C: 032 V: 032] codfw: add text nodes to mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/265710 (https://phabricator.wikimedia.org/T109286) (owner: 10Ema) [11:59:31] but both are unrelated [11:59:40] PROBLEM - Host mw1201 is DOWN: PING CRITICAL - Packet loss = 100% [12:00:01] mine is deployment related- that happened hours after the deployment [12:00:51] RECOVERY - Host mw1201 is UP: PING OK - Packet loss = 0%, RTA = 3.49 ms [12:01:11] PROBLEM - Apache HTTP on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:01:31] <_joe_> jynus: ok I was referring to the crazy debug log [12:01:33] <_joe_> :) [12:03:11] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 454 bytes in 0.273 second response time [12:03:52] mmm [12:03:57] Autopurge.js [12:04:36] jynus: ? 
[12:05:18] I am grepping the logs, only getting some hypothetical offenders, but at least some leads [12:05:40] will share the interesting cases in private [12:06:18] but this looks suspicious: "uri_query":"?title=MediaWiki:Autopurge.js&action=raw&ctype=text/javascript" [12:06:52] (03PS1) 10BBlack: quadruple vhtcpd buffer size from 256MB to 1GB [puppet] - 10https://gerrit.wikimedia.org/r/265713 [12:08:41] PROBLEM - HHVM rendering on mw1145 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.003 second response time [12:09:01] PROBLEM - Apache HTTP on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:09:03] (03CR) 10BBlack: [C: 032] quadruple vhtcpd buffer size from 256MB to 1GB [puppet] - 10https://gerrit.wikimedia.org/r/265713 (owner: 10BBlack) [12:10:51] RECOVERY - HHVM rendering on mw1145 is OK: HTTP OK: HTTP/1.1 200 OK - 70149 bytes in 0.789 second response time [12:11:01] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 454 bytes in 0.142 second response time [12:11:57] (03Abandoned) 10Alexandros Kosiaris: mw2020 is prompting for a password [puppet] - 10https://gerrit.wikimedia.org/r/265677 (https://phabricator.wikimedia.org/T124380) (owner: 10Reedy) [12:22:02] (03Abandoned) 10Alexandros Kosiaris: memory error on mw2039 [puppet] - 10https://gerrit.wikimedia.org/r/265675 (https://phabricator.wikimedia.org/T124282) (owner: 10Reedy) [12:31:09] !log Starting migration of mobile traffic to text cluster https://phabricator.wikimedia.org/T109286 [12:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:38:21] PROBLEM - puppet last run on db2030 is CRITICAL: CRITICAL: puppet fail [12:41:08] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Delete config.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/265525 (owner: 10Alex Monk) [12:41:14] (03PS2) 10Alexandros Kosiaris: Delete config.beta.wmflabs.org [puppet] - 
10https://gerrit.wikimedia.org/r/265525 (owner: 10Alex Monk) [12:41:18] (03CR) 10Alexandros Kosiaris: [V: 032] Delete config.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/265525 (owner: 10Alex Monk) [12:44:10] PROBLEM - Host mw1140 is DOWN: PING CRITICAL - Packet loss = 100% [12:45:00] RECOVERY - Host mw1140 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [12:51:51] PROBLEM - Host mw1206 is DOWN: PING CRITICAL - Packet loss = 100% [12:53:00] RECOVERY - Host mw1206 is UP: PING OK - Packet loss = 0%, RTA = 2.01 ms [12:57:11] PROBLEM - HHVM rendering on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:57:11] PROBLEM - Apache HTTP on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:59:10] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 454 bytes in 0.148 second response time [12:59:11] RECOVERY - HHVM rendering on mw1142 is OK: HTTP OK: HTTP/1.1 200 OK - 70157 bytes in 0.713 second response time [13:01:11] PROBLEM - Apache HTTP on mw1221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:02:21] (03PS1) 10Giuseppe Lavagetto: monolog: explicitly declare logstash as debug for sessions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265718 [13:02:27] <_joe_> ori: ^^ [13:02:39] (03CR) 10jenkins-bot: [V: 04-1] monolog: explicitly declare logstash as debug for sessions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265718 (owner: 10Giuseppe Lavagetto) [13:02:44] (03PS2) 10Ori.livneh: monolog: explicitly declare logstash as debug for sessions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265718 (owner: 10Giuseppe Lavagetto) [13:02:58] (03CR) 10Ori.livneh: [C: 032] monolog: explicitly declare logstash as debug for sessions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265718 (owner: 10Giuseppe Lavagetto) [13:03:10] RECOVERY - Apache HTTP on mw1221 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 454 bytes in 0.072 second response time [13:03:25] (03Merged) 
10jenkins-bot: monolog: explicitly declare logstash as debug for sessions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265718 (owner: 10Giuseppe Lavagetto) [13:03:28] <_joe_> ori: I'm deploying it [13:03:37] <_joe_> you go do what you woke up for :) [13:03:40] nono [13:03:47] let me do it, i'll push bblack's change too [13:03:52] <_joe_> ok [13:04:10] RECOVERY - puppet last run on db2030 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [13:04:47] ori: thanks! :) [13:04:58] <_joe_> ori: there *might* be a server rebooting atm [13:05:14] !log ori@tin Synchronized wmf-config/InitialiseSettings.php: If443f3c80: monolog: explicitly declare logstash as debug for sessions (duration: 00m 34s) [13:05:18] sync-common: 100% (ok: 465; fail: 0; left: 0) [13:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:08:48] !log ori@tin Synchronized php-1.27.0-wmf.10/extensions/MobileFrontend: I08cdf37a1: Use TitleSquidURLs hook to purge mobile URLs directly (Bug: T124165) (duration: 00m 33s) [13:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:08:56] ^ bblack [13:16:31] ori: thanks! 
[13:17:36] 6operations, 7discovery-system: confctl should provide tags information after writing data - https://phabricator.wikimedia.org/T124413#1955242 (10ema) [13:18:10] (03CR) 10Alexandros Kosiaris: [C: 032] Fix upload.beta.wmflabs.org docroot path [puppet] - 10https://gerrit.wikimedia.org/r/265526 (owner: 10Alex Monk) [13:18:16] (03PS2) 10Alexandros Kosiaris: Fix upload.beta.wmflabs.org docroot path [puppet] - 10https://gerrit.wikimedia.org/r/265526 (owner: 10Alex Monk) [13:18:33] (03CR) 10Alexandros Kosiaris: [V: 032] Fix upload.beta.wmflabs.org docroot path [puppet] - 10https://gerrit.wikimedia.org/r/265526 (owner: 10Alex Monk) [13:22:48] (03PS1) 10Giuseppe Lavagetto: wikidata: disable cronjobs temporarily [puppet] - 10https://gerrit.wikimedia.org/r/265721 [13:28:48] (03PS2) 10Alexandros Kosiaris: WIP: DONT MERGE. cleanup SCA from *oid services [puppet] - 10https://gerrit.wikimedia.org/r/265541 [13:33:09] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Looks good and puppet compiler says noop, minor inline nitpick" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/265532 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [13:33:30] (03CR) 10Alexandros Kosiaris: "https://puppet-compiler.wmflabs.org/1646/mw1026.eqiad.wmnet/ puppet compiler link. Effectively a noop" [puppet] - 10https://gerrit.wikimedia.org/r/265532 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [13:35:13] akosiaris: did you see https://gerrit.wikimedia.org/r/#/c/265709/ ? [13:35:48] somehow related to your SCA cleanup above. 
before _joe_'s change, puppet-merge->conftool-sync was reporting some citoid-related issue over and over [13:37:05] (03CR) 10Alexandros Kosiaris: "https://gerrit.wikimedia.org/r/#/c/265541/ has been updated to account for this" [puppet] - 10https://gerrit.wikimedia.org/r/265709 (owner: 10Giuseppe Lavagetto) [13:37:18] bblack: yup, I've updated it accordingly [13:37:49] ok cool [13:38:06] and learned my lesson about using ddp instead of yyp in this case [13:38:44] well, those are single-line vim commands but I think you get my point [13:39:03] heh [13:39:15] yeah I think in vi, it's all good :) [13:50:15] (03PS2) 10Alex Monk: Move apache includes into generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/265532 (https://phabricator.wikimedia.org/T86644) [13:50:44] (03PS3) 10Alex Monk: Move apache includes into generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/265532 (https://phabricator.wikimedia.org/T86644) [13:50:56] (03PS2) 10Alex Monk: beta: Remove deployment.wmflabs.org VHost that doesn't actually resolve [puppet] - 10https://gerrit.wikimedia.org/r/265548 [13:51:04] (03PS2) 10Alex Monk: mediawiki: Move www.wikimedia.org portal into wwwportals [puppet] - 10https://gerrit.wikimedia.org/r/265642 [13:51:15] (03PS2) 10Alex Monk: beta: Move login and bits apache configs into wikimedia.conf, like prod [puppet] - 10https://gerrit.wikimedia.org/r/265659 [13:55:56] (03CR) 10Alexandros Kosiaris: [C: 032] Move apache includes into generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/265532 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [14:02:30] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: Puppet has 1 failures [14:02:30] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Puppet has 1 failures [14:02:40] PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: Puppet has 1 failures [14:02:50] PROBLEM - puppet last run on mw2052 is CRITICAL: CRITICAL: Puppet has 1 failures [14:03:01] PROBLEM - puppet last 
run on mw2024 is CRITICAL: CRITICAL: Puppet has 1 failures [14:03:10] PROBLEM - puppet last run on mw2120 is CRITICAL: CRITICAL: Puppet has 1 failures [14:03:20] PROBLEM - puppet last run on mw2095 is CRITICAL: CRITICAL: Puppet has 1 failures [14:03:20] PROBLEM - puppet last run on mw1073 is CRITICAL: CRITICAL: Puppet has 1 failures [14:03:31] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Puppet has 1 failures [14:03:40] PROBLEM - puppet last run on mw2077 is CRITICAL: CRITICAL: Puppet has 1 failures [14:03:41] PROBLEM - puppet last run on mw2136 is CRITICAL: CRITICAL: Puppet has 1 failures [14:03:46] <_joe_> wat? [14:03:53] <_joe_> akosiaris: ^^ [14:04:20] PROBLEM - puppet last run on mw2166 is CRITICAL: CRITICAL: Puppet has 1 failures [14:04:30] <_joe_> Error: /Stage[main]/Mediawiki::Web::Prod_sites/File[/etc/apache2/sites-enabled/wikimedia-common.incl]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/mediawiki/apache/sites/wikimedia-common.incl [14:04:39] <_joe_> revert now :P [14:06:43] argh [14:07:26] maybe just a dependency issue on mv and config and reload? [14:07:29] <_joe_> or maybe a puppetmaster fart? [14:07:34] (as in, fixes itself after) [14:07:39] or yeah, could be a master-side race [14:07:40] not clear to me what's wrong there.. 
[14:07:43] that happens on renames [14:07:51] <_joe_> Krenair: what bblack said [14:07:57] the client could get the config update but not the file-move, or vice-versa [14:08:18] the change was a damn noop [14:08:23] yeah [14:08:33] what bblack says makes sense [14:08:44] the best way around those races I've found (as opposed to splitting up the change in an ugly way), is to disable the agent on the affected hosts and then re-enable like 10 minutes after the puppet-merge [14:08:52] <_joe_> well, it really doesn't, but well, puppet [14:09:07] <_joe_> bblack: that's 400+ hosts in this case [14:09:15] that's what salt is for :) [14:09:32] I think not [14:09:40] I am looking at icinga and it's only those hosts [14:09:45] <_joe_> not what? [14:09:58] <_joe_> oh yeah I said 400+ hosts on which to disable puppet [14:10:03] akosiaris: often the race is tiny, and "those hosts" is the ones that were all executing the agent in a short window of time [14:10:42] so, just forcing a puppet run on "those hosts" will fix the issue, no ? 
[14:10:59] which is what I am already doing [14:10:59] <_joe_> it should, trying one right now [14:11:03] and yes [14:11:51] forcing via salt a puppet run on these hosts [14:12:11] RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:13:10] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [14:14:35] (03PS2) 10ArielGlenn: rewrite pagerange.py so it's both fast and useful [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/264040 (https://phabricator.wikimedia.org/T123571) [14:15:01] RECOVERY - puppet last run on mw2166 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [14:15:21] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [14:15:31] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [14:15:42] RECOVERY - puppet last run on mw2052 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [14:16:01] RECOVERY - puppet last run on mw2024 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [14:16:01] RECOVERY - puppet last run on mw2120 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:11] RECOVERY - puppet last run on mw1073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:11] RECOVERY - puppet last run on mw2095 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:30] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:31] RECOVERY - puppet last run on mw2077 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:06] !log cr1-eqord: turning up BGP with Zayo [14:18:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:20:37] 6operations, 10Traffic: compressed http responses without content-length not cached by varnish - https://phabricator.wikimedia.org/T124195#1955351 (10elukey) Something that might be interesting: https://httpd.apache.org/docs/2.4/mod/event.html#how-it-works Disabling mod_deflate could be good if we plan to te... [14:26:38] (03CR) 10Luke081515: [C: 031] Add import source for ru.wikisource.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264937 (https://phabricator.wikimedia.org/T123837) (owner: 10Mdann52) [14:27:34] (03CR) 10Luke081515: Add 2 sites to $wgCopyUploadsDomains (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/262893 (https://phabricator.wikimedia.org/T122995) (owner: 10Mdann52) [14:29:32] (03PS4) 10Luke081515: Config changes for gu.wikiquote.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263614 (https://phabricator.wikimedia.org/T121853) (owner: 10Mdann52) [14:36:15] 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 10Traffic: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1955365 (10BBlack) 3NEW [14:39:14] 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 10Traffic: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1955373 (10BBlack) From looking at runJob logs, I've initially started to suspect something related to htmlCa... [14:40:58] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "LGTM. That's definitely not our IP space. 
First introduced in https://gerrit.wikimedia.org/r/#/c/78394/" [puppet] - 10https://gerrit.wikimedia.org/r/265551 (owner: 10BBlack) [14:41:09] (03PS2) 10Alexandros Kosiaris: firewall: use correct networks for $INTERNAL_V6 [puppet] - 10https://gerrit.wikimedia.org/r/265551 (owner: 10BBlack) [14:41:14] (03CR) 10Alexandros Kosiaris: [V: 032] firewall: use correct networks for $INTERNAL_V6 [puppet] - 10https://gerrit.wikimedia.org/r/265551 (owner: 10BBlack) [14:49:57] 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 10Traffic: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1955387 (10BBlack) (also, note for posterity: https://gerrit.wikimedia.org/r/#/c/265713/ was related to this:... [14:54:30] PROBLEM - HHVM rendering on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:56:30] RECOVERY - HHVM rendering on mw1223 is OK: HTTP OK: HTTP/1.1 200 OK - 70145 bytes in 0.576 second response time [15:06:41] PROBLEM - Apache HTTP on mw1126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:51] RECOVERY - Apache HTTP on mw1126 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 454 bytes in 0.068 second response time [15:20:50] PROBLEM - Host mw1202 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:20] RECOVERY - Host mw1202 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [15:22:30] !log created translate tables on ruwikimedia T121766 [15:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:10] PROBLEM - HHVM rendering on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:29:20] RECOVERY - HHVM rendering on mw1132 is OK: HTTP OK: HTTP/1.1 200 OK - 70153 bytes in 0.577 second response time [15:29:51] PROBLEM - Host mw1123 is DOWN: PING CRITICAL - Packet loss = 100% [15:30:20] RECOVERY - Host mw1123 is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [15:30:31] PROBLEM - HHVM rendering on mw1225 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [15:32:50] RECOVERY - HHVM rendering on mw1225 is OK: HTTP OK: HTTP/1.1 200 OK - 70146 bytes in 1.415 second response time [15:33:30] PROBLEM - Host mw1195 is DOWN: PING CRITICAL - Packet loss = 100% [15:34:41] RECOVERY - Host mw1195 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [15:35:01] PROBLEM - Apache HTTP on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:37:10] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 454 bytes in 0.397 second response time [15:38:50] PROBLEM - HHVM rendering on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:39:05] !log aqs: increased compression block size on per-article table from 128k to 256k; expectation is to further increase compression ratio & reduce seeks on rotating disks [15:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:40:50] PROBLEM - Host mw1118 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:01] RECOVERY - HHVM rendering on mw1131 is OK: HTTP OK: HTTP/1.1 200 OK - 70146 bytes in 1.752 second response time [15:42:21] RECOVERY - Host mw1118 is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms [15:47:00] PROBLEM - HHVM rendering on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:49:01] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 70145 bytes in 1.301 second response time [15:49:21] PROBLEM - Host mw1128 is DOWN: PING CRITICAL - Packet loss = 100% [15:49:34] (03PS1) 10Ema: codfw: remove varnish-fe,nginx services from mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/265742 (https://phabricator.wikimedia.org/T109286) [15:50:51] RECOVERY - Host mw1128 is UP: PING OK - Packet loss = 0%, RTA = 2.23 ms [15:51:06] (03CR) 10BBlack: [C: 031] codfw: remove varnish-fe,nginx services from mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/265742 (https://phabricator.wikimedia.org/T109286) (owner: 10Ema) 
[15:53:47] !log Finished migrating mobile traffic to text cluster in codfw (Mexico + green US states on this map https://phabricator.wikimedia.org/T114659) [15:53:50] PROBLEM - Host mw1226 is DOWN: PING CRITICAL - Packet loss = 100% [15:53:50] PROBLEM - Host mw1138 is DOWN: PING CRITICAL - Packet loss = 100% [15:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:54:31] RECOVERY - Host mw1138 is UP: PING OK - Packet loss = 0%, RTA = 2.67 ms [15:54:41] RECOVERY - Host mw1226 is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [15:54:48] (03CR) 10Giuseppe Lavagetto: [C: 031] codfw: remove varnish-fe,nginx services from mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/265742 (https://phabricator.wikimedia.org/T109286) (owner: 10Ema) [15:54:51] PROBLEM - Apache HTTP on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:55:10] PROBLEM - HHVM rendering on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:57:01] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 454 bytes in 0.072 second response time [15:57:11] RECOVERY - HHVM rendering on mw1138 is OK: HTTP OK: HTTP/1.1 200 OK - 70145 bytes in 0.752 second response time [15:57:55] 6operations, 10Continuous-Integration-Infrastructure: Investigate usage of ttf-ubuntu-font-family which is not available on Jessie - https://phabricator.wikimedia.org/T103325#1955622 (10akosiaris) Any news on this ? 
[16:01:01] PROBLEM - Host mw1227 is DOWN: PING CRITICAL - Packet loss = 100% [16:02:20] (03CR) 10Ema: [C: 04-2 V: 032] codfw: remove varnish-fe,nginx services from mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/265742 (https://phabricator.wikimedia.org/T109286) (owner: 10Ema) [16:02:45] (03CR) 10Ema: [C: 032] codfw: remove varnish-fe,nginx services from mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/265742 (https://phabricator.wikimedia.org/T109286) (owner: 10Ema) [16:02:50] RECOVERY - Host mw1227 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [16:10:41] PROBLEM - HHVM rendering on mw1193 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.801 second response time [16:12:51] RECOVERY - HHVM rendering on mw1193 is OK: HTTP OK: HTTP/1.1 200 OK - 70145 bytes in 1.118 second response time [16:13:06] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/163814 (owner: 10Hashar) [16:13:21] PROBLEM - Host mw1205 is DOWN: PING CRITICAL - Packet loss = 100% [16:13:41] PROBLEM - Host mw1124 is DOWN: PING CRITICAL - Packet loss = 100% [16:14:40] RECOVERY - Host mw1205 is UP: PING OK - Packet loss = 0%, RTA = 1.52 ms [16:15:11] RECOVERY - Host mw1124 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [16:21:11] fluorine / is full again, can someone investigate? 
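Tracking down which log file is eating a partition, as the channel does next with session.log and DBPerformance.log, is typically a df-then-du pass. Here is a generic, self-contained sketch run against a throwaway directory; the `/tmp/fake-mw-log` path and the file sizes are stand-ins, not the real layout of fluorine's log filesystem:

```shell
# Stand-in for a log partition; on the real host this would be something
# like the /a filesystem from the earlier Icinga disk-space alert.
mkdir -p /tmp/fake-mw-log
printf '%102400s' ' ' > /tmp/fake-mw-log/session.log   # large offender (~100 KB here)
printf '%1024s'   ' ' > /tmp/fake-mw-log/runJobs.log   # ordinary log (~1 KB here)

# Rank entries by apparent size, biggest first, to spot the outlier quickly.
# (GNU du: -a lists files, -b counts apparent bytes.)
du -ab /tmp/fake-mw-log | sort -rn | head -n 3
```

The directory total sorts first, then the largest individual file, which is usually all you need before deciding what to truncate or compress.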
[16:21:47] !log restarted statsv on hafnium [16:21:49] ^ phedenskog [16:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:22:03] -rw-r--r-- 1 udp2log udp2log 125G Jan 22 11:56 session.log [16:22:05] paravoid: ^ [16:22:15] i'll just delete it [16:22:18] no [16:22:21] _joe_ did that earlier [16:22:28] ok [16:22:34] <_joe_> I didn't delete it [16:22:34] can someone preferably from ops :) deal with the source of the problem [16:22:42] <_joe_> I stopped it from writing [16:22:52] <_joe_> another big outlier is DBPerformance.log [16:22:55] <_joe_> 95 G [16:23:14] reacting to CRITICAL disk space alerts is ops 101 [16:23:51] The db performance one could probably be sampled [16:23:58] <_joe_> jynus: do you use dbperformance logs on fluorine? [16:24:04] not at all [16:24:07] _joe_: AaronSchul.z does [16:24:07] <_joe_> hoo: we might just use logstash for those [16:24:13] <_joe_> hoo: oh ok [16:24:23] they are, in fact, useless to me [16:24:25] multiple opsens should be monitoring icinga and reacting on such alerts (even if it is to say "in a meeting, can someone else do it?") [16:25:02] only very few of us are actually responding to these and it kinda takes its toll [16:30:29] 10Ops-Access-Requests, 6operations: Create new puppet group `analytics-search-users` - https://phabricator.wikimedia.org/T122620#1955771 (10EBernhardson) [16:33:55] 6operations, 10ops-eqiad: RMA Juniper EX-UM-2X4SFP UPLINK - https://phabricator.wikimedia.org/T124436#1955793 (10Cmjohnson) 3NEW a:3Cmjohnson [16:36:47] <_joe_> !log all api appservers in eqiad have been restarted [16:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:41:15] !log Troubleshooting mw1228 [16:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:41:43] <_joe_> cmjohnson1: what's up with mw1228? 
[16:42:02] _joe_ https://phabricator.wikimedia.org/T122005 [16:42:08] bad disk [16:42:21] <_joe_> oh ok oolld ticket [16:42:26] I need a few things off of it for the Dell ticket [16:42:28] <_joe_> I restarted all the api appservers today [16:42:46] it was powered off [16:43:30] <_joe_> cmjohnson1: k, just checking, the rolling restart didn't kill any server in eqiad, but it murdered a few in codfw [16:43:31] PROBLEM - Host mw1228 is DOWN: PING CRITICAL - Packet loss = 100% [16:44:24] I did see that. I am going to replace the disk now since I have spares. Do you want to do the re-install? I don't wanna mess with anything you are doing [16:44:58] 6operations, 6Services, 3Mobile-Content-Service: mobileapps service_checker flapping on scb - https://phabricator.wikimedia.org/T118383#1955817 (10mobrovac) 5Open>3Resolved a:3mobrovac This has been resolved quite some time ago, closing. [16:45:35] <_joe_> cmjohnson1: I'm actually going off now :) [16:46:14] !log anomie@tin Synchronized php-1.27.0-wmf.11/includes/session/SessionBackend.php: Fix T124409, part 1 (duration: 00m 33s) [16:46:17] okay...have a great weekend [16:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:46:43] <_joe_> you too... and safe travels with the blizzard :) [16:46:47] !log anomie@tin Synchronized php-1.27.0-wmf.11/extensions/CentralAuth/includes/session/CentralAuthSessionProvider.php: Fix T124409, part 2 (duration: 00m 32s) [16:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:47:23] !log truncating 100GB DBPerformance.log on fluorine, compressed backup available [16:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:49:37] 6operations, 10ops-eqiad: mw1228 reporting readonly file system - https://phabricator.wikimedia.org/T122005#1955824 (10Cmjohnson) Congratulations: Work Order SR923370958 was successfully submitted.
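For context on the truncation !log above, a plausible sketch of the "compressed backup, then truncate" sequence. The paths here are illustrative stand-ins, not the real fluorine layout. The point of `: > file` rather than `rm` is that it empties the file in place: the inode survives, so the daemons still holding the log open keep writing, and the disk space is actually freed immediately.

```shell
# Sketch only -- tiny stand-in file, not the real 100GB DBPerformance.log.
log=$(mktemp /tmp/DBPerformance.XXXXXX)
printf 'query 1\nquery 2\n' > "$log"     # stand-in for the huge log
gzip -c "$log" > "$log.gz"               # keep a compressed backup first
: > "$log"                               # truncate in place: same inode, so
                                         # processes with it open keep writing
echo "size now: $(wc -c < "$log") bytes; backup holds $(gzip -dc "$log.gz" | wc -l) lines"
```

Deleting an open log with `rm` instead would hide the file but leave the space allocated until every writer closed its descriptor, which is exactly what you don't want on a full filesystem.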
[16:52:32] 6operations, 10Analytics, 10ContentTranslation-Analytics, 10MediaWiki-extensions-ContentTranslation, and 2 others: schedule a daily run of ContentTranslation analytics scripts - https://phabricator.wikimedia.org/T122479#1955827 (10Nuria) [16:52:58] (03PS1) 10Giuseppe Lavagetto: monolog: reduce on-disk logging of DBPerformance to warning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265758 [16:53:04] <_joe_> jynus: ^^ [16:53:14] <_joe_> care to +1 this, if you agree? [16:53:28] I was looking at it, because I think it is not the only offender [16:53:46] I'm ok with it, it is just that I do not think it will be enough [16:54:04] <_joe_> well, to get us through the weekend, I hope [16:54:11] (03CR) 10Jcrespo: [C: 031] monolog: reduce on-disk logging of DBPerformance to warning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265758 (owner: 10Giuseppe Lavagetto) [16:54:30] can I deploy? [16:55:07] well, I was hoping the root cause would be fixed earlier than that [16:55:07] <_joe_> 1 sec [16:55:10] (03PS2) 10Giuseppe Lavagetto: monolog: reduce on-disk logging of DBPerformance to warning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265758 [16:55:17] <_joe_> now you might :) [16:55:24] <_joe_> I'll keep compressing old logs [16:56:08] don't see the difference?
[16:56:10] 6operations, 10hardware-requests: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#1955848 (10Cmjohnson) [16:56:19] (03CR) 10Jcrespo: [C: 032] monolog: reduce on-disk logging of DBPerformance to warning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265758 (owner: 10Giuseppe Lavagetto) [16:56:25] (03PS1) 10Reedy: Add redis.php to noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265760 [16:56:30] <_joe_> jynus: space before the ) [16:56:34] ah [16:56:44] (03CR) 10Reedy: [C: 032] Add redis.php to noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265760 (owner: 10Reedy) [16:58:55] !log jynus@tin Synchronized wmf-config/InitialiseSettings.php: monolog: reduce on-disk logging of DBPerformance to warning (duration: 00m 32s) [16:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:59:18] no errors? Did you finish the depool/restart? [16:59:32] <_joe_> jynus: yes, I decided not to do appservers [16:59:37] <_joe_> too late in the day [16:59:45] (03Merged) 10jenkins-bot: Add redis.php to noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265760 (owner: 10Reedy) [17:00:25] :-) [17:00:49] I do not see errors on fatalmonitor == I see the same errors as 5 minutes ago [17:00:53] !log reedy@tin Synchronized docroot and w: Extra noc symlinks (duration: 00m 32s) [17:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:02:27] DBPerformance.log stopped growing, I just think it is not the only one growing fast [17:02:40] it's not :P [17:02:54] <_joe_> I just want to get us through the weekend [17:03:05] let me do a quick calculation [17:03:08] <_joe_> gzipping the files that failed to compress yesterday should help [17:03:43] that should've saved a couple of hundred gig or more [17:03:55] <_joe_> more [17:04:42] <_joe_> 152G Jan 22 07:43 DBPerformance.log-20160122 [17:04:57] !log mobileapps deploying
bba45456 [17:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:05:13] BTW, paravoid- I routinely fix stat* low disk boxes, aside obviously from db ones, in very creative ways [17:05:47] the fact that you do not see it happening doesn't mean it doesn't happen [17:06:24] Are they still going? [17:06:25] lol [17:09:30] <_joe_> jynus: the only two things that really ballooned out are session.log and DBPerformance, it seems [17:10:26] 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1955894 (10Reedy) 3NEW a:3Anomie [17:10:28] yes, _joe_ I have to give you that, I was wrong [17:10:49] but you know I prefer to be wrong and check than not to check at all [17:12:08] (03CR) 10Krinkle: "@20after4: Yes and no. Mostly yes and no problem." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [17:12:26] 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1955904 (10Reedy) [17:12:43] 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1955894 (10Reedy) [17:13:18] 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1955909 (10Reedy) a:5Anomie>3None [17:13:32] ottomata: kafka1020 has been having flapping disk free space warnings [17:14:23] I will report these changes on the relevant tickets so they do not rely on it and assume things are fixed :-) [17:15:50] 6operations, 10ops-codfw, 5Patch-For-Review: mw2039 fails to reboot, mgmt interface unreachable - https://phabricator.wikimedia.org/T124282#1955925 (10greg) >>! In T124282#1954805, @Joe wrote: > @greg I was actually waiting for Papaul to take a look before permanently depooling this server from rotation. We...
[17:17:20] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1955929 (10Reedy) [17:17:55] 6operations, 10Wiki-Loves-Monuments-General, 10Wikimedia-DNS, 5Patch-For-Review, 7domains: point wikilovesmonument.org ns to wmf - https://phabricator.wikimedia.org/T118468#1955932 (10akosiaris) Repeating @faidon's comment from the gerrit change > Why are we not owning this domain? I don't think we sho... [17:19:10] "Wikipedia loading slowly" reported in https://phabricator.wikimedia.org/T124417 - anything else wanted / recommended besides a traceroute? [17:20:23] 6operations, 10vm-requests: request VM for releases.wm.org - https://phabricator.wikimedia.org/T124261#1955935 (10akosiaris) @Dzahn, since we will be doing this anyway, doing it twice for `codfw` as well, doesn't seem to be that much extra trouble so I say we do it. Correct me if I am wrong though. I am a bit... [17:20:26] 6operations, 10Wiki-Loves-Monuments-General, 10Wikimedia-DNS, 5Patch-For-Review, 7domains: point wikilovesmonument.org ns to wmf - https://phabricator.wikimedia.org/T118468#1955936 (10faidon) I discussed this with @JanZerebecki in person during the dev summit. I maintain that we should only be handling d...
[17:20:55] (03CR) 10Alexandros Kosiaris: [C: 031] admin: add dc-ops to install-server, allow puppet agent -t -v [puppet] - 10https://gerrit.wikimedia.org/r/264994 (https://phabricator.wikimedia.org/T123681) (owner: 10Dzahn) [17:24:30] RECOVERY - Disk space on fluorine is OK: DISK OK [17:25:10] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [17:27:46] 6operations, 6Commons, 10MassMessage, 10MediaWiki-JobQueue: Not all MassMessage sent - https://phabricator.wikimedia.org/T124441#1955944 (10Steinsplitter) 3NEW [17:28:44] andre__: maybe proxy settings [17:29:16] !log anomie@tin Synchronized php-1.27.0-wmf.11/extensions/CentralAuth/includes: Fix T124406 (duration: 00m 35s) [17:29:17] but if it's multiple users it suggests something on the WMF side or with routing, I guess [17:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:30:10] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 151 seconds ago with 0 failures [17:30:27] can someone take a look at https://phabricator.wikimedia.org/T124441 ? i need them sent out today. [17:30:41] (03CR) 10Paladox: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/265628 (owner: 10Subramanya Sastry) [17:32:22] (03CR) 10Subramanya Sastry: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/265628 (owner: 10Subramanya Sastry) [17:33:52] Last chance to object to me just deleting the 20 days worth of Logstash events that got classified as 2015-01-* instead of 2016-01-* [17:34:30] Nobody noticed for 20 days so I'm guessing that the data there isn't worth the 3 days it would take me to move it into the right indices [17:34:36] Steinsplitter: MassMessage is apparently broken at the moment, but people are looking into it [17:34:46] thx [17:35:07] Steinsplitter: although… that breakage is different.
https://phabricator.wikimedia.org/T124414 [17:39:02] <_joe_> !log removed an archived CirrusSearchRequests.log on fluorine, now we have enough room for the weekend [17:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:42:12] (03PS1) 10EBernhardson: Stop generating the CirrusSearchRequests log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265772 [17:44:02] (03CR) 10EBernhardson: [C: 032] Stop generating the CirrusSearchRequests log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265772 (owner: 10EBernhardson) [17:44:14] !log running migrateAccount.php --attachbroken over lists on T74791 [17:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:44:48] (03Merged) 10jenkins-bot: Stop generating the CirrusSearchRequests log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265772 (owner: 10EBernhardson) [17:45:31] I am going to revert the depool of pc1001- it will be better than keeping it depooled all weekend, and I do not trust myself with deployments during the weekend [17:45:41] hope you understand [17:45:43] (03CR) 10Dzahn: [C: 04-1] "i think trebuchet is still used for new repos , so it's needed" [puppet] - 10https://gerrit.wikimedia.org/r/219372 (owner: 10ArielGlenn) [17:46:03] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Stop logging the CirrusSearchRequests channel (duration: 00m 32s) [17:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:47:26] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#1956019 (10GWicke) 5Open>3Resolved a:3GWicke A basic event bus is now available in production, and is being populated with edit events from MediaWiki. Cons... 
[17:48:31] (03PS1) 10EBernhardson: Stop syncing CirrusSearchRequests from fluorine to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/265773 [17:49:09] jynus: is https://phabricator.wikimedia.org/T124406 improved enough to call it fixed? (s7 qps) [17:49:15] (03PS1) 10Jcrespo: Repooling pc1001 to not leave it depooled for several days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265774 [17:49:23] (03PS2) 10EBernhardson: Stop syncing CirrusSearchRequests from fluorine to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/265773 [17:49:38] (03PS2) 10Jcrespo: Repooling pc1001 to not leave it depooled for several days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265774 [17:52:04] (03CR) 10Jforrester: [C: 031] "Scheduled for Monday." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258206 (https://phabricator.wikimedia.org/T92661) (owner: 10Jforrester) [17:52:21] bd808, please give me some minutes [17:52:35] yeah, no rush [17:52:57] I think it is, but my monitoring has some lag, it certainly is better, I want to confirm that it has gone back to pre-deployment levels [17:52:59] 6operations, 6Parsing-Team, 10Parsoid, 6Services, 5Patch-For-Review: Update ruthenium to Debian jessie from Ubuntu 12.04 - https://phabricator.wikimedia.org/T122328#1956051 (10Dzahn) a:3Dzahn [17:53:42] !log manually attaching User:Mower Genetics and User:Themeetingplace because they made edits somehow (T74791) [17:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:54:03] (03PS1) 10Dzahn: dhcp: let ruthenium use jessie-installer [puppet] - 10https://gerrit.wikimedia.org/r/265777 (https://phabricator.wikimedia.org/T122328) [17:54:04] not only the master was affected, one slave also had higher load, so I am checking that, too [17:55:00] (03CR) 10Jcrespo: [C: 032] Repooling pc1001 to not leave it depooled for several days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265774 (owner: 10Jcrespo) [17:55:10] PROBLEM -
check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [17:55:19] (03CR) 10Mobrovac: [C: 04-1] "One minor detail, otherwise LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/265628 (owner: 10Subramanya Sastry) [17:59:14] (03PS2) 10Dzahn: dhcp: let ruthenium use jessie-installer [puppet] - 10https://gerrit.wikimedia.org/r/265777 (https://phabricator.wikimedia.org/T122328) [17:59:56] (03CR) 10Dzahn: [C: 032] dhcp: let ruthenium use jessie-installer [puppet] - 10https://gerrit.wikimedia.org/r/265777 (https://phabricator.wikimedia.org/T122328) (owner: 10Dzahn) [18:00:10] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [18:01:43] (03CR) 10Dzahn: "has been uploaded in 2014 ...bump" [puppet] - 10https://gerrit.wikimedia.org/r/130296 (owner: 10ArielGlenn) [18:02:42] !log anomie@tin Synchronized php-1.27.0-wmf.11/includes/user/User.php: Fix T124414 (duration: 00m 33s) [18:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:03:55] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T118468#1955936" [dns] - 10https://gerrit.wikimedia.org/r/252703 (https://phabricator.wikimedia.org/T118468) (owner: 10JanZerebecki) [18:04:37] (03CR) 10Dzahn: "removing self" [puppet] - 10https://gerrit.wikimedia.org/r/263745 (owner: 10JanZerebecki) [18:05:10] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [18:07:22] 6operations, 6Discovery, 7Elasticsearch: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#1956116 (10demon) 3NEW [18:07:23] (03PS5) 10Subramanya Sastry: Migrate parsoid::role::testing service from upstart to systemd [puppet] - 10https://gerrit.wikimedia.org/r/265628 [18:07:37] ostriches: heh, quarterly review outcome ^? 
:) [18:07:42] (03CR) 10Dzahn: "probably good but needs labs people to weigh in and it's been a couple months in my queue" [puppet] - 10https://gerrit.wikimedia.org/r/241582 (https://phabricator.wikimedia.org/T109485) (owner: 10Tim Landscheidt) [18:08:13] greg-g: Yeah, paravoid and I were talking about it in the chat. At least see if it's possible/needed. [18:08:26] * greg-g nods [18:08:35] we can't see the chat, just in case you were wondering [18:08:46] (03CR) 10Dzahn: "the one it depends on has been abandoned. i dont know the status anymore" [puppet] - 10https://gerrit.wikimedia.org/r/254465 (owner: 10Andrew Bogott) [18:09:46] the chat? [18:09:55] the chat in bluejeans for the quarterly review [18:10:10] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [18:10:32] got it, thx [18:13:45] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1956162 (10Legoktm) a:3Legoktm Per IRC discussion we're going to just run the resetGlobalUserTokens.php script to force a reset of all user tokens, which will log e... [18:15:02] !log running CentralAuth's resetGlobalUserTokens.php to force session resets for all users T124440 [18:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:16:10] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [18:16:30] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). 
[18:17:15] (03CR) 10Mobrovac: [C: 031] Migrate parsoid::role::testing service from upstart to systemd [puppet] - 10https://gerrit.wikimedia.org/r/265628 (owner: 10Subramanya Sastry) [18:18:59] ori, mobrovac has +1ed https://gerrit.wikimedia.org/r/#/c/264032/ and the followup patch https://gerrit.wikimedia.org/r/#/c/265628/ if you want to take a look. [18:23:22] (03Abandoned) 10BBlack: mobile-lb: use text caches as LVS backends [puppet] - 10https://gerrit.wikimedia.org/r/258459 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [18:40:22] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Aborting pc1001 maintenance (duration: 00m 31s) [18:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:40:50] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [18:41:02] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [18:41:19] 6operations, 10OTRS, 7user-notice: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1956329 (10Ata) OTRS is being [[ https://www.transifex.com/otrs/ | localised ]] on Transifex. Am I right that it will be upgraded to 5.0 including translations present there on February... [18:42:17] 6operations, 6Commons, 10MassMessage, 10MediaWiki-JobQueue: Not all MassMessage sent - https://phabricator.wikimedia.org/T124441#1956333 (10matmarex) T124414 has been fixed and the fix backported, can you try again and see if it works now? [18:48:08] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Search-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1956377 (10EBernhardson) 5Resolved>3Open @akosiaris I finally got around to testing out the usage of this. The eqiad... 
[18:51:44] !log "repairing" enwiki.oldtable on dbstore1001 [18:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:53:32] 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 10Traffic: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1956423 (10BBlack) p:5High>3Unbreak! This is getting worse now. vhtcpd can't forward messages as fast as... [18:54:16] bblack: suppose we didn't figure out the issue with htmlCacheUpdates? [18:54:18] Reedy: An answer at T119829 would be nice^^ [18:54:27] aude: no, I still have no idea [18:54:29] not sure i can help much right now though :/ [18:54:40] I need help from people who understand that layer of things, it's not really my area [18:54:47] bblack: we haven't pushed out new wikidata code since last week, for one thing [18:54:50] 2) [18:54:57] but it's becoming a real operational problem. we're probably losing purges. [18:55:02] i noticed last night that i was getting freshly cached wikipedia pages [18:55:24] hopefully we're losing pointless purges. we have no idea the rate at which they might be overflowing buffers at this point [18:55:25] i was investigating https://phabricator.wikimedia.org/T47839 (not poking at anything) [18:55:38] and the pages that were broken were suddenly all purged [18:56:08] (i didn't tell anyone the names of the pages and they weren't edited, nor were their items) [18:56:11] aude: htmlCacheUpdate jobs were backed up [18:56:19] we increased the rate at which they are executed [18:56:27] ori: ok [18:56:28] that would explain why they were suddenly all purged [18:56:46] but the reason increasing the rate was needed in the first place is that we have many, many more such jobs than we used [18:56:47] to [18:56:55] hm :/ [18:57:03] ori: is there some way we can go back to throttling them at the queue level in general?
maybe at a higher rate than before, but some kind of throttle knob would be nice [18:57:05] and we suspect wikidata is generating most of them [18:57:13] without a throttle there, they can just get dropped on the floor when the rate is too high [18:58:10] aude: the increase seems to have happened in stages: on 12/04 we went from ~5/s to ~7.5/s (assuming i am using the right units); on 12/11 it went to ~20/s, and on the 20th we went to ~27/s [18:58:23] aude: http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1453466175.547&target=MediaWiki.jobqueue.inserts_actual.htmlCacheUpdate.rate&from=-80days [18:58:27] ok [18:58:42] i don't think we have a task for that yet [18:58:43] current state of affairs is the per-cache-machine vhtcpd daemons are now effectively the throttle. As their queue grew too large, it takes too much CPU to manage the queue, so they're stuck at 100% of one CPU core and dribbling them out at a fixed and decreasing rate heh [18:58:45] could you file one, possibly? [18:58:51] ori: I did, above [18:58:55] oh great [18:58:58] https://phabricator.wikimedia.org/T124418 [18:59:20] thank you [18:59:48] looking to see when we deployed new code [19:01:24] bblack: i'll work on a patch to throttle htmlcacheupdate jobs more [19:01:39] ori: thanks [19:03:33] we had new code last week and i think december 9 [19:03:45] and week of december 4 would have been deployment freeze [19:04:54] (03PS1) 10Ottomata: Refactor MirrorMaker puppetization [puppet/kafka] - 10https://gerrit.wikimedia.org/r/265789 (https://phabricator.wikimedia.org/T124077) [19:06:21] it may not be code, it may be things like new template relationships or an increase in wikidata link ref usage on other wikis, etc [19:06:35] it may not be wikidata, too [19:06:38] * aude nods [19:06:40] but it's spread over a lot of wikis [19:07:19] wasn't there an arbitrary access enabled on wikidata in December?
[19:07:20] and wikidata has a fair number of htmlCacheUpdate jobs, and I suspect more than other wikis, those jobs end up hitting a whole lot of pages per job for the data refs [19:08:58] debian folks, which is better for packaging? specifying `/usr/bin/env foo` for the shebang or `/usr/bin/foo`? [19:09:25] 6operations, 6Commons, 10MassMessage, 10MediaWiki-JobQueue: Not all MassMessage sent - https://phabricator.wikimedia.org/T124441#1956519 (10Legoktm) Eh, this is probably different: legoktm@terbium:~$ ./jobs.sh commonswiki MassMessageJob: 0 queued; 137 claimed (0 active, 137 abandoned); 0 delayed [19:09:37] 6operations, 6Discovery, 7Elasticsearch: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#1956521 (10bd808) It might be easiest to use nginx as a reverse proxy in front of port 9200 ElasticSearch traffic if all you are hoping to add is TLS support using something similar... [19:09:47] jynus: for wikispecies on dec 2, as well as mediawiki.org and wikinews [19:10:03] and meta-wiki on dec 15 [19:10:16] think this wouldn't have much impact [19:10:46] 6operations, 6Commons, 10MassMessage, 10MediaWiki-JobQueue: Not all MassMessage sent - https://phabricator.wikimedia.org/T124441#1956525 (10Legoktm) Heh, actually it's the same: ``` 2016-01-22 14:50:16 mw1167 commonswiki exception ERROR: [d4c530b5] /rpc/RunJobs.php?wiki=commonswiki&type=MassMessageJob&maxt... [19:11:44] Steinsplitter: ^^ will finish that after I find food [19:13:55] 6operations, 10OTRS, 7user-notice: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1956536 (10akosiaris) >>! In T74109#1956329, @Ata wrote: > OTRS is being [[ https://www.transifex.com/otrs/ | localised ]] on Transifex. > Am I right that it will be upgraded to 5.0 inc...
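On the shebang question above: the practical difference is that `#!/usr/bin/env foo` resolves the interpreter through `$PATH` at run time, while an absolute path like `/usr/bin/foo` pins one specific binary (which is generally what packaging wants, since the package can then depend on exactly that interpreter). A toy illustration with a throwaway script, not anything from the actual Debian packages under discussion:

```shell
# Toy illustration of env-vs-absolute-path shebangs; throwaway file.
script=$(mktemp)
cat > "$script" <<'EOF'
#!/usr/bin/env sh
# 'env' searched $PATH for 'sh' here; a pinned '#!/bin/sh' would skip
# that lookup and always run the same binary.
echo "running under: $0"
EOF
chmod +x "$script"
"$script"
```

The `env` form is handy for scripts run from checkouts with interpreters in odd locations; the pinned form is more predictable once a package controls where the interpreter lives.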
[19:14:33] legoktm: thx [19:17:46] One thing I want to show you, aude, is https://logstash.wikimedia.org/#dashboard/temp/AVJqw-I3ptxhN1XaMeTj (this is not related to the job issue) [19:18:15] nor do I think it is caused by wikidata, but some timeouts should definitely be tuned for it [19:19:03] hmmm [19:19:21] 13:05 wikidata maintenance is disabled [19:20:07] subbu: all the ruthenium data from /mnt/data has been copied now [19:20:10] either it creates 90% of the traffic or it needs some throttling/tuning [19:21:35] (03PS1) 10Ori.livneh: Cut the number of dedicated htmlCacheUpdate runner loops by half [puppet] - 10https://gerrit.wikimedia.org/r/265792 [19:22:25] bblack: ^ [19:22:29] (i'm not proud of it) [19:22:38] (03CR) 1020after4: "@Krinkle: cool, that all sounds good to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [19:22:45] but i don't want to hand-edit 20 yaml files or whatever [19:23:03] ok [19:23:22] (03CR) 10BBlack: [C: 031] Cut the number of dedicated htmlCacheUpdate runner loops by half [puppet] - 10https://gerrit.wikimedia.org/r/265792 (owner: 10Ori.livneh) [19:23:41] (03CR) 10Ori.livneh: [C: 032] Cut the number of dedicated htmlCacheUpdate runner loops by half [puppet] - 10https://gerrit.wikimedia.org/r/265792 (owner: 10Ori.livneh) [19:24:04] i'll force a puppet run on the jobrunners and restart them [19:24:59] thanks [19:25:14] I've been playing with trying to get vhtcpd to catch up faster on one host, but no luck so far [19:25:14] (03CR) 10Aklapper: [C: 031] Correct HTML code for WMF image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264461 (owner: 10Suriyaa Kudo) [19:25:37] (e.g.
using prlimit + gdb to bump the memlock ulimit and then mlockall() the process to avoid pagefaults heh) [19:27:32] in a recent stat sample there, the incoming purges are 176/s and the outbound ones 17/s heh [19:27:52] I think it was able to output faster than that, before its internal queue backlog grew so big that managing queue memory slowed it down [19:30:42] maybe moving the url parsing code from the receiver to the individual purgers? [19:30:59] hmmm? [19:30:59] it's probably saner to just not touch it and fix whatever is causing the number of purges to spike [19:31:23] https://github.com/wikimedia/operations-software-varnish-vhtcpd/blob/master/src/receiver.c#L149-L193 [19:31:36] it's just one daemon, one process, one thread, running all of the code in vhtcpd [19:31:45] oh I see what you mean [19:32:24] yeah, could help! [19:32:46] oh, no, confused by own code [19:32:49] it is only parsed once [19:33:17] receiver_read_cb() is what pulls multicast off the wire. after parsing it enqueues it to the (single) queue [19:33:31] yeah, i didn't read all the source, i figured there were multiple worker threads / processes consuming the queue [19:33:32] then multiple senders pull from that queue to send to multiple varnishds [19:33:40] 6operations, 10DBA, 6Labs, 10Labs-Infrastructure: db1069 is running low on space - https://phabricator.wikimedia.org/T124464#1956656 (10jcrespo) 3NEW a:3jcrespo [19:33:54] it's just all eventloop stuff, the senders to the 2x varnishes are independent sets of io events in the same threads [19:34:08] (03PS1) 10Aaron Schulz: Revert "Bump $wgJobBackoffThrottling to lower the htmlcacheupdate backlog" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265794 [19:34:18] and then the queue has two virtual heads and one real head. when both varnishes have consumed an item, the real head considers it consumed from the queue. [19:34:20] and what is the bottleneck? 
the bottleneck is CPU speed right now, vhtcpd is using 100% of a CPU core [19:34:45] 6operations, 10OTRS, 7user-notice: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1956667 (10Rjd0060) >>! In T74109#1956329, @Ata wrote: > OTRS is being [[ https://www.transifex.com/otrs/ | localised ]] on Transifex. > Am I right that it will be upgraded to 5.0 inclu... [19:34:52] but that's because it has ~500MB and growing of data enqueued [19:35:11] it wasn't at 100% when the queue was smaller. I think it's burning more and more time managing the queue as the queue embiggens [19:35:45] before today, the queue size was limited to 256MB, and since the spike they've occasionally overflowed that limit (which just wipes the queue and starts over) [19:35:55] today I bumped it to 1GB hoping to avoid those periodic wipes and loss of purges [19:36:03] apparently that doesn't really help :) [19:36:06] (03CR) 10Aaron Schulz: [C: 032] Revert "Bump $wgJobBackoffThrottling to lower the htmlcacheupdate backlog" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265794 (owner: 10Aaron Schulz) [19:36:22] AaronSchulz: what is that going to do?
[19:37:20] ori: bblack i can maybe take more look later, but also maybe hoo or daniel has some ideas [19:37:28] another factor here, is that vhtcpd's code by-design prefers reading new multicast purges to getting existing ones out of the queue (in terms of event priority and all that) [19:37:47] (03Merged) 10jenkins-bot: Revert "Bump $wgJobBackoffThrottling to lower the htmlcacheupdate backlog" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265794 (owner: 10Aaron Schulz) [19:37:47] because once queued the purge is relatively safe, but if we're not pulling multicast off the network fast enough, they could be lost to UDP buffer loss [19:38:10] that prioritization is probably hurting now, as the queue is growing faster than it can drain heh [19:38:23] but otherwise we'd probably be dropping on the input side silently (may still be anyways) [19:38:43] (and probably at some point won't have power and internet for a while) [19:38:56] do you mind if i take a look on one host? (if so, which?) -- i'm not expecting to have any insights you haven't already had, but i want to get better at this [19:39:15] nothing obvious stands out, but we do collect various wikidata-related metrics on https://grafana.wikimedia.org/dashboard/db/wikidata [19:39:18] like entity usage [19:39:18] ori: lower the purge rate a bit [19:39:55] ori: try cp1066 [19:39:58] (nothing obviosu to me there that matches those dates, but worth to investigate more) [19:40:14] ori: that's the one I live-hacked (without restarting the process) to mlockall() its memory in case queue pagefaults were an issue somehow [19:40:59] mutante, great .. so, reimaging in progress then? and after that copy data over .. which will take couple days? 
[19:41:08] if we're looking to fix vhtcpd to cope, tbh the best approach would be to add SO_REUSEPORT to the bound listeners (assuming that works with multicast) and spawn several parallel processes [19:42:06] !log aaron@tin Synchronized wmf-config/CommonSettings.php: Revert "Bump $wgJobBackoffThrottling to lower the htmlcacheupdate backlog" (duration: 00m 32s) [19:42:10] that and/or thread internally for recv + 2x send, but then we have to deal with a multi-threaded queue [19:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:42:24] hey, starting yesterday I've been getting logged out (seemingly after a period of inactivity) within the same browser session. Is this a known issue? (cc. anomie?) [19:42:54] ^ bd808, andre__ [19:42:57] subbu: yea, just changed DHCP config to use jessie, was going to ask you if i can do it anytime and then reboot. saw that /home has a bit but gwicke said already it shouldnt have stuff that cant be recreated [19:42:57] anomie [19:43:02] sorry andre__, meant to ping anomie [19:43:54] (03PS1) 10EBernhardson: Create new puppet group analytics-search-users [puppet] - 10https://gerrit.wikimedia.org/r/265795 (https://phabricator.wikimedia.org/T122620) [19:44:05] mutante, sure. [19:44:27] we'll see how good our puppetization is and we'll fill in missing pieces / fix broken stuff after. [19:44:30] subbu: (are you blocked on me for CR?) [19:44:43] there are 2 ready to merge puppet patches in gerrit. [19:44:45] (03CR) 10jenkins-bot: [V: 04-1] Create new puppet group analytics-search-users [puppet] - 10https://gerrit.wikimedia.org/r/265795 (https://phabricator.wikimedia.org/T122620) (owner: 10EBernhardson) [19:45:21] PROBLEM - check_load on db1025 is CRITICAL: CRITICAL - load average: 36.20, 31.48, 18.70 [19:45:28] ori, not right away .. as long as it is done around the time ruthenium comes back online with data. should be fine. 
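A back-of-envelope on the vhtcpd numbers quoted above (~176 purges/s in, ~17/s out, 1GB queue cap): the ~500 bytes per queued entry used here is purely an assumed figure for illustration, not something measured from vhtcpd.

```shell
# Rough backlog arithmetic from the rates quoted upthread.
# 500 bytes/queued purge is an assumption, not a measured figure.
awk 'BEGIN {
  net = 176 - 17                      # net purges/s accumulating in the queue
  sz  = 500                           # assumed bytes per queued entry
  cap = 1024 * 1024 * 1024            # the 1GB queue limit mentioned above
  printf "backlog growth: ~%.0f MB/hour\n", net * sz * 3600 / (1024 * 1024)
  printf "1GB cap filled in: ~%.1f hours\n", cap / (net * sz * 3600.0)
}'
```

At those rates the queue never drains, which is why bumping the cap from 256MB to 1GB only buys a few extra hours before another wipe.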
[19:46:01] (03CR) 10EBernhardson: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/265795 (https://phabricator.wikimedia.org/T122620) (owner: 10EBernhardson) [19:46:11] schedules downtime for ruthenium and services [19:46:11] I think that is FR [19:46:44] ori: right now that daemon is running pretty steady at ~168/s in and ~11/s out [19:46:48] bblack: it's spending 99% of its time in assert_queue_sane [19:47:00] sounds logical [19:47:11] we could rebuild without the assertions, since they've never failed once in practice heh [19:47:16] yeah [19:47:17] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Create new puppet group `analytics-search-users` - https://phabricator.wikimedia.org/T122620#1956756 (10Ottomata) +1 for analytics-search-users [19:47:34] I thought we didn't have the profile stuff on these hosts? [19:47:45] https://dpaste.de/eU41/raw [19:47:54] what profile stuff? [19:48:12] yeah I guess it's just the debug symbols for the kernel we might not have, I got confused and assumed perf was unavail :) [19:48:33] but maybe we fixed the debug sym thing with moritzm's kernels a while back too heh [19:49:47] ori: "perf" [19:49:57] nod [19:50:28] i did have to install debug symbols for vhtcpd (the vhtcpd-dbg package). sorry, i should have checked with you and !logged [19:50:36] got a bit eager [19:51:14] :) [19:51:21] anyways, I'll build a new one [19:51:22] (03CR) 10EBernhardson: "not sure why jenkins is failing this, it's failing in the python tests and no python code was changed here. 
Additionally the failure messa" [puppet] - 10https://gerrit.wikimedia.org/r/265795 (https://phabricator.wikimedia.org/T122620) (owner: 10EBernhardson) [19:51:27] without the assertions on heh [19:51:37] but we can't restart without losing the current in-memory backlog either :/ [19:52:58] (03CR) 10Ottomata: Removing code that generates pageviews using legacy definition (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/265656 (https://phabricator.wikimedia.org/T124244) (owner: 10Nuria) [19:53:16] 6operations: labtestservices2001.wikimedia.org.crt - https://phabricator.wikimedia.org/T124374#1956778 (10RobH) So anyone can just self-sign a certificate, but I think we actually use our own internal CA. The simple answer is: openssl x509 -req -days 365 -in domain.csr -signkey domain.key -out domain.pem BUT th... [19:53:31] (03PS3) 10Ottomata: Stop syncing CirrusSearchRequests from fluorine to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/265773 (owner: 10EBernhardson) [19:53:48] (03CR) 10Ottomata: [C: 032 V: 032] Stop syncing CirrusSearchRequests from fluorine to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/265773 (owner: 10EBernhardson) [19:55:07] so sad, vhtcpd hasn't had a bugfix or even a rebuild since Sept 2013 [19:55:22] (03PS2) 10EBernhardson: Create new puppet group analytics-search-users [puppet] - 10https://gerrit.wikimedia.org/r/265795 (https://phabricator.wikimedia.org/T122620) [19:55:59] well I guess it was rebuilt, but not updated in source terms [20:04:31] !log ruthenium - rebooting for reinstall [20:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:06:16] 6operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#1956829 (10Ottomata) Had a meeting with @Joe and @paravoid (and others yesterday), and we decided to move forward with this and another Kafka relate procurement requ... 
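The "rebuild without the assertions" fix discussed above is, in vhtcpd's C, a matter of wrapping the expensive checks in `#ifndef NDEBUG`. A hedged Python analogue of the same pattern (the `PurgeQueue` class here is a hypothetical stand-in, not vhtcpd's actual data structure): code guarded by `if __debug__:` is skipped entirely under `python -O`, much as `assert()` compiles out when NDEBUG is defined.

```python
from collections import deque

class PurgeQueue:
    """Toy purge queue with an O(n) sanity check, to show why a
    per-operation invariant walk dominates once the backlog grows."""

    def __init__(self):
        self.q = deque()
        self.bytes = 0

    def _assert_sane(self):
        # Walks every queued purge -- cheap on a small queue, ruinous
        # at a multi-gigabyte backlog (cf. 99% time in assert_queue_sane).
        assert self.bytes == sum(len(p) for p in self.q)

    def push(self, purge: bytes):
        self.q.append(purge)
        self.bytes += len(purge)
        if __debug__:            # stripped under `python -O`, like NDEBUG
            self._assert_sane()

    def pop(self) -> bytes:
        p = self.q.popleft()
        self.bytes -= len(p)
        if __debug__:
            self._assert_sane()
        return p

pq = PurgeQueue()
pq.push(b"/wiki/Foo")
pq.push(b"/wiki/Bar")
print(pq.pop())  # b'/wiki/Foo'
```

As in the C case, disabling the check trades a safety net that "never failed once in practice" for not paying O(n) on every queue operation.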
[20:06:32] (03PS1) 10BBlack: wrap complex assert funcs in NDEBUG checks [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/265806 [20:06:34] (03PS1) 10BBlack: 0.0.11 release stuff [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/265807 [20:08:15] 6operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#1956834 (10Ottomata) [20:08:18] 6operations, 10ops-codfw, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1956835 (10Ottomata) [20:08:22] 6operations, 10Analytics-Cluster, 10EventBus, 6Services: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. - https://phabricator.wikimedia.org/T123954#1956833 (10Ottomata) [20:13:28] 6operations, 10Analytics-Cluster, 10EventBus, 6Services: 4 more Kafka brokers, 2 in eqiad and 2 codfw - https://phabricator.wikimedia.org/T124469#1956846 (10Ottomata) 3NEW a:3Ottomata [20:13:42] (03PS1) 10BBlack: Merge branch 'master' into debian [software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/265810 [20:13:43] (03PS1) 10BBlack: vhtcpd (0.0.11-1) unstable; urgency=low [software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/265811 [20:13:44] 6operations, 10EventBus, 6Services, 10hardware-requests: 4 more Kafka brokers, 2 in eqiad and 2 codfw - https://phabricator.wikimedia.org/T124469#1956858 (10Ottomata) [20:13:46] (03CR) 10Ori.livneh: [C: 031] wrap complex assert funcs in NDEBUG checks [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/265806 (owner: 10BBlack) [20:15:10] RECOVERY - check_load on db1025 is OK: OK - load average: 0.17, 0.44, 4.04 [20:27:02] 6operations: Reinstall caesium with jessie (and convert to VM) - https://phabricator.wikimedia.org/T123714#1956923 (10Dzahn) p:5Triage>3Normal [20:27:12] 6operations: Reinstall magnesium with jessie - 
https://phabricator.wikimedia.org/T123713#1956924 (10Dzahn) p:5Triage>3Normal [20:27:51] 6operations, 5Patch-For-Review: reinstall bast4001 with jessie - https://phabricator.wikimedia.org/T123674#1956926 (10Dzahn) p:5Triage>3Normal @Krenair yea, i asked about it and Mark and Rob told me it works just very slowly. [20:28:48] 6operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#1956928 (10Dzahn) @JKrauska any update on zendesk ticket #9727 ? or could you add me to CC of that one please? [20:29:25] 6operations, 7domains: traffic stats for typo domains - https://phabricator.wikimedia.org/T124237#1956930 (10Dzahn) p:5Triage>3Low [20:29:34] 6operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#1956931 (10Dzahn) p:5Triage>3High [20:37:17] (03PS1) 10Andrew Bogott: Moved designate domain IDs into hiera [puppet] - 10https://gerrit.wikimedia.org/r/265822 [20:37:50] 6operations, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1956940 (10Dzahn) [20:41:51] RECOVERY - DPKG on ruthenium is OK: All packages OK [20:41:56] subbu: ^ [20:42:03] subbu: it's back with jessie and puppet ran [20:42:17] we are seeing the expected issue with upstart but it finishes [20:42:21] PROBLEM - salt-minion processes on tin is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [20:42:56] mutante, ok, thanks. [20:44:51] ii nodejs 4.2.4~dfsg-1~bpo8+1 [20:48:20] 6operations, 6Parsing-Team, 10Parsoid, 6Services, 5Patch-For-Review: Update ruthenium to Debian jessie from Ubuntu 12.04 - https://phabricator.wikimedia.org/T122328#1956971 (10Dzahn) - the whole /mnt/data has been copied to server osmium. ``` root@ruthenium:/mnt/data# rsync -avz /mnt/data/ rsync://osmi... 
[20:53:20] 6operations, 7domains: traffic stats for typo domains - https://phabricator.wikimedia.org/T124237#1956988 (10BBlack) Those seem amazingly low relative to our overall traffic rates... they might be candidates for parking, IMHO. I suspect in general typos are less-common than they used to be, because most peopl... [20:54:20] PROBLEM - salt-minion processes on mira is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [20:54:35] (03CR) 10Andrew Bogott: [C: 032] Moved designate domain IDs into hiera [puppet] - 10https://gerrit.wikimedia.org/r/265822 (owner: 10Andrew Bogott) [20:55:24] (03CR) 10BBlack: [C: 032 V: 032] wrap complex assert funcs in NDEBUG checks [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/265806 (owner: 10BBlack) [20:55:35] (03CR) 10BBlack: [C: 032 V: 032] 0.0.11 release stuff [software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/265807 (owner: 10BBlack) [20:55:53] (03CR) 10BBlack: [C: 032 V: 032] Merge branch 'master' into debian [software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/265810 (owner: 10BBlack) [20:57:59] (03PS2) 10BBlack: vhtcpd (0.0.11-1) unstable; urgency=low [software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/265811 [20:58:01] (03PS1) 10BBlack: remove no-create-orig from gbp.conf [software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/265828 [20:58:11] (03CR) 10BBlack: [C: 032 V: 032] remove no-create-orig from gbp.conf [software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/265828 (owner: 10BBlack) [20:58:24] (03CR) 10BBlack: [C: 032 V: 032] vhtcpd (0.0.11-1) unstable; urgency=low [software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/265811 (owner: 10BBlack) [20:58:49] 6operations, 10ops-codfw, 5Patch-For-Review: mw2173 has probably a broken disk, needs substitution and reimaging - https://phabricator.wikimedia.org/T124408#1957012 (10Papaul) a:3Papaul [20:59:51] (03CR) 10Merlijn van 
Deen: [C: 031] "Sorry for letting this hang for so long. Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/241582 (https://phabricator.wikimedia.org/T109485) (owner: 10Tim Landscheidt) [21:02:23] (03CR) 10Merlijn van Deen: [C: 031] dynamicproxy: Use lua-json package instead of liblua5.1-json [puppet] - 10https://gerrit.wikimedia.org/r/263230 (owner: 10Tim Landscheidt) [21:03:22] (03CR) 10Merlijn van Deen: [C: 031] apt: Remove extra space in sources.list [puppet] - 10https://gerrit.wikimedia.org/r/263380 (owner: 10Tim Landscheidt) [21:05:19] (03CR) 10Merlijn van Deen: [C: 031] toollabs: Remove unneeded inheritance in checker role [puppet] - 10https://gerrit.wikimedia.org/r/265197 (owner: 10Yuvipanda) [21:05:59] (03PS1) 10BBlack: add notes to remember how to build this thing [software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/265831 [21:06:01] 6operations, 6Parsing-Team, 10Parsoid, 6Services, 5Patch-For-Review: Update ruthenium to Debian jessie from Ubuntu 12.04 - https://phabricator.wikimedia.org/T122328#1957042 (10ssastry) Is it possible to get sudo access as well? 
[21:06:24] (03CR) 10BBlack: [C: 032 V: 032] add notes to remember how to build this thing [software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/265831 (owner: 10BBlack) [21:06:41] (03PS1) 10Andrew Bogott: Revert "Moved designate domain IDs into hiera" [puppet] - 10https://gerrit.wikimedia.org/r/265832 [21:07:01] (03CR) 10Merlijn van Deen: [C: 031] toollabs: Point shadow to correct master host [puppet] - 10https://gerrit.wikimedia.org/r/265199 (owner: 10Yuvipanda) [21:07:18] (03CR) 10Merlijn van Deen: [C: 031] toollabs: Remove inheritance from bastion role [puppet] - 10https://gerrit.wikimedia.org/r/265203 (owner: 10Yuvipanda) [21:07:46] (03CR) 10Merlijn van Deen: [C: 031] toollabs: Remove inheritance in role from compute [puppet] - 10https://gerrit.wikimedia.org/r/265204 (owner: 10Yuvipanda) [21:07:59] (03CR) 10Merlijn van Deen: [C: 031] toollabs: Remove inheritance from mailrelay [puppet] - 10https://gerrit.wikimedia.org/r/265205 (owner: 10Yuvipanda) [21:08:11] 6operations, 6Parsing-Team, 10Parsoid, 6Services, 5Patch-For-Review: Update ruthenium to Debian jessie from Ubuntu 12.04 - https://phabricator.wikimedia.org/T122328#1957056 (10Dzahn) You should have the same access as before. I didn't change anything about the access group. They should be applied just li... [21:08:20] (03CR) 10Merlijn van Deen: [C: 031] tools: Remove role inheritance from static hosts [puppet] - 10https://gerrit.wikimedia.org/r/265202 (owner: 10Yuvipanda) [21:08:37] RECOVERY - salt-minion processes on tin is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:08:40] (03CR) 10Merlijn van Deen: [C: 031] toollabs: Remove inheritance from gridengine master role [puppet] - 10https://gerrit.wikimedia.org/r/265207 (owner: 10Yuvipanda) [21:09:56] (03CR) 10Merlijn van Deen: [C: 04-1] "can we make class gridengine::shadow_master just use that hiera parameter directly?" 
[puppet] - 10https://gerrit.wikimedia.org/r/265199 (owner: 10Yuvipanda) [21:09:57] 6operations, 6Parsing-Team, 10Parsoid, 6Services, 5Patch-For-Review: Update ruthenium to Debian jessie from Ubuntu 12.04 - https://phabricator.wikimedia.org/T122328#1957060 (10Dzahn) The mount point /mnt/data that had all the data we copied was not fully puppetized. I did: lvrename /dev/ruthenium-vg/_p... [21:10:34] (03CR) 10Andrew Bogott: [C: 032] Revert "Moved designate domain IDs into hiera" [puppet] - 10https://gerrit.wikimedia.org/r/265832 (owner: 10Andrew Bogott) [21:10:53] (03CR) 10Merlijn van Deen: [C: 031] toollabs: Move bigbrother to services nodes [puppet] - 10https://gerrit.wikimedia.org/r/265193 (https://phabricator.wikimedia.org/T123873) (owner: 10Yuvipanda) [21:11:28] 6operations, 10ops-codfw: note/label the allocated ulsfo-eqidfw xconnects that aren't in active use (two of them) - https://phabricator.wikimedia.org/T124069#1957066 (10Papaul) 5Open>3Resolved Complete [21:11:38] mutante, i used to be able to sudo before on ruthenium but not right now. 
[21:11:46] (03CR) 10Merlijn van Deen: [C: 031] toollabs: Remove inheritance from services role [puppet] - 10https://gerrit.wikimedia.org/r/265198 (owner: 10Yuvipanda) [21:12:11] it asks me for a password [21:12:30] (03CR) 10Merlijn van Deen: [C: 031] toollabs: Move toolwatcher to services [puppet] - 10https://gerrit.wikimedia.org/r/265206 (https://phabricator.wikimedia.org/T123873) (owner: 10Yuvipanda) [21:12:32] subbu: you have the sudo commands to control the parsoid service, but not ALL ALL [21:12:40] it's the admin group called parsoid-admins [21:12:49] (03CR) 10Merlijn van Deen: [C: 031] toollabs: Remove role inheritance from gridengine shadow [puppet] - 10https://gerrit.wikimedia.org/r/265208 (owner: 10Yuvipanda) [21:12:52] %parsoid-admin ALL = NOPASSWD: /usr/sbin/service parsoid * [21:12:52] %parsoid-admin ALL = NOPASSWD: /usr/sbin/service parsoid-rt-client restart [21:12:57] %parsoid-admin ALL = (parsoid-rt) NOPASSWD: /home/parsoid-rt/update-code.sh [21:15:01] bblack /cc gwicke: hello! i'm getting mostly just 404s using RESTBase from Colorado (mobile-lb.codfw.wikimedia.org). folks elsewhere say it's working for them no problem. is something going on with one of the servers? [21:15:03] ori: yeah with the NDEBUG stuff right, the new vhtcpd binary only spends about 13% on its own C code, in http_parser stuff and such. the kernel socket io stuff is dominant now. [21:15:06] there is another group, parsoid-roots, but that has only catrope in it currently [21:15:34] mutante, i see. i was trying to run a sudo npm install in /srv/testreduce to avoid having to check in the node modules .. [21:15:39] the groups are applied by hostname so they did not change by reinstall [21:15:46] niedzielski: looking... [21:16:18] bblack: thanks! let me know if i can provide more information.
i've tried several devices over here and they all come back with 404s most of the time [21:17:14] !log running migrateAccount.php --attachbroken over list of all unattached users (T74791) [21:17:17] mutante, i suppose we got sudo access at some point for ALL ALL that was not puppetized. in any case .. i think if ALL ALL is not recommended for ruthenium, then i'll try to check in the modules into git and have them be checked out by puppet. [21:17:18] subbu: i understand if you need more permissions but i would also avoid manual npm install if the goal is also to puppetize it [21:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:19:41] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1957100 (10Denniss) Happened again today: https://commons.wikimedia.org/wiki/File:KutlugAtaman.JPG Was overwritten, reverted b... [21:19:47] yes, we'll run into this npm install issue whenever we get a new server. [21:20:17] RECOVERY - salt-minion processes on mira is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:20:56] niedzielski: can you give a specific URL example? [21:20:59] subbu: if it was really unpuppetized root access that would be really bad. let's fix it with a real access request .. or even better if it is not needed because puppet can do it [21:21:19] or we can add needed commands to the parsoid-admins [21:21:28] mutante, understood. [21:21:31] bblack: https://en.m.wikipedia.org/api/rest_v1/page/mobile-sections-lead/Brazil [21:22:15] i am going to deal with this by checking in the node modules into the repo in a separate branch as ori had suggested earlier. if i run into other things, i'll check in again about it. 
[21:22:30] from the office, both https://en.m.wikipedia.org/api/rest_v1/page/mobile-sections-lead/Brazil and https://en.wikipedia.org/api/rest_v1/page/mobile-sections-lead/Brazil work, while from Colorado only the latter does [21:22:44] yeah that's because codfw is in a different state, it's in transition [21:22:53] just trying to figure out why that new state breaks restbase still... [21:22:58] (it doesn't break MW) [21:23:17] RB relies on the en.m.wikipedia.org -> en.wikipedia.org rewrite [21:23:22] yes, that still happens [21:23:28] (03PS2) 10Krinkle: Consistently use require_once for MWVersion.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263415 [21:23:36] bblack: https://phabricator.wikimedia.org/P2520 [21:23:38] (or else MW would be broken for mobile output too) [21:23:46] twentyafterfour: Do you want to merge/deploy https://gerrit.wikimedia.org/r/263415 at some point? [21:23:53] subbu: cool, thank you. let me know if you run into blockers [21:24:29] gwicke: ok, found it [21:24:46] it's not that RB relies on the actual Host: header rewrite [21:24:57] it's that we're still stuffing the Host: header pointlessly into the backend URL: [21:25:00] if (req.url ~ "^/api/rest_v1/") { [21:25:03] set req.url = "/" + req.http.host + regsub(req.url, "^/api/rest_v1/", "/v1/"); [21:25:09] and req.http.host is no longer correct at that point in time with the new changes... [21:25:24] will look into a VCL hack... [21:25:37] 6operations, 6Services: Package npm 2.14 - https://phabricator.wikimedia.org/T124474#1957157 (10mobrovac) 3NEW [21:26:23] bblack: so the new code only manipulates bereq, while the old canged req?
[21:26:29] *changed [21:26:34] yes [21:26:41] *nod* [21:26:45] and it only does it at the last moment, when it's about to pass/miss to a backend [21:27:27] for RB, it would be desirable to do this early, as the response is the same between them anyway [21:27:32] (03PS2) 10Rush: diamond: add stats to nfsd collection [puppet] - 10https://gerrit.wikimedia.org/r/265519 [21:27:47] but, it might be cleaner to just not use the .m. variant at all [21:28:12] Krinkle: ok [21:28:19] well the request comes in from the user as the .m. variant, that's the real Host: [21:28:37] we have to do a transformation to not have it be .m., which I'm going to push down to where the other mangling happens... [21:29:22] downside is that it'll needlessly fragment the cache [21:30:18] true :) [21:30:33] the whole X-Subdomain thing is a mess to begin with :/ [21:31:06] yeah [21:31:16] another consideration is purging [21:31:29] well [21:31:36] so far we didn't consider the need to purge .m. separately [21:31:36] purging is "correct" from the MW POV [21:31:47] and MW now purges .m. separately, too [21:31:51] but yeah [21:32:06] I can refactor everything so that RB still gets the early hostname rewrite and MW doesn't [21:32:21] (for now!) [21:33:20] just discussing with the mobile folks if they can stop referencing .m. for API requests altogether [21:41:01] (03PS1) 10Andrew Bogott: Moved designate domain IDs into hiera [puppet] - 10https://gerrit.wikimedia.org/r/265844 [21:41:09] gwicke: really, purging is already kinda messed up for RB in the general case [21:41:24] since we rewrite req.url only in the backends, but not the frontends, you have to purge both variants via HTCP [21:41:33] (the user-facing form of the URL and the backend-facing one) [21:42:35] bblack: yeah; I guess we should consider dropping support for .m.
altogether, so that we can stop worrying about it [21:42:44] for RB, that is [21:42:58] well it's still a caching issue [21:43:07] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1957241 (10greg) Anything else needed here? Or is this complete now? [21:43:42] how much is RB caching in varnish these days? [21:44:06] all the new end points are cached in Varnish [21:44:07] (should I be worried if VCL changes effectively-invalidate it for the users mapped to codfw?) [21:44:14] performance-worried I mean [21:44:17] including the pageview api, summary, mobile apps etc [21:45:18] (03PS1) 10Rush: diamond: monitor nscd behavior for ldap clients [puppet] - 10https://gerrit.wikimedia.org/r/265847 [21:45:20] I would prefer if caching worked in codfw, as making those responses cacheable does make a significant performance difference [21:45:41] several of those end points are not stored in RB otherwise, and only rely on Varnish [21:45:46] sorry I didn't mean "make them uncacheable" [21:45:59] I just meant, if my changes effectively cause a one-shot loss of current cache contents [21:46:20] but come to think of it, that doesn't matter either if they're all currently 404ing [21:46:23] oh, that wouldn't be the end of the world [21:46:24] there are no cache contents to lose [21:46:32] as long as things continue to be cached going forward [21:46:40] (03PS2) 10Andrew Bogott: Moved designate domain IDs into hiera [puppet] - 10https://gerrit.wikimedia.org/r/265844 [21:46:57] (03CR) 1020after4: [C: 032] Consistently use require_once for MWVersion.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263415 (owner: 10Krinkle) [21:47:01] bblack: at this point, it's about performance & not so much about falling over [21:47:22] (03Merged) 10jenkins-bot: Consistently use require_once for MWVersion.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263415 (owner: 10Krinkle) 
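The VCL snippet bblack quoted, and the two purge variants he mentions, can be mirrored in a short Python sketch. This is illustrative only: the function names are invented, and the purge list below shows the two URL paths a purge has to cover, not the actual HTCP wire format.

```python
import re

def restbase_backend_url(host: str, url: str) -> str:
    # Mirrors the quoted VCL: /api/rest_v1/... is rewritten to
    # /<Host header>/v1/..., so the Host value gets baked into the
    # backend request path (and thus into the backend cache key).
    if re.match(r"^/api/rest_v1/", url):
        return "/" + host + re.sub(r"^/api/rest_v1/", "/v1/", url)
    return url

public = "/api/rest_v1/page/mobile-sections-lead/Brazil"

# With the mobile Host still intact at rewrite time, RB is asked about a
# domain it doesn't serve -- hence the 404s reported from codfw:
print(restbase_backend_url("en.m.wikipedia.org", public))
# /en.m.wikipedia.org/v1/page/mobile-sections-lead/Brazil

# And since frontends cache on the user-facing URL while backends cache on
# the rewritten one, a purge has to name both variants:
def purge_variants(host: str, url: str):
    return [url, restbase_backend_url(host, url)]

print(purge_variants("en.wikipedia.org", public))
```

This is also why normalizing the hostname early (or dropping .m. for API calls entirely, as discussed below) simplifies both the cache keys and the purge story.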
[21:47:42] Krinkle: deploying https://gerrit.wikimedia.org/r/#/c/263415/ [21:48:10] !log anomie@tin Synchronized php-1.27.0-wmf.11/includes/: Fix T124468 (duration: 00m 38s) [21:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:49:36] Sigh. Forgot to 'git pull' before syncing again... [21:49:52] !log anomie@tin Synchronized php-1.27.0-wmf.11/includes/: Fix T124468, for real this time (duration: 00m 36s) [21:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:50:00] (03CR) 10Andrew Bogott: [C: 032] Moved designate domain IDs into hiera [puppet] - 10https://gerrit.wikimedia.org/r/265844 (owner: 10Andrew Bogott) [21:50:48] gwicke: also, we have mobile cache-frag anyways. even if we normalize your url component with /desktop_domain/, it's still cache-differentiating on the mobile hostname in the Host: header... [21:51:20] (03PS1) 10Dzahn: ruthenium: switch rsyncd setup over from osmium [puppet] - 10https://gerrit.wikimedia.org/r/265849 (https://phabricator.wikimedia.org/T122328) [21:51:29] I think I can fix that too though [21:52:14] (03PS2) 10Dzahn: ruthenium: switch rsyncd setup over from osmium [puppet] - 10https://gerrit.wikimedia.org/r/265849 (https://phabricator.wikimedia.org/T122328) [21:52:26] (03CR) 10Dzahn: [C: 032] ruthenium: switch rsyncd setup over from osmium [puppet] - 10https://gerrit.wikimedia.org/r/265849 (https://phabricator.wikimedia.org/T122328) (owner: 10Dzahn) [21:53:42] bblack: I wonder if we should instead let RB requests for .m. domains fail hard for everybody; the mobile team is looking into replacing references to .m. with the main project domain [21:54:02] (03PS1) 10Hashar: beta: update hostname to have .deployment-prep. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265850 [21:54:24] there should be no existing external users relying on .m. working, so it's a good opportunity to drop it [21:54:27] grr... 
really wish I could figure out why sync-dir is sooo slow [21:59:12] gwicke: except that currently, if a client is mixing requests to .m. and desktop domains, they don't SPDY-coalesce [21:59:23] that may get fixed down the road, possibly soon, but it's not a guarantee yet [21:59:58] 6operations, 6Parsing-Team, 10Parsoid, 6Services, 5Patch-For-Review: Update ruthenium to Debian jessie from Ubuntu 12.04 - https://phabricator.wikimedia.org/T122328#1957349 (10Dzahn) data is being copied back from osmium to /mnt/data/ now... and will take a while i'll close this once that is done as well [22:00:34] (03PS1) 10BBlack: Restbase mobile-via-text fixup 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/265851 [22:00:36] (03PS1) 10BBlack: Restbase mobile-via-text fixup 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/265852 [22:05:42] 6operations, 6Parsing-Team, 10Parsoid, 6Services, 5Patch-For-Review: Update ruthenium to Debian jessie from Ubuntu 12.04 - https://phabricator.wikimedia.org/T122328#1957361 (10mobrovac) >>! In T122328#1957349, @Dzahn wrote: > i'll close this once that is done as well I think we should consider this a vi... [22:06:09] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1957364 (10BBlack) @Denniss - problems today are unrelated, they're from general random purge loss due to: T124418 [22:06:46] !log upgrading vhtcpd on all caches [22:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:07:13] 6operations, 10ops-codfw, 5Patch-For-Review: mw2039 fails to reboot, mgmt interface unreachable - https://phabricator.wikimedia.org/T124282#1957370 (10Papaul) @Joe i found a 4GB kit memory onsite that came from Tampa. I replaced the bad memory and the system is back up with a total of 12 GB like before. Alt... [22:07:27] wtf ...
[22:07:38] 22:02:40 sync-dir failed: /srv/mediawiki-staging/./php-1.27.0-wmf.1/vendor/oyejorge/less.php/lib/Less/Version.php has content before opening RECOVERY - Host mw2039 is UP: PING OK - Packet loss = 0%, RTA = 36.46 ms [22:08:08] wmf.1 ? [22:10:55] apparently wmf.1 is still checked out on tin... [22:11:05] 6operations, 10ops-codfw: mw2087 fails to reboot, mgmt interface unreachable - https://phabricator.wikimedia.org/T124299#1957398 (10Papaul) 5Open>3Resolved [22:11:15] were you running sync-dir . twentyafterfour? [22:11:33] 6operations, 10vm-requests: request VM for releases.wm.org - https://phabricator.wikimedia.org/T124261#1957401 (10Dzahn) @akosiaris I just wanted to avoid putting multiple "misc" sites/apps on the same server with the official releases. It would mean if one of them has an (security) issue the others might be a... [22:11:37] Krenair: yeah [22:11:53] legoktm: yeah, wmf.1 needs to go [22:12:02] but the whole thing is weird [22:12:04] maybe get rid of everything up to.... .8? .9? [22:12:16] hey, if we switch to long lived branches, we won't have to automate the removal of old branches any more! [22:13:31] (03PS6) 10Subramanya Sastry: Migrate parsoid::role::testing service from upstart to systemd [puppet] - 10https://gerrit.wikimedia.org/r/265628 [22:13:33] (03PS14) 10Subramanya Sastry: Add the visualdiff module + instantiate visualdiffing services [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) [22:13:35] (03PS1) 10Subramanya Sastry: Clone the 'ruthenium' branch of testreduce [puppet] - 10https://gerrit.wikimedia.org/r/265856 [22:13:59] PROBLEM - puppet last run on mw2039 is CRITICAL: CRITICAL: puppet fail [22:14:01] Krenair: yeah that's what I'm doing...
I will open a task to track down all the elder appservers that need upgrade and we can schedule the upgrade. Thanks. [22:15:25] (03CR) 10Subramanya Sastry: "While this strategy works for the testreduce repo, it won't work for the visualdiff repo because it has a binary dependency on canvas whic" [puppet] - 10https://gerrit.wikimedia.org/r/265856 (owner: 10Subramanya Sastry) [22:15:43] ok I'm going to run the full scap [22:16:06] (03CR) 10Subramanya Sastry: "just a rebase (+ conflict resolved)." [puppet] - 10https://gerrit.wikimedia.org/r/265628 (owner: 10Subramanya Sastry) [22:16:22] (03CR) 10Subramanya Sastry: "rebased" [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) (owner: 10Subramanya Sastry) [22:16:35] !log twentyafterfour@tin Started scap: deploy https://gerrit.wikimedia.org/r/#/c/263415/ and clean up old branches [22:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:17:07] 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 10Traffic: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1957410 (10BBlack) p:5Unbreak!>3High @ori cut the rate down a bit with: https://gerrit.wikimedia.org/r/26... [22:17:19] 6operations, 6Parsing-Team, 10Parsoid, 6Services, 5Patch-For-Review: Update ruthenium to Debian jessie from Ubuntu 12.04 - https://phabricator.wikimedia.org/T122328#1957413 (10Dzahn) i understand those are related, for purposes of tracking the remaining Ubuntu systems (blocker parent task) this is resolv... [22:17:40] (03CR) 10BBlack: [C: 032] Restbase mobile-via-text fixup 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/265851 (owner: 10BBlack) [22:18:20] we still have permissions issues? 
"cannot delete non-empty directory: php-1.27.0-wmf.1"
[22:18:37] operations, Parsing-Team, Parsoid, Services, Patch-For-Review: Update ruthenium to Debian jessie from Ubuntu 12.04 - https://phabricator.wikimedia.org/T122328#1957414 (ssastry) There is {T118778} for that.
[22:18:52] operations, Parsing-Team, Parsoid, Services, Patch-For-Review: parsoid roles: convert upstart to systemd - https://phabricator.wikimedia.org/T124480#1957415 (Dzahn) NEW a: Dzahn
[22:19:03] operations, Parsing-Team, Parsoid, Services, Patch-For-Review: parsoid roles: convert upstart to systemd - https://phabricator.wikimedia.org/T124480#1957415 (Dzahn) a: Dzahn → None
[22:19:54] operations, Parsing-Team, Parsoid, Services, Patch-For-Review: Update ruthenium to Debian jessie from Ubuntu 12.04 - https://phabricator.wikimedia.org/T122328#1957430 (Dzahn) ah, thanks! i just made T124480 but that's a duplicate then
[22:20:31] operations, Parsing-Team, Parsoid, Services, Patch-For-Review: parsoid roles: convert upstart to systemd - https://phabricator.wikimedia.org/T124480#1957415 (Dzahn)
[22:20:37] RECOVERY - puppet last run on mw2039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:21:02] operations, Parsing-Team, hardware-requests: Dedicated server for running Parsoid's roundtrip tests to get reliable parse latencies and use as perf. benchmarking tests - https://phabricator.wikimedia.org/T116090#1957442 (ssastry) declined → Open Puppetization is more or less done and ruthenium has...
[22:21:57] operations, Parsing-Team, Parsoid, Services, Patch-For-Review: parsoid roles: convert upstart to systemd - https://phabricator.wikimedia.org/T124480#1957449 (Dzahn) https://gerrit.wikimedia.org/r/#/c/265628/ https://gerrit.wikimedia.org/r/#/c/264032/
[22:22:14] operations, Parsing-Team, Parsoid, Services: parsoid roles: convert upstart to systemd - https://phabricator.wikimedia.org/T124480#1957450 (Dzahn)
[22:22:30] gwicke: niedzielski: 404s should be fixed now
[22:22:48] part 2/2 is just cleanup to do after to reduce code duplication
[22:23:04] bblack: woo! we're trying to zap the m's for a release today
[22:23:28] niedzielski: I'm not sure that's wise right now
[22:23:37] !log twentyafterfour@tin Finished scap: deploy https://gerrit.wikimedia.org/r/#/c/263415/ and clean up old branches (duration: 07m 02s)
[22:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:23:44] 21:59 < bblack> gwicke: except that currently, if a client is mixing requests to .m. and desktop domains, they don't SPDY-coalesce
[22:23:47] 21:59 < bblack> that may get fixed down the road, possibly soon, but it's not a guarantee yet
[22:23:48] PROBLEM - puppet last run on mw2142 is CRITICAL: CRITICAL: Puppet has 1 failures
[22:24:12] basically, if you're already loading other things from .m., making some requests to non-mdot is going to hurt perf vs sticking with .m.
[22:24:23] (at least, currently. we might get that fixed eventually)
[22:24:31] dbrant mdholloway gwicke bblack: ^^ ok, so maybe we don't want that -m patch today then?
[22:26:37] bblack: any idea what the timeline would be? this release would only go out to our beta channel and the typical rollout to all users takes days+ (upgrades are optional)
[22:26:44] bblack: are you referring to mixing mdot and non-mdot in the same application?
[22:26:48] niedzielski: I'll make a task for that stuff so you have something to block/monitor. right now it's just a maybe-bullet-point at the bottom of a bigger task
[22:27:01] dbrant: yeah same client browser/app/device
[22:27:31] dbrant: niedzielski: afaik, you already do mix those reqs in the app, right?
[22:27:44] i.e. you have some calls that skip the mdot
[22:28:16] once we switch over, all requests will be to non-mdot, so we should be good...
[22:28:20] (PS3) Dzahn: deactivate wikiepdia.[com|org] [dns] - https://gerrit.wikimedia.org/r/254049
[22:29:23] operations, Traffic, Zero: Use Text IP for Mobile hostnames to gain SPDY/H2 coalesce between the two - https://phabricator.wikimedia.org/T124482#1957475 (BBlack) NEW
[22:29:34] dbrant mobrovac mdholloway: i believe that is correct. we have some code that uses the desktop, others use the mobile. i do not know why
[22:29:39] operations, Traffic, Zero: Use Text IP for Mobile hostnames to gain SPDY/H2 coalesce between the two - https://phabricator.wikimedia.org/T124482#1957485 (BBlack)
[22:29:41] operations, Traffic, Zero, Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1957486 (BBlack)
[22:30:13] dbrant: as long as *every* request uses one or the other you're fine
[22:30:44] but if the agent/device/browser/app/whatever does even one little fetch of some initial file, or an image, or a js fragment, or whatever that crosses the boundary, it's going to be slower than staying on one side completely.
[22:30:59] anyways, T124482 now tracks whether/if we can eventually fix that
[22:31:47] dbrant bblack mdholloway: i think mobrovac is right that we already have this problem
[22:32:17] I'm sure we do, if nothing else for authenticated users all logins use login.wm.o and not login.m.wm.o (which I think is unused, and realistically should probably stay that way at this point)
[22:33:04] yep. if anything, this will make things more consistent.
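[Editor's note] The coalescing constraint bblack describes above can be sketched roughly as follows: an HTTP/2 (or SPDY) client may reuse an existing connection for a second hostname only if that hostname resolves to the same server IP and the presented certificate also covers it. The IPs and SAN entries below are made up for illustration, not real production values.

```python
# Illustrative sketch of HTTP/2-style connection coalescing (RFC 7540 §9.1.1).
# Hypothetical DNS answers: the mobile hostnames historically pointed at a
# separate cache cluster, so the IPs differ (example addresses only).
RESOLVES_TO = {
    "en.wikipedia.org": "198.51.100.1",
    "en.m.wikipedia.org": "198.51.100.2",
}
# Hypothetical certificate SAN list covering both name families.
CERT_SANS = {"*.wikipedia.org", "*.m.wikipedia.org"}

def covered_by_san(host: str, sans: set) -> bool:
    """Tiny wildcard match: '*.example.org' covers exactly one extra label."""
    for san in sans:
        if san == host:
            return True
        if san.startswith("*."):
            base = san[2:]
            prefix, _, rest = host.partition(".")
            if rest == base and prefix and "." not in prefix:
                return True
    return False

def can_coalesce(host_a: str, host_b: str) -> bool:
    """A client may share one connection only if both conditions hold."""
    same_ip = RESOLVES_TO[host_a] == RESOLVES_TO[host_b]
    both_covered = covered_by_san(host_a, CERT_SANS) and covered_by_san(host_b, CERT_SANS)
    return same_ip and both_covered

# With distinct IPs the mdot and desktop domains cannot share a connection,
# which is the perf penalty for clients that mix the two.
print(can_coalesce("en.wikipedia.org", "en.m.wikipedia.org"))  # False
```

This is why T124482 proposes moving the mobile hostnames onto the Text IP: the cert already covers both name families, so making the IPs match is the missing half of the coalescing condition.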
[22:33:05] operations, Traffic, Zero, Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1957516 (BBlack)
[22:33:50] !log mobileapps deployed 2900faa
[22:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:34:43] (PS3) Dzahn: installserver: rename classes with dash characters [puppet] - https://gerrit.wikimedia.org/r/260193
[22:36:51] (PS4) Dzahn: installserver: rename classes with dash characters [puppet] - https://gerrit.wikimedia.org/r/260193
[22:37:19] (CR) Mobrovac: "Doing npm install in Labs or locally in a Jessie container or VM might be a better solution, given that this is a step that will likely ne" [puppet] - https://gerrit.wikimedia.org/r/265856 (owner: Subramanya Sastry)
[22:38:04] niedzielski: mobrovac: dbrant: agreed, consistently using the main project domain (without the .m.) seems like a step in the right direction
[22:40:00] (PS1) Subramanya Sastry: nginx conf that routes requests to different services on ruthenium [puppet] - https://gerrit.wikimedia.org/r/265863
[22:40:54] (PS5) Dzahn: installserver: rename classes with dash characters [puppet] - https://gerrit.wikimedia.org/r/260193
[22:41:50] greg-g: https://phabricator.wikimedia.org/T124252#1957550
[22:42:24] sorry, really have to run now
[22:45:08] (CR) Dzahn: [C: +2] installserver: rename classes with dash characters [puppet] - https://gerrit.wikimedia.org/r/260193 (owner: Dzahn)
[22:46:06] (PS2) BBlack: Restbase mobile-via-text fixup 2/2 [puppet] - https://gerrit.wikimedia.org/r/265852
[22:46:36] (CR) BBlack: [C: +2 V: +2] Restbase mobile-via-text fixup 2/2 [puppet] - https://gerrit.wikimedia.org/r/265852 (owner: BBlack)
[22:47:09] bd808: anomie re tgr's comment, who can help push this over the finish line, as it were
[22:47:27] * bd808 looks
[22:47:30] mutante: my puppet-merge picked up your stuff on strontium, I think yours failed there
[22:47:38] PROBLEM - salt-minion processes on ruthenium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[22:48:12] bblack: oh, looking, it should not affect strontium and i was looking at the installservers right now
[22:48:20] only the motd changes
[22:48:25] I mean the merge itself
[22:48:29] ah
[22:48:36] thanks!
[22:48:47] puppet-merge only showed the confirm for mine, but then all this other spam on strontium heh
[22:49:25] i ran normal puppet-merge on palladium..hmm
[22:49:37] yeah
[22:49:43] it just failed on the part where it syncs to strontium
[22:49:48] should i still pull on strontium manually?
[22:49:51] ok
[22:50:27] PROBLEM - NTP on mw2039 is CRITICAL: NTP CRITICAL: Offset unknown
[22:50:37] RECOVERY - puppet last run on mw2142 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[22:51:58] (CR) Dzahn: "besides motd because of the role names, noop on carbon (and install2001, hooft, bastion4001)" [puppet] - https://gerrit.wikimedia.org/r/260193 (owner: Dzahn)
[22:52:57] now we have like 1 single class left that has a "-" in the name, but it's a bit scary to touch
[22:53:09] lvs/manifests/interface-tweaks.pp
[22:55:12] oh right that's in my queue I think
[22:56:26] worst case merging that should cause temporary puppetfails from races
[22:56:58] (CR) BBlack: [C: +1] lvs: rename interface-tweaks to interface_tweaks [puppet] - https://gerrit.wikimedia.org/r/260198 (owner: Dzahn)
[22:57:18] thank you! i'm doing this because in our global .puppet-lint.rc we turn off the checks for that
[22:57:27] yeah
[22:57:27] operations, Phabricator, Release-Engineering-Team, Traffic, Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1957632 (hashar) git clone works for me over v6 :-) There is still one comment that I dont think is formally addressed: >>! In T100519#171061...
[22:57:39] (PS3) Krinkle: [WIP] Implement /w/static.php [mediawiki-config] - https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096)
[23:00:19] operations, Phabricator, Release-Engineering-Team, Traffic, Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1957636 (BBlack) I don't really understand that quoted comment, but the ferm rules do have destination addresses that work at this time, and th...
[23:00:43] (PS2) Dzahn: lvs: rename interface-tweaks to interface_tweaks [puppet] - https://gerrit.wikimedia.org/r/260198
[23:01:23] (CR) Dzahn: [C: +2] lvs: rename interface-tweaks to interface_tweaks [puppet] - https://gerrit.wikimedia.org/r/260198 (owner: Dzahn)
[23:01:48] bd808: legoktm csteipp: given the need_token issue (see https://phabricator.wikimedia.org/T124252#1957336 ) and the fact that it is only getting worse, what is the drawback of rolling back at this point? I don't want people to be fire-fighting this all weekend
[23:02:19] PROBLEM - mediawiki-installation DSH group on mw2039 is CRITICAL: Host mw2039 is not in mediawiki-installation dsh group
[23:02:34] greg-g: do we know who is affected by need_token?
[23:03:12] legoktm: that'd be great to know. I don't know if this means we're negatively affecting article quality (because really useful bots are not working) or what
[23:03:38] gergo's latest patches look sane so we could try them, or we can roll back
[23:03:56] legoktm: you mentioned something yesterday about not rolling back because....?
[23:04:49] we just set a bunch of new cookies for people, what's going to happen when the code is expecting old cookies? I have no idea, I'm just afraid
[23:06:00] bd808: do you know what the level of impact is (who's affected, really) by this?
[23:06:25] nope. I honestly don't think anyone does or that we have a good way to find out
[23:06:54] I'm guessing it's mostly bots, because if it were humans, we'd have a lot more complaints
[23:07:03] I'd agree with that
[23:07:09] though apparently huggle is broken?
[23:07:18] https://phabricator.wikimedia.org/T124428
[23:07:57] man y'all are making me unhappy with this choice
[23:09:05] tgr's patches look pretty sane and simple
[23:09:36] bd808: are you fixing https://gerrit.wikimedia.org/r/#/c/265690/ or should I?
[23:09:40] ok, if you two can shepherd them through now, safely please ;), and let's see what we look like after
[23:09:51] legoktm: I've got it open
[23:10:06] if shit's still broken, or worse, we'll roll back
[23:10:20] he accidentally deleted a comma on L186
[23:10:30] greg-g: yeah, sounds good
[23:11:26] * legoktm relocates while rain has subsided
[23:14:16] (PS3) Rush: diamond: add stats to nfsd collection [puppet] - https://gerrit.wikimedia.org/r/265519
[23:15:42] FYI I'm running out for dinner soon, but I won't be far. call or text if you need me for something
[23:15:52] (CR) Rush: [C: +2] diamond: add stats to nfsd collection [puppet] - https://gerrit.wikimedia.org/r/265519 (owner: Rush)
[23:17:08] (PS4) Krinkle: [WIP] Implement /w/static.php [mediawiki-config] - https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096)
[23:17:16] (PS5) Krinkle: [WIP] Implement /w/static.php [mediawiki-config] - https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096)
[23:17:58] operations, Phabricator, Release-Engineering-Team, Traffic, Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1957814 (chasemp) Open → Resolved that comment is outdated
[23:22:10] (PS3) Dzahn: lvs: rename interface-tweaks to interface_tweaks [puppet] - https://gerrit.wikimedia.org/r/260198
[23:22:34] !log restbase cassandra truncating local_group_wiktionary_T_term_definition.data
[23:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:28:42] chasemp: yay (re phab ssh)
[23:28:58] wtf https://integration.wikimedia.org/ci/job/mediawiki-phpunit-php53/105/console
[23:29:32] (CR) Dzahn: "noop on lvs1001" [puppet] - https://gerrit.wikimedia.org/r/260198 (owner: Dzahn)
[23:30:47] legoktm, bd808: thx, I rushed through a merge conflict
[23:31:17] tgr: np. team work!
[23:31:36] greg: the need_token thing only happens in the API so no huge deal IMO
[23:31:59] well, lots of things use the api
[23:32:00] and so far we have only seen it from clients with broken cookie handling
[23:32:02] including apps etc
[23:32:07] * greg-g nods
[23:32:14] yeah, the android app got broken by this
[23:32:16] that part makes me feel better (blaming broken clients)
[23:32:24] only like 450M requests per day greg-g ;)
[23:32:25] :/
[23:32:36] bd808: yeah, that part then makes me unhappy again
[23:32:51] what I would be most worried about is accidental DOS by thrashing clients
[23:33:12] but there is probably a way to blacklist them
[23:33:25] (PS3) Dzahn: puppet-lint: rm exceptions for dashes in class names [puppet] - https://gerrit.wikimedia.org/r/260201 (https://phabricator.wikimedia.org/T93645)
[23:33:59] (PS4) Dzahn: puppet-lint: rm exceptions for dashes in class names [puppet] - https://gerrit.wikimedia.org/r/260201 (https://phabricator.wikimedia.org/T93645)
[23:34:20] (CR) jenkins-bot: [V: -1] puppet-lint: rm exceptions for dashes in class names [puppet] - https://gerrit.wikimedia.org/r/260201 (https://phabricator.wikimedia.org/T93645) (owner: Dzahn)
[23:34:37] given how gradually https://grafana.wikimedia.org/dashboard/db/authentication-metrics?panelId=13&fullscreen grew this is probably affecting a lot of clients
[23:35:00] (CR) jenkins-bot: [V: -1] puppet-lint: rm exceptions for dashes in class names [puppet] - https://gerrit.wikimedia.org/r/260201 (https://phabricator.wikimedia.org/T93645) (owner: Dzahn)
[23:35:27] heh, new dash classes appeared meanwhile
[23:35:30] force merged both because CI is busted
[23:35:31] ./manifests/role/parsoid-rt-client.pp:2 WARNING class name containing a dash (names_containing_dash)
[23:35:55] tgr: do both changes need to be deployed simultaneously?
[23:36:12] legoktm: was that a reply to me? not in this case, that is actually changing config of the lint check
[23:36:13] no
[23:36:35] mutante: no, I was talking about the cookie stuff sorry
[23:36:45] legoktm: I don't think the centralauth one will actually have an effect, bots probably can't use CA cookies
[23:36:54] legoktm: :) no worries, it could have fit
[23:37:07] tgr: I think pywikibot does :P
[23:37:23] pywikibot has correct cookie handling though
[23:37:38] indeed :D
[23:39:44] oh I forgot to cherry-pick
[23:39:45] I'm dumb
[23:40:22] greg-g: the obvious impact is about 10 bots
[23:40:30] * bd808 doesn't think legoktm is dumb
[23:40:49] it's possible there are other bots that are broken but not login-spamming
[23:41:25] we don't log cookies and post body together anywhere AFAIK so no easy way to tell
[23:42:50] syncing...
[23:42:51] !log legoktm@tin Synchronized php-1.27.0-wmf.11/includes/session/CookieSessionProvider.php: https://gerrit.wikimedia.org/r/#/c/265869/ (duration: 00m 26s)
[23:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:43:49] !log legoktm@tin Synchronized php-1.27.0-wmf.11/extensions/CentralAuth/includes/session/CentralAuthSessionProvider.php: https://gerrit.wikimedia.org/r/#/c/265870/ (duration: 00m 26s)
[23:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:44:39] (CR) Dzahn: "could the new file in ./files/misc also live inside a module? we'd like to get rid of the global files/misc where we can" [puppet] - https://gerrit.wikimedia.org/r/265628 (owner: Subramanya Sastry)
[23:45:15] the drops on https://grafana.wikimedia.org/dashboard/db/authentication-metrics are because the data isn't there yet, not an issue right?
[23:46:11] (CR) Dzahn: "ah, you probably did that because it has config in it" [puppet] - https://gerrit.wikimedia.org/r/265628 (owner: Subramanya Sastry)
[23:46:17] legoktm: yes
[23:46:27] also, don't expect a huge drop from the first patch
[23:46:56] legoktm: yeah, ignoring the last ~5m in graphite data is recommended for sanity preservation
[23:46:57] (PS2) Ottomata: [WIP] Refactor MirrorMaker puppetization [puppet/kafka] - https://gerrit.wikimedia.org/r/265789 (https://phabricator.wikimedia.org/T124077)
[23:47:01] okay
[23:47:22] anomie said most bots are broken in the other way (not sending cookies at all)
[23:47:51] :/
[23:49:30] legoktm: I've got https://gerrit.wikimedia.org/r/#/c/265871/ and https://gerrit.wikimedia.org/r/#/c/265872/ queued up once you think it's safe
[23:51:21] https://phabricator.wikimedia.org/T124453#1957967 sounds https://gerrit.wikimedia.org/r/#/c/265799/ related
[23:52:13] tgr: caused by or should've been fixed by?
[23:52:20] bd808: I think so...are you going to sync them out?
[23:52:23] ori, would you be able to review the puppet patches?
[23:52:23] caused by, I think
[23:52:39] legoktm: I can, sure
[23:52:45] ok, thanks
[23:52:56] I'm going to do some non-wiki things for a bit, but will be pingable
[23:53:22] legoktm: I'm just about to reach the airport, if you don't want to roll back, just disable the whole cookie check thing at login
[23:54:06] * legoktm points to bd808
[23:54:11] (PS1) Dzahn: parsoid-testing: rename classes with dashes [puppet] - https://gerrit.wikimedia.org/r/265873 (https://phabricator.wikimedia.org/T93645)
[23:54:40] I .. don't know how to do that (just disable the whole cookie check thing at login) off the top of my head
[23:54:50] * bd808 marches boldly forward
[23:55:54] (CR) Dzahn: "most are done, but now needs this too https://gerrit.wikimedia.org/r/#/c/265873/" [puppet] - https://gerrit.wikimedia.org/r/260201 (https://phabricator.wikimedia.org/T93645) (owner: Dzahn)
[23:55:59] (next stop actually)
[23:56:26] ok, more coherent version: there is a check on special:userlogin to show nice error messages if the user has cookies disabled
[23:56:38] we might want to kill that temporarily
[23:56:53] (Abandoned) Dzahn: add an url-downloader service in codfw [puppet] - https://gerrit.wikimedia.org/r/260770 (https://phabricator.wikimedia.org/T122134) (owner: Dzahn)
[23:57:13] tgr: ah. ok
[23:57:34] hasSessionCookie
[23:57:58] I'll look into it more if I find a spot at the airport
[23:59:26] (PS3) Dzahn: bastionhost: move roles to modules/role/ [puppet] - https://gerrit.wikimedia.org/r/260607
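[Editor's note] The "broken cookie handling" that tgr blames for the need_token loop comes down to clients not persisting and resending the session cookie between the token request and the login POST. A minimal sketch of the well-behaved side (what pywikibot does correctly), with a made-up cookie name for illustration:

```python
from http.cookies import SimpleCookie

# Sketch of correct client-side cookie handling: the token response sets a
# session cookie, and the follow-up login request must echo it back in a
# Cookie header, or the server cannot associate the token with the session
# and answers NeedToken again. The header value below is hypothetical.
def cookie_header_for_next_request(set_cookie_headers):
    """Parse Set-Cookie headers and build the Cookie header a well-behaved
    client would send on its next request."""
    jar = SimpleCookie()
    for header in set_cookie_headers:
        jar.load(header)
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())

# Hypothetical Set-Cookie from the token request:
headers = ["enwikiSession=abc123; path=/; HttpOnly; secure"]
print(cookie_header_for_next_request(headers))  # enwikiSession=abc123
```

A client that drops this state between requests logs in over and over, which is the login-spamming pattern visible on the authentication-metrics dashboard above.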