[00:00:05] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170831T0000). Please do the needful. [00:00:38] dbrant, what did you do when the app grew from 4 to 14 megs? [00:01:15] MaxSem: it's the Mapbox SDK :( [00:01:51] RainbowSprinkles: yep, that's working now. thanks again [00:01:53] yw [00:02:18] Yeah, it was moved to a dedicated VM and off of the shared bromine one it used with a few other micro sites [00:06:08] (03PS2) 10Awight: Give ores-admin users lsof, rather than the deployment user [puppet] - 10https://gerrit.wikimedia.org/r/374832 (https://phabricator.wikimedia.org/T174402) [00:08:24] so, the wikitext
<div>\nlorem ipsum\n</div> used to produce output html <div><p>\nlorem ipsum\n</p></div> (mind the <p>) and now it produces only <div>\nlorem ipsum\n</div>
. none of the changes listed in changelog for 1.30/wmf.16 seems to be relevant (at least via the summary, i've checked some which mentioned parsoid and didn't find anything relevant) [00:08:35] file a bug? :P [00:09:56] well, that is obviously something which breaks things and as such should be immediatelly reverted [00:10:28] Revert what? [00:12:11] the cause ;-) [00:12:21] Well, find the cause and we can revert it [00:12:39] this is not how it should be [00:12:52] it broke things [00:12:57] like? [00:13:16] styles, scripts, gadgets [00:13:24] whatever counted with such behavior [00:14:30] Danny_B: can you show an example breakage somewhere? [00:14:44] greg-g: it's on the internet [00:14:51] ;-) [00:14:57] theoreticals are nice but showing *where* it's broken is needed [00:15:33] well, atm i can show only different visual rendering (obviously due to missing
<p>
) [00:16:01] if i had an idea where to look for the cause, i would. and since i don't i just reported what i found [00:16:10] Greg wants a URL [00:16:13] or a screenshot [00:16:15] I'm not talking cause. I'm saying: where did you notice the breakage (which wiki page)? [00:16:16] Or some pastebinned html [00:16:18] Or all 3 [00:16:22] some. thing. [00:16:32] "somethings broken" [00:16:35] "somethings fixed" [00:16:50] Reedy: Direct neural output from your eyes? [00:17:09] RainbowSprinkles: whatever freaking works, just something! :) [00:17:11] \!log fixed something for Danny_B [00:17:16] lol [00:17:17] stupid bot [00:17:39] * Danny_B must do the screenshot since pages are being re-rendered now thus linking would not help as it wouldn't probably be seen in old way [00:18:26] irc is hard :( [00:18:41] greg-g: btw i wrote earlier the *exact* change (just in general form) [00:19:13] Danny_B: doesn't help. I need a page to see it broken on [00:19:45] pls suggest some pastebin like site for images [00:20:05] imgur [00:20:09] phabricator [00:21:19] it's almost like we're already set up to take in bug reports ;) ;) [00:21:39] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3568655 (10aaron) As far as retries go, the attempts hash for wikidatawiki:htmlCacheUpdate has few entries with run counts no greater than 3. The onl in... [00:21:41] the bug is a lie. only features [00:22:13] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3568656 (10Eevans) [00:22:24] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3423217 (10Eevans) [00:23:11] I have to leave soon. If there's a there there, file a task and cc me on it. I'll see that (but not IRC) later. 
[00:23:44] RECOVERY - Check Varnish expiry mailbox lag on cp1062 is OK: OK: expiry mailbox lag is 0 [00:23:55] greg-g: https://pasteboard.co/GI7vt8A.png [00:24:22] Danny_B: link to that page please [00:24:32] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3568661 (10Eevans) All decommissioning is now complete, the following hosts are free to be configured into the new cluster: - restbase1010.eqiad.w... [00:24:34] (ignore the link color change - i created the page exactly right after the deployment) [00:24:42] i'm not sure why this is so hard, give us reproducible steps [00:25:06] i can't give you *reproducible* steps, as i can't no longer provide *old* behavior [00:25:16] https://cs.wikinews.org/wiki/30._srpen_2017 (and any such page) [00:25:43] There's still wikis on .15 [00:25:48] If it's indeed a .15 -> .16 bug [00:26:11] that's why at the very beginning i wrote the original and new rendered code [00:26:27] and mentioned several times that
<p>
disappeared [00:28:56] When did it break? [00:28:57] 17:33 arlolra@tin: Finished deploy [parsoid/deploy@bd12f8a]: Updating Parsoid to 538dad7f (duration: 12m 24s) [00:28:57] 17:21 arlolra@tin: Started deploy [parsoid/deploy@bd12f8a]: Updating Parsoid to 538dad7f [00:29:25] let me check the page history, mmt [00:29:25] Replicated. [00:29:53] BLOOOOOHCKER! [00:29:53] Easy replication path: Special:Random, open resulting page in two tabs. action=purge the 2nd [00:29:58] Then you see it easy [00:30:23] good for you then [00:30:37] because i see the stuff already with new rendering [00:30:58] Hmmmmm [00:30:59] hence why i was hesitant to provide url as i thought it would be for nothing and rather confusing [00:31:05] I'll test it on the command line [00:31:19] hello, TimStarling and thanks [00:31:33] Danny_B, that's because you have some settings that alter parser output [00:31:50] eh? [00:31:52] by using them, you're also making things slower for yourself [00:31:52] like? [00:32:06] thumbnail sizes, for example [00:32:29] Any time you deviate from the site default settings basically :p [00:32:34] you = myself or you = us (n:cs:)? 
[00:32:40] We should rename Special:Preferences to Special:PerformancePentalties [00:32:40] :p [00:33:03] i don't change thumbsizes from default [00:33:04] you = yourself, deviate from site (as in cs.wn) defaults [00:33:05] :) [00:33:21] Anyway, digressing [00:33:54] in any case there is nothing in preferences what could have an influence on this [00:34:13] btw - if that may help - the output comes from lua [00:34:50] 10Operations, 10ops-esams, 10netops: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637#3568685 (10ayounsi) [00:36:05] well, I see the p-wrapper when I do this in eval.php [00:36:14] He didn't say your preferences were causing it [00:36:17] Reedy: spotted between cca 00:50 and 01:03 utc+2 - the first timestamp is when i opened the tab with old behavior, the latter about when i saved the page and spotte the new behavior. then i nearly immediatelly asked here [00:36:39] have you actually put that text "
<div>\nlorem ipsum\n</div>
" on a page and viewed the resulting HTML? [00:37:18] let me put the exact text as produced by lua there [00:37:49] 10Operations, 10ops-eqiad, 10Traffic, 10netops: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3568719 (10RobH) So, the bios and ilom have been updated to the latest version, but the firmware on the NIC can only be flashed via the HP SPP iso boot. [00:40:41] TimStarling: when put as wikitext, it renders with
<p>
[00:40:53] so maybe the change is in the lua output? [00:40:53] when rendered via module, the
<p>
is not there [00:41:19] nobody changed the module since january ;-) [00:41:23] which template is it? Denní přehled? [00:41:33] yes and same name module [00:41:41] https://cs.wikinews.org/w/index.php?title=Modul:Denn%C3%AD_p%C5%99ehled&action=history [00:43:00] before today, i last edited the wiki yesterday about in the same time. and then everything was ok. so the change had to be deployed within last 24 hrs [00:53:26] well, if you remove the categories, then you do get a p-wrapper [00:54:42] they are outside the wrapping div, how they can have the influence? [00:55:07] so I guess it could be https://gerrit.wikimedia.org/r/#/c/371735/ [00:55:17] (and i can't remove them obviously, i can only move them elsewhere) [00:55:54] the p-wrapping code is a total joke, yes it can be non-local, you can even have non-local effects jumping over multiple lines of text [00:59:03] ugh:
<div>\nlorem ipsum\n</div> -> <div><p>\nlorem ipsum\n</p></div> BUT [[category:foo]]<div>\nlorem ipsum\n</div> -> <div>lorem ipsum</div>
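(Annotation, not part of the channel log.) The difference being debugged here, a p-wrapper that is present or absent inside a wrapping div, can be checked mechanically instead of by eye. The following is only a minimal sketch using Python's standard `html.parser`; the names `PWrapperChecker` and `has_p_wrapper` are made up for illustration, and the two sample strings are simplified stand-ins, not captured parser output.

```python
from html.parser import HTMLParser


class PWrapperChecker(HTMLParser):
    """Record whether any <p> element is opened while inside a <div>."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0          # how many <div> elements are currently open
        self.p_inside_div = False   # set once a <p> is seen inside a <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.div_depth += 1
        elif tag == "p" and self.div_depth > 0:
            self.p_inside_div = True

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1


def has_p_wrapper(html: str) -> bool:
    """Return True if the HTML contains a <p> nested inside a <div>."""
    checker = PWrapperChecker()
    checker.feed(html)
    checker.close()
    return checker.p_inside_div


# Simplified stand-ins for the old and new rendering (assumptions, not real output).
old_html = "<div><p>lorem ipsum</p></div>"
new_html = "<div>lorem ipsum</div>"
print(has_p_wrapper(old_html), has_p_wrapper(new_html))
```

Feeding it the HTML of the same page before and after an action=purge would show whether the wrapper disappeared, under the assumption that both versions can still be fetched.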
[00:59:58] so category code (= metadata, nothing rendered) *outside* tag can have an influence on the inside of the tag [01:00:40] hmm neither \n after the category helps [01:01:27] ok, this should be reverted because it will break a bunch of things because many pages have categories on top or added in the middle (ie via maintenance templates etc.) [01:02:12] so anything what follows and has the
<div>\n...\n</div>
like code will be now rendered incorrectly [01:05:59] funny, if you put bunch of newlines between category and following div, it will render bunch of empty
<p>'s (as it regularly does) but still no <p> within the <div>
[01:07:29] we are only talking about a vertical whitespace change, right? [01:07:44] that is the only difference in rendering? [01:08:42] that's *visual* part of the issue [01:09:27] it's not severe enough to make me want to revert it in production right now without talking to the developers [01:09:35] different output is the code part [01:10:09] it changes the semantics of the output, it changes the dom obviously etc [01:10:22] do you want to file the bug, or shall I do it? [01:10:57] i would appreciate if you could since you already know all the details so you can directly embed them. thank you (cc me) [01:11:54] Be sure to cross-ref T170634, whomever does file the bug [01:11:55] T170634: 1.30.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T170634 [01:12:53] imo the syntax of metadata must not have an influence on the rendered output, fortiori if it is in different level/part/whatever (the category outside
<div> changes behavior inside <div>
) [01:13:48] RainbowSprinkles: how come i didn't find it in changelog? was it missed by any chance? [01:13:57] or am i just blind? [01:14:53] Seemingly [01:14:53] git #c66c9aa5 - Fix link prefix/suffixes around Category and Language links. (task T2087, task T10897, task T87753) [01:14:53] T2087: Category tags produce ugly whitespace - https://phabricator.wikimedia.org/T2087 [01:14:54] T87753: Space between final 2 words in a page with ≥2 category tags is removed in arabic mediawiki - https://phabricator.wikimedia.org/T87753 [01:14:54] T10897: spaces wrongly removed - https://phabricator.wikimedia.org/T10897 [01:14:54] 3rd entry from the bottom of the core section [01:15:14] I cannot attest to your visual faculties, but it is indeed there :) [01:17:29] ah, right ;-) i ignored it because it does not mention parser, parsoid, parsing, whitespace or whatever like that, besides it mentions categories and langlinks while the change has and influence on page rendering [01:19:14] anyway, thanks and big kudos to TimStarling for debugging and finding the (likely) cause [02:08:25] 10Operations, 10Analytics: Invalid "wikimedia" family in unique devices data due to misplaced WMF-Last-Access-Global cookie - https://phabricator.wikimedia.org/T174640#3568769 (10Tbayer) [02:11:30] (03PS3) 10Ottomata: Give ores-admin users lsof, rather than the deployment user [puppet] - 10https://gerrit.wikimedia.org/r/374832 (https://phabricator.wikimedia.org/T174402) (owner: 10Awight) [02:11:45] (03CR) 10Ottomata: [V: 032 C: 032] Give ores-admin users lsof, rather than the deployment user [puppet] - 10https://gerrit.wikimedia.org/r/374832 (https://phabricator.wikimedia.org/T174402) (owner: 10Awight) [02:12:13] 10Operations, 10Analytics: Invalid "wikimedia" family in unique devices data due to misplaced WMF-Last-Access-Global cookie - https://phabricator.wikimedia.org/T174640#3568792 (10Tbayer) Regarding prioritization: While this is a clear bug, it does not affect the (from the Readers team's 
perspective) most impor... [02:26:10] (03PS1) 10Mattflaschen: Fix interwiki links on Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374922 (https://phabricator.wikimedia.org/T69931) [02:26:40] (03PS2) 10Mattflaschen: Fix interwiki links on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374922 (https://phabricator.wikimedia.org/T69931) [02:30:43] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.15) (duration: 10m 02s) [02:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:48:04] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.16) (duration: 06m 40s) [02:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:55:12] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Aug 31 02:55:12 UTC 2017 (duration 7m 9s) [02:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:29:14] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 675.54 seconds [04:32:36] (03PS2) 10GeoffreyT2000: Rename Wikisaurus namespace on Wiktionary to "Thesaurus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374063 (https://phabricator.wikimedia.org/T174264) [04:37:35] RECOVERY - Check systemd state on chlorine is OK: OK - running: The system is fully operational [04:40:44] PROBLEM - Check systemd state on chlorine is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[05:11:04] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 258.99 seconds [05:56:44] RECOVERY - MariaDB Slave Lag: s1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89999.98 seconds [06:05:36] (03PS1) 10Marostegui: db-eqiad.php: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374926 (https://phabricator.wikimedia.org/T168661) [06:13:48] (03PS1) 10Marostegui: mariadb: Update db1056 socket location [puppet] - 10https://gerrit.wikimedia.org/r/374933 [06:14:07] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374926 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [06:14:37] (03CR) 10Nemo bis: "(+1, this is ok to merge per consistency with all the other italic wikis)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263342 (https://phabricator.wikimedia.org/T123188) (owner: 10Mdann52) [06:15:32] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374926 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [06:16:11] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374926 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [06:16:43] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1056 - T168661 (duration: 00m 47s) [06:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:57] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [06:18:19] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1056 - T168661 (duration: 00m 47s) [06:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:12] !log Upgrade MariaDB to 10.0.32 on db1056 - T168661 [06:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[06:24:35] PROBLEM - Host scb2006 is DOWN: PING CRITICAL - Packet loss = 100% [06:25:04] RECOVERY - Host scb2006 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [06:25:08] (03CR) 10Marostegui: [C: 032] mariadb: Update db1056 socket location [puppet] - 10https://gerrit.wikimedia.org/r/374933 (owner: 10Marostegui) [06:36:51] (03PS1) 10Marostegui: db-eqiad.php: Repool db1056 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374945 (https://phabricator.wikimedia.org/T168661) [06:43:16] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1056 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374945 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [06:44:26] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1056 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374945 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [06:45:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1056 with low weight - T168661 (duration: 00m 46s) [06:45:32] (03PS4) 10Muehlenhoff: Make the server group / Cumin alias configurable [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/373870 [06:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:40] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [06:46:09] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1056 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374945 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [06:47:42] (03CR) 10Legoktm: Add libraryupgrader puppet module (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/372213 (https://phabricator.wikimedia.org/T173478) (owner: 10Legoktm) [06:47:44] (03PS5) 10Legoktm: Add role::labs::libraryupgrader puppet configuration [puppet] - 10https://gerrit.wikimedia.org/r/372213 (https://phabricator.wikimedia.org/T173478) [06:48:09] (03PS6) 10Legoktm: Add 
role::labs::libraryupgrader puppet configuration [puppet] - 10https://gerrit.wikimedia.org/r/372213 (https://phabricator.wikimedia.org/T173478) [06:51:08] (03CR) 10Volans: [C: 04-1] "It seems that there is a wrong exit status. Looks good otherwise." (031 comment) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/373870 (owner: 10Muehlenhoff) [07:06:49] (03PS5) 10Muehlenhoff: Make the server group / Cumin alias configurable [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/373870 [07:07:18] (03PS2) 10Giuseppe Lavagetto: requesttracker: fix further template scoping [puppet] - 10https://gerrit.wikimedia.org/r/374736 (https://phabricator.wikimedia.org/T171704) [07:07:20] (03PS2) 10Giuseppe Lavagetto: service::node: use validate_numeric for validating parameters [puppet] - 10https://gerrit.wikimedia.org/r/374737 [07:07:22] (03PS2) 10Giuseppe Lavagetto: sysfs::conffile: use validate_numeric for number validation [puppet] - 10https://gerrit.wikimedia.org/r/374738 (https://phabricator.wikimedia.org/T171704) [07:07:24] (03PS3) 10Giuseppe Lavagetto: prometheus: avoid validate_re for an integer [puppet] - 10https://gerrit.wikimedia.org/r/374704 (https://phabricator.wikimedia.org/T171704) [07:07:26] (03PS2) 10Giuseppe Lavagetto: varnish::common::vcl: fix template scoping [puppet] - 10https://gerrit.wikimedia.org/r/374739 (https://phabricator.wikimedia.org/T171704) [07:07:29] (03PS2) 10Giuseppe Lavagetto: varnish: convert to string integers [puppet] - 10https://gerrit.wikimedia.org/r/374778 (https://phabricator.wikimedia.org/T171704) [07:07:31] (03PS1) 10Giuseppe Lavagetto: varnish: stringify instance ports [puppet] - 10https://gerrit.wikimedia.org/r/374946 (https://phabricator.wikimedia.org/T171704) [07:07:41] (03CR) 10Volans: [C: 031] "LGTM" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/373870 (owner: 10Muehlenhoff) [07:08:14] (03CR) 10Muehlenhoff: [C: 032] Make the server group / Cumin alias configurable [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/373870 
(owner: 10Muehlenhoff) [07:20:18] (03CR) 10Giuseppe Lavagetto: [C: 032] prometheus: avoid validate_re for an integer [puppet] - 10https://gerrit.wikimedia.org/r/374704 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) [07:20:35] (03CR) 10Giuseppe Lavagetto: [C: 032] requesttracker: fix further template scoping [puppet] - 10https://gerrit.wikimedia.org/r/374736 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) [07:20:53] (03CR) 10Giuseppe Lavagetto: [C: 032] service::node: use validate_numeric for validating parameters [puppet] - 10https://gerrit.wikimedia.org/r/374737 (owner: 10Giuseppe Lavagetto) [07:21:12] (03CR) 10Giuseppe Lavagetto: [C: 032] sysfs::conffile: use validate_numeric for number validation [puppet] - 10https://gerrit.wikimedia.org/r/374738 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) [07:24:17] <_joe_> ouch, I did a mistake [07:25:10] need help? [07:25:20] <_joe_> nope [07:26:12] (03PS1) 10Giuseppe Lavagetto: requesttracker: fixup for template [puppet] - 10https://gerrit.wikimedia.org/r/374947 [07:26:42] (03CR) 10Giuseppe Lavagetto: [C: 032] requesttracker: fixup for template [puppet] - 10https://gerrit.wikimedia.org/r/374947 (owner: 10Giuseppe Lavagetto) [07:27:07] (03PS1) 10Marostegui: db-eqiad.php: Increase weight on db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374948 (https://phabricator.wikimedia.org/T168661) [07:29:52] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight on db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374948 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [07:31:43] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight on db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374948 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [07:31:57] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight on db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374948 
(https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [07:32:53] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic on db1056 - T168661 (duration: 00m 47s) [07:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:06] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [07:45:15] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0 [07:47:44] (03PS1) 10Jcrespo: dblists: Remove s4 from db1095 [software] - 10https://gerrit.wikimedia.org/r/374949 [07:49:24] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [07:49:26] (03CR) 10Jcrespo: [C: 032] dblists: Remove s4 from db1095 [software] - 10https://gerrit.wikimedia.org/r/374949 (owner: 10Jcrespo) [07:55:20] (03CR) 10Reedy: [C: 04-1] "Should expose it via the noc/conf docroot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374922 (https://phabricator.wikimedia.org/T69931) (owner: 10Mattflaschen) [07:55:24] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 26 probes of 284 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [07:57:35] (03PS1) 10Muehlenhoff: Switch to Python3-compatible print() [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/374950 [08:00:24] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 1 probes of 284 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [08:00:35] (03CR) 10Filippo Giunchedi: "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/374650 (https://phabricator.wikimedia.org/T169039) (owner: 10Rush) [08:00:46] (03CR) 10Filippo Giunchedi: [C: 031] prometheus: allow setting a specific listening address and port [puppet] - 10https://gerrit.wikimedia.org/r/374650 
(https://phabricator.wikimedia.org/T169039) (owner: 10Rush) [08:01:53] !log Rename reader_feedback, reader_feedback_history, reader_feedback_pages tables on dewiki on db1092 - T174586 [08:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:06] T174586: Remove ReaderFeedback tables from wikis - https://phabricator.wikimedia.org/T174586 [08:03:28] (03PS1) 10Jcrespo: dblists: update db1069 role [software] - 10https://gerrit.wikimedia.org/r/374952 [08:05:08] (03CR) 10Volans: "Nice effort! See a couple of comments inline." (033 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/374950 (owner: 10Muehlenhoff) [08:05:10] (03CR) 10Jcrespo: [C: 032] dblists: update db1069 role [software] - 10https://gerrit.wikimedia.org/r/374952 (owner: 10Jcrespo) [08:08:34] (03CR) 10Muehlenhoff: Switch to Python3-compatible print() (033 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/374950 (owner: 10Muehlenhoff) [08:09:05] (03PS2) 10Muehlenhoff: Switch to Python3-compatible print() [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/374950 [08:11:48] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374953 (https://phabricator.wikimedia.org/T168661) [08:14:10] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374953 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [08:15:41] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374953 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [08:16:19] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374953 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [08:16:37] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic on db1056 - T168661 
(duration: 00m 47s) [08:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:49] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [08:19:06] Minor favor if anyone is available: We’ve disabled puppet on ores1001.eqiad.wmnet in order to do some stress tests, but now I need a manual run of the agent in order to get some settings over there. [08:19:14] (03PS1) 10Ladsgroup: mediawiki: Another attempt to fix cronspam [puppet] - 10https://gerrit.wikimedia.org/r/374954 [08:19:28] Amir1: o/! [08:20:01] !log restart postgresql / kartotherian / tilerator / tileratorui on maps@codfw [08:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:29] awight: hey, I don't have access I guess [08:21:35] but goat morning [08:22:15] The grown-ups don’t give me that kind of car keys neither. [08:22:47] we’re able to run “lsof” on scb now, thanks to ottomata [08:23:02] weird stuff. [08:23:12] uwsgi 1566 www-data DEL REG 0,5 1250822577 /dev/zero [08:23:12] uwsgi 1566 www-data DEL REG 0,5 1250822576 /dev/zero [08:23:13] uwsgi 1566 www-data DEL REG 0,5 1250822575 /dev/zero [08:23:16] for example. [08:25:36] (03CR) 10Volans: [C: 031] "LGTM" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/374950 (owner: 10Muehlenhoff) [08:27:35] (03CR) 10Muehlenhoff: [C: 032] Switch to Python3-compatible print() [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/374950 (owner: 10Muehlenhoff) [08:31:59] goatmorning Amir1! [08:33:00] If noone vocally objects I'm going to quickly backport https://gerrit.wikimedia.org/r/#/c/374957/ which adds waitForReplication() calls to a maintenance script I am about to run. 
[08:35:48] jfdi [08:39:32] already am [08:39:41] 10Operations, 10ORES, 10Scoring-platform-team-Backlog, 10Patch-For-Review, 10User-Ladsgroup: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3569163 (10awight) We'll need a manual puppet run on ores1001. I'm seeing some weirdness, just pok... [08:40:11] !log addshore@tin Synchronized php-1.30.0-wmf.16/extensions/Cognate/maintenance/recalculateCognateNormalizedHashes.php: T172987 [[gerrit:374957|Add waitForReplication to RecalculateCognateNormalizedHashes script]] (duration: 00m 47s) [08:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:23] T172987: Match cʼh with c'h - https://phabricator.wikimedia.org/T172987 [08:40:45] I saw you waited about 1 minute for objections :P [08:41:04] !log restart postgresql / kartotherian / tilerator / tileratorui on maps@eqiad [08:41:08] Reedy: well, 7 minuites from my message to the sync! [08:41:09] also Reedy https://gerrit.wikimedia.org/r/#/c/374846/ [08:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:17] (03PS4) 10Filippo Giunchedi: install_server: add partman for cassandra JBOD [puppet] - 10https://gerrit.wikimedia.org/r/373863 (https://phabricator.wikimedia.org/T169939) [08:41:24] RECOVERY - kartotherian endpoints health on maps-test2003 is OK: All endpoints are healthy [08:41:36] RECOVERY - kartotherian endpoints health on maps-test2002 is OK: All endpoints are healthy [08:41:44] RECOVERY - kartotherian endpoints health on maps-test2001 is OK: All endpoints are healthy [08:41:45] RECOVERY - kartotherian endpoints health on maps-test2004 is OK: All endpoints are healthy [08:42:28] ^that's an unexpected recovery... (not that I'm complaining :) [08:42:35] godog: good morning! 
I could use your root blessing to delete /srv/deployment/grafana on tin.eqiad.wmnet since it is no more deployed via Trebuchet :] [08:42:53] godog: we are trying to clear out /srv/deployment a bit and grafana has some .git dir with files not writable by wikidev :( [08:42:53] gehel: lol \o/ they were ack'ed since long time IIRC [08:43:47] yep, they were, we are still working on it with pnorman. But maybe the postgres restart just fixed the issue in some strange and unrelated way... [08:45:20] :) [08:45:46] !log remove /srv/deployment/grafana on naos and tin, unused - T170881 [08:45:50] hashar: {{done}} ^ [08:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:58] T170881: Cleanup /srv/deployment - https://phabricator.wikimedia.org/T170881 [08:46:11] godog: awesome thank you! [08:47:23] (03PS5) 10Filippo Giunchedi: install_server: add partman for cassandra JBOD [puppet] - 10https://gerrit.wikimedia.org/r/373863 (https://phabricator.wikimedia.org/T169939) [08:48:07] 10Operations, 10Electron-PDFs, 10Patch-For-Review, 10Readers-Web-Backlog (Tracking), 10Services (blocked): pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3569220 (10akosiaris) FWIW is worth I am for closing this and moving the other parts... 
[08:50:02] (03PS1) 10Marostegui: db-eqiad.php: Increase db1056 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374958 (https://phabricator.wikimedia.org/T168661) [08:50:21] (03PS6) 10Filippo Giunchedi: install_server: add partman for cassandra JBOD [puppet] - 10https://gerrit.wikimedia.org/r/373863 (https://phabricator.wikimedia.org/T169939) [08:51:11] (03CR) 10Filippo Giunchedi: [C: 032] install_server: add partman for cassandra JBOD [puppet] - 10https://gerrit.wikimedia.org/r/373863 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [08:57:10] !log contint2001 upgrading git, restarting git-daemon and zuul-merger - T161086 [08:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:23] T161086: Upgrade git package on zuul-merger instances contint1001 / contint2001 to benefit git-daemon - https://phabricator.wikimedia.org/T161086 [09:01:30] !log test reimage of restbase2001 - T169939 [09:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:43] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [09:04:17] !log contint1001 upgrading git, restarting git-daemon and zuul-merger - T161086 [09:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:29] T161086: Upgrade git package on zuul-merger instances contint1001 / contint2001 to benefit git-daemon - https://phabricator.wikimedia.org/T161086 [09:09:24] (03CR) 10Alexandros Kosiaris: [C: 032] zuul: rspec tests [puppet] - 10https://gerrit.wikimedia.org/r/299151 (owner: 10Hashar) [09:09:29] (03PS5) 10Alexandros Kosiaris: zuul: rspec tests [puppet] - 10https://gerrit.wikimedia.org/r/299151 (owner: 10Hashar) [09:09:31] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] zuul: rspec tests [puppet] - 10https://gerrit.wikimedia.org/r/299151 (owner: 10Hashar) [09:09:57] (03PS4) 10Alexandros Kosiaris: authdns: basic spec [puppet] - 
10https://gerrit.wikimedia.org/r/341602 (owner: 10Hashar) [09:10:02] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] authdns: basic spec [puppet] - 10https://gerrit.wikimedia.org/r/341602 (owner: 10Hashar) [09:10:05] !log installing experimental hhvm-luasandbox package on mw1261 (canary app server) with backported patch for T171392 / T173705 [09:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:17] T171392: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 when accessed from non-English languages specified in the template - https://phabricator.wikimedia.org/T171392 [09:10:25] (03PS5) 10Alexandros Kosiaris: authdns: basic spec [puppet] - 10https://gerrit.wikimedia.org/r/341602 (owner: 10Hashar) [09:10:29] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] authdns: basic spec [puppet] - 10https://gerrit.wikimedia.org/r/341602 (owner: 10Hashar) [09:10:57] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase db1056 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374958 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [09:11:14] (03PS4) 10Alexandros Kosiaris: contint: boilerplate for spec tests [puppet] - 10https://gerrit.wikimedia.org/r/342206 (owner: 10Hashar) [09:11:21] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] contint: boilerplate for spec tests [puppet] - 10https://gerrit.wikimedia.org/r/342206 (owner: 10Hashar) [09:13:54] akosiaris: they are rather lame rspec stuff but that helps catching up dummy mistakes locally :] [09:15:48] hashar: yeah no worries. look fine enough to me. Thanks! 
[09:22:34] (03Merged) 10jenkins-bot: db-eqiad.php: Increase db1056 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374958 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [09:22:44] (03CR) 10jenkins-bot: db-eqiad.php: Increase db1056 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374958 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [09:24:26] hashar: do you remember when we setup the debian glue job for cumin? [09:25:12] I've recently opened T173999 and I'm interested in your feedback if/when you've time, not urgent at all [09:25:12] T173999: CI job debian-glue-non-voting: add support for BACKPORTS=yes - https://phabricator.wikimedia.org/T173999 [09:25:41] volans: was done back in Feb with I6f50210148abe9f54f01d5380636d630d8a29876 [09:26:00] it is non voting though and does not trigger on master branch [09:26:05] yeah [09:26:12] that part works fine [09:26:35] I am in a meeting with zeljko right now, might look at it in half an hour, else post lunch :D [09:26:48] sure, no hurry [09:26:48] ah yeah [09:27:00] we would need to have zuul to inject a BACKPORTS parameter to the job [09:27:06] which would end up as an env variable in the job [09:27:13] yep [09:27:19] there is some python logic for that in integration/config.git under /zuul/something.py [09:27:23] (cant remember the file name) [09:27:30] should be easy [09:27:43] the thing is the job looks at the debian/changelog to find the distribution [09:27:47] most probably jessie-wikimedia [09:27:52] yep [09:28:04] you were not in a meeting? :- [09:28:05] :-P [09:28:06] and our apt hooks (in modules/package_builder ) support either jessie-wikimedia or jessie-backports [09:28:20] a combination of both need one to be set as an env variable [09:28:27] multitasking man!!! 
[09:29:02] indeed but I don't want to hardcode the repo name in zuul/parameter_functions.py if that is the one you're referring to [09:29:24] would be nice to have a way to pass the parameter from layout.yaml on a per-repo basis [09:30:08] Zuul has a way to tag jobs from layout.yaml but I have never played with it [09:30:14] and most probably that is on a per job basis [09:30:21] but all repositories trigger the same job :( [09:31:12] yeah that was th.cipriani was saying in the task too [09:31:24] that is a bit dirty and harcoded [09:31:30] but that is the easiest path [09:31:57] (03PS4) 10Elukey: [WIP] Optimize EventLogging purging script using timestamps [puppet] - 10https://gerrit.wikimedia.org/r/374823 (https://phabricator.wikimedia.org/T156933) (owner: 10Mforns) [09:32:07] (03PS2) 10Volans: mediawiki: Another attempt to fix cronspam [puppet] - 10https://gerrit.wikimedia.org/r/374954 (owner: 10Ladsgroup) [09:33:06] (03PS3) 10Volans: mediawiki: Another attempt to fix cronspam [puppet] - 10https://gerrit.wikimedia.org/r/374954 (owner: 10Ladsgroup) [09:33:40] (03CR) 10Volans: [C: 032] mediawiki: Another attempt to fix cronspam [puppet] - 10https://gerrit.wikimedia.org/r/374954 (owner: 10Ladsgroup) [09:40:36] !log installing libxml2 security updates [09:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:00] volans: so yeah essentially what tyler said [09:43:00] ok, I hoped for a cleaner solution :) [09:43:30] !log restarting apache on auth* to pick up libxml security update [09:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:57] volans: and I guess cumin requires packages both from jessie-backports and jessie-wikimedia isn't it ? [09:44:10] of course! 
[09:44:11] :D [09:44:13] :D [09:44:29] !log mwscript extensions/Cognate/maintenance/recalculateCognateNormalizedHashes.php --wiki enwiktionary --batch-size 1000000 --dry-run [09:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:50] once we migrate cumin masters to stretch this might not be true anymore though hashar, I will need to verify [09:46:05] (03CR) 10Ema: varnish: stringify instance ports (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/374946 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) [09:47:34] volans: an alternative way is if the version string has "bpo" we could set BACKPORTS [09:47:48] so you would get something like 1.0~bpo8 (jessie-wikimedia) [09:47:54] but my package is not a BPO [09:48:01] it needs backports to build [09:48:03] and the apt hooks from package_builder would set both WIKIMEDIA=yes and BACKPORTS=yes [09:48:04] that's it [09:48:14] which would save one from having to tweak CI to inject BACKPORTS [09:48:22] ah [09:48:56] BACKPORTS=yes in the builder environment is to include backports repos while looking for dependencies for the build [09:49:17] so is completely arbitrary :D [09:49:45] and tyler had another approach which is to craft a job named debian-glue-backports [09:49:50] which would have BACKPORT=yes [09:50:11] so in the zuul layout one would: [09:50:12] test: [09:50:19] - debian-glue-backport-non-voting [09:50:32] yeah, that's an option too [09:50:33] !log addshore@terbium:~$ mwscript extensions/Cognate/maintenance/recalculateCognateNormalizedHashes.php --wiki enwiktionary --batch-size 1000000 # Should upsert 6326 rows T172987 [09:50:35] which might make it a bit more obvious that the job was run with backport [09:50:43] since it will show up in the comment reported to gerrit [09:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:49] T172987: Match cʼh with c'h - https://phabricator.wikimedia.org/T172987 
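Note on the debian-glue discussion above: one way to picture the "dedicated job name" option is a Zuul parameter function that sets BACKPORTS=yes only for jobs whose name marks them as backports builds. The following is a minimal sketch, not the change hashar actually wrote. The `Job` stand-in class, the `(item, job, params)` shape, and the name-matching rule are assumptions made for illustration; only the `BACKPORTS=yes` variable and the `debian-glue...` job names come from the conversation.

```python
class Job:
    """Hypothetical stand-in for the job object passed by Zuul; only .name is used."""

    def __init__(self, name):
        self.name = name


def set_parameters(item, job, params):
    """Inject BACKPORTS=yes for debian-glue jobs named as backports builds.

    Assumption: the job name itself (e.g. 'debian-glue-backports-non-voting')
    carries the intent, so no repository name is hardcoded here.
    """
    if job.name.startswith("debian-glue") and "backport" in job.name:
        # Ends up as an environment variable in the build job.
        params["BACKPORTS"] = "yes"


# Usage: the plain job is left alone, the backports variant gets the variable.
plain, backports = {}, {}
set_parameters(None, Job("debian-glue-non-voting"), plain)
set_parameters(None, Job("debian-glue-backports-non-voting"), backports)
print(plain, backports)
```

Keying on the job name keeps the repository list in the layout file, which is the property volans asked for, at the cost of maintaining extra job variants.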
[09:53:33] fine for me, if doesn't add complexity to maintain those 4 jobs [09:56:59] volans: I am crafting a change :D [09:57:12] thanks hashar! my turn for the meeting now [09:57:33] !log extensions/Cognate/maintenance/recalculateCognateNormalizedHashes.php run done, 6326 hashes recalculated, T172987 [09:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:45] T172987: Match cʼh with c'h - https://phabricator.wikimedia.org/T172987 [09:59:33] PROBLEM - DPKG on labnodepool1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:00:33] RECOVERY - DPKG on labnodepool1002 is OK: All packages OK [10:01:05] ^ that's benign, upgrade [10:03:07] (03PS1) 10Marostegui: db-eqiad.php: Restore db1056 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374969 (https://phabricator.wikimedia.org/T168661) [10:09:57] volans: https://gerrit.wikimedia.org/r/#/c/374970/ but I wanna Tyler to look at it as well [10:10:10] <_joe_> bbiab [10:10:12] thanks! 
I'll have a look too [10:10:49] !log restarting apache on hafnium to pick up libxml security update [10:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:05] !log restarting nginx on debug proxies to pick up libxml security update [10:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:38] (03PS6) 10Alexandros Kosiaris: kubernetes: Refactor/add admission controllers [puppet] - 10https://gerrit.wikimedia.org/r/374795 (https://phabricator.wikimedia.org/T170119) [10:14:18] volans: tyler usually show up early, so probably we would get it deployed by the end of our afternoon :] [10:14:22] lunch & [10:14:42] !log restarting nginx on meitnerium/archiva.wikimedia.org to pick up libxml security update [10:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:05] (03PS1) 10Ema: varnish: remove varnishtest-runner [puppet] - 10https://gerrit.wikimedia.org/r/374973 (https://phabricator.wikimedia.org/T150660) [10:22:41] !log restarting apache on einsteinium/tegmen (Icinga hosts) to pick up libxml security update [10:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:15] PROBLEM - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:27:45] !log restarting nginx on prometheus* to pick up libxml security update [10:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:25] (03PS7) 10Alexandros Kosiaris: kubernetes: Refactor/add admission controllers [puppet] - 10https://gerrit.wikimedia.org/r/374795 (https://phabricator.wikimedia.org/T170119) [10:40:27] (03PS1) 10Alexandros Kosiaris: kubernetes: Add a few recommended admission controllers [puppet] - 10https://gerrit.wikimedia.org/r/374974 (https://phabricator.wikimedia.org/T170119) [10:47:27] (03PS5) 10Ema: Serve a synth error page when error body is empty in Varnish [puppet] - 
10https://gerrit.wikimedia.org/r/365589 (https://phabricator.wikimedia.org/T169683) (owner: 10Gilles) [10:47:44] (03CR) 10Ema: [V: 032 C: 032] Serve a synth error page when error body is empty in Varnish [puppet] - 10https://gerrit.wikimedia.org/r/365589 (https://phabricator.wikimedia.org/T169683) (owner: 10Gilles) [10:58:26] !lof restarting uwsgi* services on graphite to pick up libxml security update [11:03:05] RECOVERY - MariaDB Slave Lag: s3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89950.54 seconds [11:15:36] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1056 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374969 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [11:17:11] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1056 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374969 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [11:17:19] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1056 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374969 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [11:17:34] !log restarting apache on dbmonitor* to pick up libxml security update [11:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:22] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1056 original weight - T168661 (duration: 00m 47s) [11:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:33] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [11:21:45] !log restart apache2 on bohrium for libxml + gnutls security updates [11:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:35] (03PS5) 10Elukey: [WIP] eventlogging_cleaner.py: improve sanitize method [puppet] - 10https://gerrit.wikimedia.org/r/374823 (https://phabricator.wikimedia.org/T156933) 
(owner: 10Mforns) [11:38:00] (03CR) 10jerkins-bot: [V: 04-1] [WIP] eventlogging_cleaner.py: improve sanitize method [puppet] - 10https://gerrit.wikimedia.org/r/374823 (https://phabricator.wikimedia.org/T156933) (owner: 10Mforns) [11:39:27] (03PS1) 10Muehlenhoff: Remove access for dworley [puppet] - 10https://gerrit.wikimedia.org/r/374979 [11:39:32] (03PS6) 10Elukey: [WIP] eventlogging_cleaner.py: improve sanitize method [puppet] - 10https://gerrit.wikimedia.org/r/374823 (https://phabricator.wikimedia.org/T156933) (owner: 10Mforns) [11:48:57] (03CR) 10Alexandros Kosiaris: [C: 032] Upgrade to kubernetes 1.7.4 [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/373554 (https://phabricator.wikimedia.org/T170119) (owner: 10Alexandros Kosiaris) [11:50:19] (03PS8) 10Alexandros Kosiaris: kubernetes: Refactor/add admission controllers [puppet] - 10https://gerrit.wikimedia.org/r/374795 (https://phabricator.wikimedia.org/T170119) [11:50:25] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] kubernetes: Refactor/add admission controllers [puppet] - 10https://gerrit.wikimedia.org/r/374795 (https://phabricator.wikimedia.org/T170119) (owner: 10Alexandros Kosiaris) [11:55:04] RECOVERY - Check systemd state on chlorine is OK: OK - running: The system is fully operational [11:55:24] PROBLEM - DPKG on mw2202 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:56:01] (03PS2) 10Alexandros Kosiaris: Reimage kubernetes1004, chlorine as stretch [puppet] - 10https://gerrit.wikimedia.org/r/374511 (https://phabricator.wikimedia.org/T170119) [11:56:05] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Reimage kubernetes1004, chlorine as stretch [puppet] - 10https://gerrit.wikimedia.org/r/374511 (https://phabricator.wikimedia.org/T170119) (owner: 10Alexandros Kosiaris) [11:56:24] RECOVERY - DPKG on mw2202 is OK: All packages OK [11:58:13] !log restarting nginx on dataset1001/ms1001 to pick up libxml security update [11:58:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:39] (03PS1) 10Muehlenhoff: Mark as released [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/374981 [12:03:52] (03PS7) 10Elukey: [WIP] eventlogging_cleaner.py: improve sanitize method [puppet] - 10https://gerrit.wikimedia.org/r/374823 (https://phabricator.wikimedia.org/T156933) (owner: 10Mforns) [12:04:53] !log reimage chlorine and kubernetes1004 T10119 [12:04:56] !log reimage chlorine and kubernetes1004 T170119 [12:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:06] T10119: Correct spelling of Category namespace in Limburgish - https://phabricator.wikimedia.org/T10119 [12:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:19] T170119: Upgrade to kubernetes >=1.5 - https://phabricator.wikimedia.org/T170119 [12:06:40] (03PS1) 10Marostegui: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374982 (https://phabricator.wikimedia.org/T168661) [12:07:38] (03CR) 10Muehlenhoff: [C: 032] Mark as released [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/374981 (owner: 10Muehlenhoff) [12:09:13] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374982 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:10:32] (03PS1) 10Marostegui: mariadb: Update db1084 socket location [puppet] - 10https://gerrit.wikimedia.org/r/374983 (https://phabricator.wikimedia.org/T148507) [12:10:46] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374982 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:10:56] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374982 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:11:46] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1084 - 
T168661 (duration: 00m 47s) [12:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:58] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [12:14:26] !log Upgrade MariaDB on db1084 - T168661 [12:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:14] (03CR) 10Marostegui: [C: 032] mariadb: Update db1084 socket location [puppet] - 10https://gerrit.wikimedia.org/r/374983 (https://phabricator.wikimedia.org/T148507) (owner: 10Marostegui) [12:27:53] (03PS1) 10Marostegui: db-eqiad.php: Repool db1084 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374985 (https://phabricator.wikimedia.org/T168661) [12:31:48] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1084 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374985 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:33:18] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1084 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374985 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:33:31] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1084 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374985 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:34:19] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1084 with low weight - T168661 (duration: 00m 47s) [12:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:32] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [12:36:50] (03CR) 10Volans: "I'm generally ok with this approach given the safeguards against unexpected big updates. See a couple of comments inline." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/374823 (https://phabricator.wikimedia.org/T156933) (owner: 10Mforns) [12:39:20] (03PS1) 10Muehlenhoff: Remove access for midom [puppet] - 10https://gerrit.wikimedia.org/r/374986 [12:41:08] (03CR) 10Muehlenhoff: [C: 032] Remove access for midom [puppet] - 10https://gerrit.wikimedia.org/r/374986 (owner: 10Muehlenhoff) [12:47:20] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 13 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[darmstadtium.eqiad.wmnet/calico/node],Logical_volume[data],Logical_volume[metadata] [12:48:19] PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170831T1300). [13:00:20] !log installing augeas security updates [13:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:55] SWAT is empty [13:02:42] \o/ [13:04:04] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3569772 (10jcrespo) I do not think we should postpone the reboots too much, my proposal would be to: 0) document access to the new hosts (bare essentials) 1)... 
[13:04:38] (03PS1) 10Marostegui: db-eqiad.php: Increase weight on db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374988 (https://phabricator.wikimedia.org/T168661) [13:07:56] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight on db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374988 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [13:07:57] !log restart zookeeper on conf2001 for security updates (canary node) [13:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:30] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight on db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374988 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [13:09:40] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight on db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374988 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [13:10:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1084 weight - T168661 (duration: 00m 47s) [13:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:46] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [13:30:56] (03PS1) 10Marostegui: db-eqiad.php: Restore db1084 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374990 (https://phabricator.wikimedia.org/T168661) [13:33:32] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1084 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374990 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [13:35:02] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1084 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374990 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [13:35:26] 10Operations, 10Traffic: Fix broken referer categorization for visits from Safari browsers - 
https://phabricator.wikimedia.org/T154702#3569829 (10Nuria) @TheDJ : noted, please ping us when those changes make it into safari's main version. The bulk of our views is on Safari 10 so if problem is fixed on next ver... [13:36:01] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1084 original weight - T168661 (duration: 00m 47s) [13:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:14] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [13:36:22] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1084 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374990 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [13:38:49] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3569839 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp4022.ulsfo.wmnet'] ``` The log can be found in `/var/lo... 
[13:42:01] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp4022_v4, cp4022_v6 [13:42:01] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp4022_v4, cp4022_v6 [13:42:01] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp4022_v4, cp4022_v6 [13:42:12] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp4022_v4, cp4022_v6 [13:42:14] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp4022_v4, cp4022_v6 [13:42:14] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp4022_v4, cp4022_v6 [13:42:20] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp4022_v4, cp4022_v6 [13:42:20] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp4022_v4, cp4022_v6 [13:42:20] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp4022_v4, cp4022_v6 [13:42:21] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp4022_v4, cp4022_v6 [13:42:21] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp4022_v4, cp4022_v6 [13:42:28] yay for more ipsec spam :P [13:42:30] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp4022_v4, cp4022_v6 [13:42:40] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp4022_v4, cp4022_v6 [13:42:41] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp4022_v4, cp4022_v6 [13:42:50] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp4022_v4, cp4022_v6 [13:42:50] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp4022_v4, cp4022_v6 [13:42:50] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp4022_v4, cp4022_v6 [13:42:50] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: 
cp4022_v4, cp4022_v6 [13:42:51] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp4022_v4, cp4022_v6 [13:42:51] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp4022_v4, cp4022_v6 [13:43:00] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp4022_v4, cp4022_v6 [13:43:02] bblack: holy spam [13:43:42] there's a ticket about it, no good solutions [13:44:16] Link? [13:45:18] well the only one that came up quickly was https://phabricator.wikimedia.org/T148976 [13:45:29] the idea in the title is decent, I'm not so sure about the other discussion in there [13:46:02] we don't really want to turn off alerting for all related ipsec nodes just because one is depooled [13:46:39] has noone forked varnish yet [13:46:46] but it would be nice if the ipsec checking scripts that runs on each node, could filter the not-conn nodes that cause alerts according to which hosts are either downtimed or missing in icinga [13:47:02] bblack: agreed [13:47:11] (03PS1) 10Filippo Giunchedi: install_server: shrink raid partition for cassandra commitlog/cache [puppet] - 10https://gerrit.wikimedia.org/r/374993 [13:47:13] (03PS1) 10Filippo Giunchedi: WIP jbod config for cassandra [puppet] - 10https://gerrit.wikimedia.org/r/374994 [13:48:02] bblack: if i knew how to filter that i would write up a patch right now [13:50:08] it probably wouldn't be a great idea to have every ipsec node polling back to some icinga API over the network on every check, it seems kinda crazy. 
[13:50:43] it would seem more reliably to structure things in reverse, where icinga exports a list of non-downtimed hosts that are running ipsec checks that gets dropped in some file on ipsec hosts [13:51:00] and the they don't alert for failing peer hostnames that aren't in the list on disk [13:51:04] or something [13:51:40] Or even if the host is downtime have it not even run the check bblack [13:57:47] Zppix: the nature of the problem is that for a given ipsec cluster, the check on every host is checking the state of its connection to every other host [13:57:54] (03PS2) 10Filippo Giunchedi: install_server: shrink raid partition for cassandra commitlog/cache [puppet] - 10https://gerrit.wikimedia.org/r/374993 [13:57:56] (03PS2) 10Filippo Giunchedi: cassandra: jbod devices configuration [puppet] - 10https://gerrit.wikimedia.org/r/374994 (https://phabricator.wikimedia.org/T169939) [13:58:20] Zppix: so you have, say, 40 hosts in a given cluster. When one goes down, the other 39 report a failed connection to that host (from checks that run "on" the other 39 hosts) [13:58:53] Zppix: and that check is a singular check execution on each, covering all the peer hosts in one go, because otherwise the raw count of icinga checks explodes. [13:58:57] bblack: i see... so yeah if the host is downtime ignore the fail connection? [13:59:28] right, the problem is how do you efficiently and reliably do that. the check and decision to critical happens in a script running on each host itself. 
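Note on the "structure things in reverse" idea above: the filtering step that bblack describes could look roughly like the sketch below. It assumes that Icinga (or some export job) drops a plain-text file of alertable hostnames, one per line, on each IPsec host, and that link names follow the `cp4022_v4` / `cp4022_v6` pattern seen in the alerts. The function names, the file format, and the "ignored" suffix in the OK message are all hypothetical; this is not the existing check script.

```python
import re


def load_alertable_hosts(lines):
    """Parse the exported list: one hostname per line, '#' starts a comment."""
    hosts = set()
    for line in lines:
        name = line.split("#", 1)[0].strip()
        if name:
            hosts.add(name)
    return hosts


def peer_host(link):
    """Map a link name such as 'cp4022_v4' to its host name, 'cp4022'."""
    return re.sub(r"_v[46]$", "", link)


def evaluate(not_conn, ok_count, alertable):
    """Return (exit_code, message); only peers in the exported list can alert."""
    relevant = sorted(l for l in not_conn if peer_host(l) in alertable)
    ignored = sorted(set(not_conn) - set(relevant))
    if relevant:
        return 2, "Strongswan CRITICAL - ok: %d not-conn: %s" % (
            ok_count, ", ".join(relevant))
    message = "Strongswan OK - %d ESP OK" % ok_count
    if ignored:
        # Peers that are downtimed or absent from Icinga are reported, not alerted on.
        message += " (ignored: %s)" % ", ".join(ignored)
    return 0, message


# Usage: cp4022 is being reimaged, so it is missing from the exported list.
alertable = load_alertable_hosts(["cp1099", "cp2017  # text cache", ""])
print(evaluate(["cp4022_v4", "cp4022_v6"], 66, alertable))
```

The check stays a single execution per host and needs no network call back to Icinga; the trade-off is that the list on disk can be stale between exports.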
[14:00:03] icinga sees 1x check per node in the cluster, and the check involves icinga using NRPE to executing an ipsec-checking script on the host itself [14:00:23] it gets kinda crazy if then all of these also call back over the network somehow to icinga to poll for downtime status [14:01:18] bblack: maybe the npre script should ask the icinga db what hosts are downtimed and then if a cluster is failing due to a downtime host ignore the failure [14:01:24] the right answer, structucally, is to split the ipsec check into per-link checks. so in a 40-node cluster, given ipv4+ipv6, each node would have 39x2=78 checks, one for each peer link [14:01:52] then we could add proper dependencies so that a downtimed host causes all links to that host from others to be ignored [14:02:06] but multiplying our ipsec checks by a factor of 78 sounds painful in terms of general monitoring load [14:02:09] (03PS3) 10Filippo Giunchedi: install_server: shrink raid partition for cassandra commitlog/cache [puppet] - 10https://gerrit.wikimedia.org/r/374993 [14:02:11] (03PS3) 10Filippo Giunchedi: cassandra: jbod devices configuration [puppet] - 10https://gerrit.wikimedia.org/r/374994 (https://phabricator.wikimedia.org/T169939) [14:02:56] (03PS4) 10Filippo Giunchedi: install_server: shrink raid partition for cassandra commitlog/cache [puppet] - 10https://gerrit.wikimedia.org/r/374993 [14:03:13] "the nrpe script should ask the icinga db" is the problem in your maybe. how does the remote ipsec host efficiency gain this information from the icinga DB constantly? [14:03:23] s/efficiency/efficiently/ :) [14:04:19] it would be nice if we could define the export of a small subset of icinga db info to the nrpe script as additional input, but I'm not sure such a thing exists [14:04:53] e.g. 
"When executing NRPE check X, send it as input a list of all non-down hosts which have the same NRPE check configured" [14:05:08] (and then the check would only go fatal if the not-connected peers were in that list) [14:05:34] bblack: i cant be sure of specifics sadly, i can only bounce ideas, sorry [14:05:49] the input is valuable :) [14:06:08] (03CR) 10Filippo Giunchedi: [C: 032] install_server: shrink raid partition for cassandra commitlog/cache [puppet] - 10https://gerrit.wikimedia.org/r/374993 (owner: 10Filippo Giunchedi) [14:06:10] but we've stared at this a bit and there's no trivial answer. someone's going to have to implement an ugly answer :) [14:06:23] (Feel free to paste this in the task if need be bblack ) [14:07:29] <_joe_> this is a limitation in icinga [14:07:35] <_joe_> more or less [14:07:43] Agreed [14:09:14] (03PS3) 10Andrew Bogott: openstack: refine firewall rules for controller [puppet] - 10https://gerrit.wikimedia.org/r/374644 (https://phabricator.wikimedia.org/T171494) [14:10:31] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 82 ESP OK [14:10:31] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 82 ESP OK [14:10:31] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 82 ESP OK [14:10:33] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3545135 (10Joe) @aaron so you're saying that when we have someone editing a lot of pages with a lot of backlinks we will see the jobqueue growing basical... 
[14:10:40] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 68 ESP OK [14:10:40] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 82 ESP OK [14:10:41] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 68 ESP OK [14:10:41] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 68 ESP OK [14:10:51] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 82 ESP OK [14:12:32] (03CR) 10Andrew Bogott: [C: 032] openstack: refine firewall rules for controller [puppet] - 10https://gerrit.wikimedia.org/r/374644 (https://phabricator.wikimedia.org/T171494) (owner: 10Andrew Bogott) [14:13:05] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 68 ESP OK [14:13:06] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 68 ESP OK [14:13:06] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 68 ESP OK [14:13:06] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 68 ESP OK [14:13:16] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 82 ESP OK [14:13:16] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 68 ESP OK [14:13:16] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 82 ESP OK [14:13:25] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 68 ESP OK [14:13:25] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 68 ESP OK [14:13:25] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 82 ESP OK [14:13:25] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 82 ESP OK [14:14:41] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3570018 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp4022.ulsfo.wmnet'] ``` and were **ALL** successful. 
[14:15:45] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 68 ESP OK [14:15:46] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 82 ESP OK [14:16:56] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3570037 (10Joe) Correcting myself after a discussion with @ema: since we have up to 4 cache layers (at most), we should process any job with a root times... [14:25:54] (03CR) 10Hashar: [C: 04-1] "Taking php memcached as an example." [puppet] - 10https://gerrit.wikimedia.org/r/346165 (owner: 10Hashar) [14:26:34] (03Abandoned) 10Hashar: contint: PHP packages cleanup [puppet] - 10https://gerrit.wikimedia.org/r/346165 (owner: 10Hashar) [14:26:42] (03CR) 10Filippo Giunchedi: "PCC says yes https://puppet-compiler.wmflabs.org/compiler02/7669/" [puppet] - 10https://gerrit.wikimedia.org/r/374994 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [14:27:54] (03PS1) 10Alexandros Kosiaris: Rebuild for stretch [debs/cni] - 10https://gerrit.wikimedia.org/r/374998 [14:28:37] (03PS1) 10Hashar: contint: include mediawiki::packages::php5 [puppet] - 10https://gerrit.wikimedia.org/r/374999 [14:29:27] (03CR) 10Filippo Giunchedi: "Note this is limited to restbase2001 to start with, eventually these are all the machines in the new cluster" [puppet] - 10https://gerrit.wikimedia.org/r/374994 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [14:29:32] (03PS8) 10Mforns: [WIP] eventlogging_cleaner.py: improve sanitize method [puppet] - 10https://gerrit.wikimedia.org/r/374823 (https://phabricator.wikimedia.org/T156933) [14:29:46] (03PS1) 10Alexandros Kosiaris: Rebuild for stretch [debs/cni] - 10https://gerrit.wikimedia.org/r/375000 [14:30:42] (03CR) 10Hashar: "That is merely "upstreaming" the puppet bit we have in integration/config. 
I am removing the include with https://gerrit.wikimedia.org/r/" [puppet] - 10https://gerrit.wikimedia.org/r/374999 (owner: 10Hashar) [14:31:17] (03PS1) 10Elukey: Tune the kafka-jumbo.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/375002 (https://phabricator.wikimedia.org/T174457) [14:32:24] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3570090 (10Agabi10) @Joe, that might be true for the htmlCacheUpdate jobs, but not for the refreshLinks jobs. From my understanding, the refreshLinks job... [14:37:42] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3570102 (10elukey) The last code review removes the need for the 'placeholder' logical volume and removes a unused/not-necessary par... [14:44:00] (03PS1) 10Muehlenhoff: Extend account data [puppet] - 10https://gerrit.wikimedia.org/r/375005 [14:49:05] (03CR) 10Muehlenhoff: [C: 032] Extend account data [puppet] - 10https://gerrit.wikimedia.org/r/375005 (owner: 10Muehlenhoff) [14:56:59] (03PS9) 10Elukey: [WIP] eventlogging_cleaner.py: improve sanitize method [puppet] - 10https://gerrit.wikimedia.org/r/374823 (https://phabricator.wikimedia.org/T156933) (owner: 10Mforns) [14:58:11] (03Abandoned) 10Alexandros Kosiaris: Rebuild for stretch [debs/cni] - 10https://gerrit.wikimedia.org/r/374998 (owner: 10Alexandros Kosiaris) [14:58:33] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3570152 (10bd808) @jcrespo's plan sounds like a good one. Working on the announce of the new cluster was already on my todo list for today, so I'll add foresh... [15:11:39] (03CR) 10Mforns: [C: 031] "The dependency is merged now, this can be reviewed and merged if OK!" 
[puppet] - 10https://gerrit.wikimedia.org/r/374878 (https://phabricator.wikimedia.org/T170850) (owner: 10Mforns) [15:14:18] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns entries for mw1307=1328 T165519 [dns] - 10https://gerrit.wikimedia.org/r/374660 (owner: 10Cmjohnson) [15:14:42] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:17:32] PROBLEM - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:18:01] !log restart zookeeper on conf200[2,3] for jvm security updates [15:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:03] PROBLEM - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:20:18] that's me ^ silencing [15:25:29] 10Operations, 10monitoring, 10Patch-For-Review: Investigate check_nrpe -u option to reduce critical alerts - https://phabricator.wikimedia.org/T172131#3570218 (10herron) This is looking good so far. Going to keep an eye on it for the rest of the day before resolving [15:26:49] mobrovac,Pchelolo - hello people, I just noticed an issue with zookeeper on conf200[12] after restarting them for jvm security updates [15:27:16] conf2003 is the leader afaics and it should keep serving traffic, but please check eventbus [15:27:29] elukey: mobrovac is on vacation but what's up? 
[15:27:35] (03PS2) 10Ottomata: [WIP] Add reportupdater job to trigger page-creation metrics [puppet] - 10https://gerrit.wikimedia.org/r/374878 (https://phabricator.wikimedia.org/T170850) (owner: 10Mforns) [15:28:14] eventbus is fine, it's doesn't talk to zookeeper directly, only via kafka [15:29:06] yep, I meant the whole thing (so even kafka) [15:29:25] for some reason conf200[12] seems to have trouble in the logs [15:29:27] I don't really see any visible issues there [15:29:31] okok good [15:29:36] just wanted to give you an heads up [15:29:41] (03CR) 10Ottomata: [C: 032] [WIP] Add reportupdater job to trigger page-creation metrics [puppet] - 10https://gerrit.wikimedia.org/r/374878 (https://phabricator.wikimedia.org/T170850) (owner: 10Mforns) [15:30:20] ottomata: I'd use some help if you have time (zookeeper) [15:36:59] the zookeeper codfw cluster seems not working Pchelolo, checking kafka [15:40:25] (03Draft2) 10Jayprakash12345: Enable ArticlePlaceholder on sqwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375015 (https://phabricator.wikimedia.org/T174335) [15:47:09] elukey: still in post standup, what's happening? 
[15:47:28] ottomata: zookeeper on conf200[123] seems down after a couple of restarts [15:47:34] (03PS1) 10Cmjohnson: Adding productin dns for mw1307-1328 T165519 [dns] - 10https://gerrit.wikimedia.org/r/375017 [15:48:08] I think that firewall rules might be the problem [15:52:08] (03CR) 10Cmjohnson: [C: 032] Adding productin dns for mw1307-1328 T165519 [dns] - 10https://gerrit.wikimedia.org/r/375017 (owner: 10Cmjohnson) [15:52:45] (03CR) 10Hashar: "recheck" [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/373513 (https://phabricator.wikimedia.org/T174008) (owner: 10Volans) [15:54:23] PROBLEM - Check whether ferm is active by checking the default input chain on conf2002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [15:54:33] PROBLEM - Check whether ferm is active by checking the default input chain on conf2003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [15:54:52] PROBLEM - Check whether ferm is active by checking the default input chain on conf2001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [15:56:46] this is me --^ [16:00:05] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170831T1600). Please do the needful. [16:00:05] Amir1: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:27] o/ [16:01:45] legoktm: heyas, i saw your ldap request, i need to know his wmf email wikitech account, i udpated the task =] [16:01:58] samwilson on wikitech is his personal one. [16:02:23] I prefer to tie work ldap accounts to work wikitech accounts (it makes things easier if the person leaves later and remains a volunteer) [16:02:53] (03CR) 10ArielGlenn: "One comment inline. 
Also you will need to add the other/categoriesrdf directory to the modules/datasets/manifests/dirs.pp so it gets creat" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/373354 (https://phabricator.wikimedia.org/T173892) (owner: 10Smalyshev) [16:09:03] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/374994 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [16:09:27] (03PS1) 10Elukey: role::common::configcluster: allow zookeeper daemons to communicate [puppet] - 10https://gerrit.wikimedia.org/r/375023 [16:11:44] 10Operations, 10ops-eqiad, 10Analytics-Cluster, 10Analytics-Kanban, and 2 others: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3570463 (10Nuria) [16:12:09] 10Operations, 10ops-eqiad, 10Analytics-Cluster, 10Analytics-Kanban, and 2 others: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3562647 (10Nuria) [16:17:57] (03PS2) 10Elukey: role::common::configcluster: allow zookeeper daemons to communicate [puppet] - 10https://gerrit.wikimedia.org/r/375023 [16:18:20] (03CR) 10jerkins-bot: [V: 04-1] role::common::configcluster: allow zookeeper daemons to communicate [puppet] - 10https://gerrit.wikimedia.org/r/375023 (owner: 10Elukey) [16:19:13] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:19:24] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:19:24] PROBLEM - puppet last run on mw1186 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:19:53] PROBLEM - puppet last run on db1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:19:53] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:20:13] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:20:23] PROBLEM - puppet last run on db1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:20:55] nitrogen seems returning 502s [16:21:20] ah something might have happened https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=nitrogen&refresh=1m&orgId=1 [16:21:50] the OOM killed puppetdb [16:27:06] (03PS3) 10Elukey: role::common::configcluster: allow zookeeper daemons to communicate [puppet] - 10https://gerrit.wikimedia.org/r/375023 [16:27:26] (03CR) 10jerkins-bot: [V: 04-1] role::common::configcluster: allow zookeeper daemons to communicate [puppet] - 10https://gerrit.wikimedia.org/r/375023 (owner: 10Elukey) [16:28:28] trying to figure out what is upsetting jenkins [16:29:02] ahh the comment [16:29:48] (03PS4) 10Elukey: role::common::configcluster: allow zookeeper daemons to communicate [puppet] - 10https://gerrit.wikimedia.org/r/375023 [16:31:47] Pchelolo: some impact to eventbus was definitely registered - https://grafana.wikimedia.org/dashboard/db/eventbus?refresh=1m&orgId=1&from=1504194351054&to=1504195533536 [16:33:21] so the zk cluster was not available, and some kafka ops were not available [16:33:40] meaning that kafka was not able to update broker [16:33:47] broker's topic metadata, etc.. [16:33:57] hopefully nothing bad was registered [16:34:37] elukey: indeed... I think CP wanted to post those messages to kafka and was stuck waiting for kafka delivery reports and then accumulated a ton of promises and as it all got back to normal it posted all at once [16:35:17] super, feel better now [16:35:36] what is the amount of messages that cp can hold when kafka is not available? [16:35:40] (curious) [16:35:54] hm.... 
actually I don't think I'm right [16:36:15] it looks more like the issue was with MirrorMaker [16:37:42] that one was probably affected since IIRC it uses the old consumer/producer that talks with zk [16:37:53] elukey: look for example on codfw_change-prop_transcludes_resource-change [16:38:23] there's 2 of them - one is real, the second one is made by mirrormaker [16:38:52] one of them has no spike - it's smeeth and nice. The second one goes to 0 for a while and then has a huge spike [16:39:03] and no CP executions rates were actually affected [16:39:21] so I think that spike came from MirrorMaker [16:39:28] ok so old consumer/producers using zk were affected, meanwhile new ones no [16:39:47] so nothing was really affected at all since the mirrorer topics are not used [16:39:58] s/mirrorer/mirrored [16:40:15] I'll of course write an incident report [16:40:22] but so far it seems that nothing big happened [16:41:17] (03CR) 10Giuseppe Lavagetto: [C: 031] role::common::configcluster: allow zookeeper daemons to communicate [puppet] - 10https://gerrit.wikimedia.org/r/375023 (owner: 10Elukey) [16:44:29] (03PS4) 10Filippo Giunchedi: cassandra: jbod devices configuration [puppet] - 10https://gerrit.wikimedia.org/r/374994 (https://phabricator.wikimedia.org/T169939) [16:45:08] (03CR) 10Filippo Giunchedi: [C: 032] cassandra: jbod devices configuration [puppet] - 10https://gerrit.wikimedia.org/r/374994 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [16:46:04] (03CR) 10Elukey: [C: 032] role::common::configcluster: allow zookeeper daemons to communicate [puppet] - 10https://gerrit.wikimedia.org/r/375023 (owner: 10Elukey) [16:46:09] (03PS5) 10Elukey: role::common::configcluster: allow zookeeper daemons to communicate [puppet] - 10https://gerrit.wikimedia.org/r/375023 [16:46:22] snipered by Filippo [16:47:04] RECOVERY - puppet last run on db1049 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [16:47:34] RECOVERY - puppet 
last run on nescio is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [16:47:43] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [16:47:43] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [16:47:44] RECOVERY - puppet last run on mw1186 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [16:48:13] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [16:48:24] RECOVERY - Check whether ferm is active by checking the default input chain on conf2001 is OK: OK ferm input default policy is set [16:48:43] RECOVERY - puppet last run on db1050 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [16:53:13] RECOVERY - Check whether ferm is active by checking the default input chain on conf2002 is OK: OK ferm input default policy is set [16:57:14] RECOVERY - Check whether ferm is active by checking the default input chain on conf2003 is OK: OK ferm input default policy is set [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170831T1700). Please do the needful. [17:00:15] Nothing for ORES today. [17:16:29] RainbowSprinkles: can you add https://gerrit.wikimedia.org/r/375028 to train? [17:16:35] fixes the log error problem [17:17:05] We'll do it prior to the train :) [17:17:14] RainbowSprinkles: sweet. so no need for me to add to swat calendar? 
[17:17:17] (03PS1) 10Thcipriani: scap: upgrade to 3.7.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/375029 (https://phabricator.wikimedia.org/T127762) [17:17:21] Nope, I'll jfdi :) [17:28:40] (03PS3) 10Eevans: Use absolute paths for `data_file_directories` [puppet] - 10https://gerrit.wikimedia.org/r/372469 (https://phabricator.wikimedia.org/T169939) [17:38:13] (03PS1) 10Ottomata: Add comment about reportupdater's usage of a my.cnf file [puppet] - 10https://gerrit.wikimedia.org/r/375033 [17:43:45] (03PS1) 10RobH: adding samwilson to admin module [puppet] - 10https://gerrit.wikimedia.org/r/375034 (https://phabricator.wikimedia.org/T174644) [17:44:08] (03CR) 10RobH: [C: 032] adding samwilson to admin module [puppet] - 10https://gerrit.wikimedia.org/r/375034 (https://phabricator.wikimedia.org/T174644) (owner: 10RobH) [17:44:34] (03CR) 10Ottomata: [C: 032] Add comment about reportupdater's usage of a my.cnf file [puppet] - 10https://gerrit.wikimedia.org/r/375033 (owner: 10Ottomata) [17:44:38] (03PS2) 10Ottomata: Add comment about reportupdater's usage of a my.cnf file [puppet] - 10https://gerrit.wikimedia.org/r/375033 [17:44:40] (03CR) 10Ottomata: [V: 032 C: 032] Add comment about reportupdater's usage of a my.cnf file [puppet] - 10https://gerrit.wikimedia.org/r/375033 (owner: 10Ottomata) [17:50:03] !log demon@tin Synchronized php-1.30.0-wmf.16/skins/MinervaNeue/resources/skins.minerva.icons.images.scripts/watch.svg: sorta impromptu swat thing (duration: 00m 47s) [17:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:07] jdlrobson: Live errywhur ^ [17:55:14] PROBLEM - Unmerged changes on repository puppet on puppetmaster2002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[17:55:17] w00t [17:55:40] RainbowSprinkles: watchstar is back https://m.mediawiki.org/wiki/Extension:MobileFrontend [17:56:59] Hehe you can see the errors disappear clearly https://usercontent.irccloud-cdn.com/file/zJ1cKig0/bye-bye-bye.png [17:57:18] RainbowSprinkles: :) :) [18:00:01] PROBLEM - eventstreams on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 8092: Connection refused [18:00:01] RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.009 second response time [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170831T1800). Please do the needful. [18:00:55] (03Draft2) 10Reedy: Fix links on highlight.php for dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375036 (https://phabricator.wikimedia.org/T174703) [18:01:56] (03PS3) 10Reedy: Fix links on highlight.php for dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375036 (https://phabricator.wikimedia.org/T174703) [18:09:24] jouncebot, okay, I'll do SWAT. [18:15:30] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3570981 (10aaron) >>! In T173710#3570037, @Joe wrote: > Correcting myself after a discussion with @ema: since we have up to 4 cache layers (at most), we... [18:17:24] (03CR) 10Eevans: [C: 031] "Also (now) includes restbase2001; Puppet output [here](http://puppet-compiler.wmflabs.org/7672/)." 
[puppet] - 10https://gerrit.wikimedia.org/r/372469 (https://phabricator.wikimedia.org/T169939) (owner: 10Eevans) [18:18:14] PROBLEM - Host druid1006 is DOWN: PING CRITICAL - Packet loss = 100% [18:19:43] RECOVERY - Host druid1006 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [18:21:17] 10Operations, 10ops-eqiad, 10Services (doing): Disk errors: restbase1010.eqiad.wmnet - https://phabricator.wikimedia.org/T174392#3560205 (10Eevans) @Cmjohnson Can you confirm whether we have spare Samsung drives for this in inventory? Do you have an ETA on replacement, so that we know how to plan on our end? [18:22:10] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3545519 (10Legoktm) Could we always bump page_touched, but only send the purges to varnish if the timestamp is within the past four days? Would that let... [18:23:23] RECOVERY - MariaDB Slave Lag: s2 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89953.22 seconds [18:23:43] 10Operations, 10ops-eqiad, 10Services (doing): Disk errors: restbase1010.eqiad.wmnet - https://phabricator.wikimedia.org/T174392#3571023 (10Cmjohnson) @Eevans no we do not. Do you want it fixed or a disk ordered? I may have misunderstood what you meant by decommission. [18:24:41] 10Operations, 10ops-eqiad, 10Services (doing): Disk errors: restbase1010.eqiad.wmnet - https://phabricator.wikimedia.org/T174392#3571024 (10Cmjohnson) looping @robh in to order a new disk. [18:26:42] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3571028 (10EBernhardson) With the refresh links problem looking mostly resolved, the remaining top queues in the job queue (as of aug 31, 1am UTC): ```... 
[18:27:11] (03PS1) 10Niharika29: Avoid pinging deployers unless there are patches to be deployed [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/375046 [18:28:26] (03CR) 10Jforrester: [C: 031] Avoid pinging deployers unless there are patches to be deployed [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/375046 (owner: 10Niharika29) [18:29:10] (03PS3) 10Mattflaschen: Fix interwiki links on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374922 (https://phabricator.wikimedia.org/T69931) [18:29:32] (03CR) 10Mattflaschen: [C: 032] Fix interwiki links on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374922 (https://phabricator.wikimedia.org/T69931) (owner: 10Mattflaschen) [18:31:27] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3571046 (10EBernhardson) >>! In T173710#3571009, @Legoktm wrote: > Could we always bump page_touched, but only send the purges to varnish if the timestam... [18:32:30] (03Merged) 10jenkins-bot: Fix interwiki links on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374922 (https://phabricator.wikimedia.org/T69931) (owner: 10Mattflaschen) [18:32:44] (03CR) 10jenkins-bot: Fix interwiki links on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374922 (https://phabricator.wikimedia.org/T69931) (owner: 10Mattflaschen) [18:33:46] 10Operations, 10ops-eqiad, 10Services (doing): Disk errors: restbase1010.eqiad.wmnet - https://phabricator.wikimedia.org/T174392#3560205 (10GWicke) No Samsung spares would be surprising, given our last conversation on the topic in April, and from what I remember about the stock back then. 
[18:33:51] (03PS4) 10Eevans: Use absolute paths for `data_file_directories` [puppet] - 10https://gerrit.wikimedia.org/r/372469 (https://phabricator.wikimedia.org/T169939) [18:33:53] (03PS1) 10Eevans: Instance-configurable `heapdump_directory` [puppet] - 10https://gerrit.wikimedia.org/r/375048 (https://phabricator.wikimedia.org/T169939) [18:34:47] (03PS6) 10Smalyshev: Add RDF dumps for categories [puppet] - 10https://gerrit.wikimedia.org/r/373354 (https://phabricator.wikimedia.org/T173892) [18:35:11] (03CR) 10Smalyshev: Add RDF dumps for categories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/373354 (https://phabricator.wikimedia.org/T173892) (owner: 10Smalyshev) [18:39:46] (03PS1) 10Eevans: Configure `disk_failure_policy: best_effort` [puppet] - 10https://gerrit.wikimedia.org/r/375049 (https://phabricator.wikimedia.org/T169939) [18:40:37] (03CR) 10Zoranzoki21: [C: 031] Fix links on highlight.php for dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375036 (https://phabricator.wikimedia.org/T174703) (owner: 10Reedy) [18:43:19] (03PS1) 10Mattflaschen: Fix interwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375051 [18:43:28] (03CR) 10jerkins-bot: [V: 04-1] Fix interwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375051 (owner: 10Mattflaschen) [18:43:46] (03PS2) 10Mattflaschen: Fix interwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375051 [18:44:03] (03CR) 10Mattflaschen: [C: 032] Fix interwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375051 (owner: 10Mattflaschen) [18:44:45] (03CR) 10Eevans: [C: 031] "Updated [Puppet compiler output](http://puppet-compiler.wmflabs.org/7673/)" [puppet] - 10https://gerrit.wikimedia.org/r/372469 (https://phabricator.wikimedia.org/T169939) (owner: 10Eevans) [18:45:33] (03Merged) 10jenkins-bot: Fix interwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375051 (owner: 10Mattflaschen) [18:45:58] (03CR) 10jenkins-bot: Fix interwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/375051 (owner: 10Mattflaschen) [18:48:48] (03PS6) 10Niharika29: Deploy scholarships with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/326461 (https://phabricator.wikimedia.org/T129134) [18:49:15] (03PS2) 10Eevans: Instance-configurable `heapdump_directory` [puppet] - 10https://gerrit.wikimedia.org/r/375048 (https://phabricator.wikimedia.org/T169939) [18:49:17] (03PS2) 10Eevans: Configure `disk_failure_policy: best_effort` [puppet] - 10https://gerrit.wikimedia.org/r/375049 (https://phabricator.wikimedia.org/T169939) [18:50:17] (03CR) 10Thcipriani: [C: 031] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/326461 (https://phabricator.wikimedia.org/T129134) (owner: 10Niharika29) [18:51:36] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3571127 (10Cmjohnson) @madhuvishy I reseated the controller card, removed all cabling, and started over again. This time around, the controller c... [18:51:51] !log mattflaschen@tin Synchronized wmf-config: Interwiki links on Beta Cluster: T69931 (duration: 00m 49s) [18:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:06] T69931: Beta should not use productions interwiki.php - https://phabricator.wikimedia.org/T69931 [18:52:35] (03PS7) 10Niharika29: Deploy scholarships with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/326461 (https://phabricator.wikimedia.org/T129134) [18:53:11] !log mattflaschen@tin Synchronized docroot/noc/conf/interwiki-labs.php.txt: Interwiki links on Beta Cluster: T69931 (duration: 00m 47s) [18:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:30] (03CR) 10Niharika29: "Copied the scap::sources bit from to hieradata/labs/deployment-prep/common.yaml in last patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/326461 (https://phabricator.wikimedia.org/T129134) (owner: 10Niharika29) [18:55:51] (03PS3) 10Eevans: Instance-configurable `heapdump_directory` [puppet] - 10https://gerrit.wikimedia.org/r/375048 (https://phabricator.wikimedia.org/T169939) [18:55:53] (03PS3) 10Eevans: Configure `disk_failure_policy: best_effort` [puppet] - 10https://gerrit.wikimedia.org/r/375049 (https://phabricator.wikimedia.org/T169939) [18:57:27] greg-g, RainbowSprinkles, I'll be a few minutes late finishing SWAT. [18:58:50] 10Operations, 10ops-eqiad, 10Services (doing): Disk errors: restbase1010.eqiad.wmnet - https://phabricator.wikimedia.org/T174392#3571138 (10Eevans) >>! In T174392#3571023, @Cmjohnson wrote: > @Eevans no we do not. Do you want it fixed or a disk ordered? I may have misunderstood what you meant by decommissio... [19:00:05] RainbowSprinkles: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170831T1900). Please do the needful. [19:00:17] matt_flaschen: SORRY TIME IS UP I SHALL END ALL YOUR SCAPS RIGHT NOW [19:00:24] (jk, go ahead, I'll just have another coffee) [19:00:34] (03CR) 10Dmaza: Migrate AbuseFilter config off wmg variables, part 1 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374651 (owner: 10MaxSem) [19:00:49] RainbowSprinkles, thanks, I also just realized I have to scap (the old meaning of scap). [19:02:59] full scaps aren't scary anymore :) [19:03:15] Best case scenario, I've gotten them down to about 15 minutes. [19:03:15] RainbowSprinkles, okay, cool, was about to ask you that. How long does it take now, docs still say 30-60? 
[19:03:22] Thanks [19:03:35] Granted, with enough to change or redo with l10n, you can take 45mins [19:03:41] But the *ideal* scenario is fast now [19:04:14] (03CR) 10Thcipriani: [C: 031] "> Copied the scap::sources bit from to hieradata/labs/deployment-prep/common.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/326461 (https://phabricator.wikimedia.org/T129134) (owner: 10Niharika29) [19:04:42] RainbowSprinkles, okay, tried to update https://wikitech.wikimedia.org/wiki/How_to_deploy_code#More_complex_changes:_sync_everything . [19:05:46] (03CR) 10Eevans: [C: 031] "[Puppet compiler output](http://puppet-compiler.wmflabs.org/7677/)" [puppet] - 10https://gerrit.wikimedia.org/r/375048 (https://phabricator.wikimedia.org/T169939) (owner: 10Eevans) [19:05:50] !log mattflaschen@tin Started scap: Watchlist filters: Convert edit watchlist button to new UX and fix server-side tag filtering. T172030 [19:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:04] T172030: Integrate Watchlist-management links and page info into the new UX - https://phabricator.wikimedia.org/T172030 [19:06:15] RainbowSprinkles, it's just one message this time, so hopefully the 15. [19:06:45] 15 is syncing everything, but no l10n rebuild IIRC [19:07:01] But if you've only changed en and qqq... It shouldn't be too bad [19:07:03] Yeah. Also implies a /very/ clean /srv/mediawiki-staging/ [19:07:13] (working on fixing that even more) [19:10:05] (03CR) 10Eevans: [C: 031] "[Puppet compiler output](http://puppet-compiler.wmflabs.org/7678/)" [puppet] - 10https://gerrit.wikimedia.org/r/375049 (https://phabricator.wikimedia.org/T169939) (owner: 10Eevans) [19:17:10] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3571201 (10Eevans) Open changesets: - https://gerrit.wikimedia.org/r/372469 Use fully qualified `data_file_directories` (incl. configuration of 20... 
[19:17:19] !log powercycled mw1294 [19:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:15] (03PS1) 10Chad: Group2 to wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375059 [19:27:14] !log mattflaschen@tin Finished scap: Watchlist filters: Convert edit watchlist button to new UX and fix server-side tag filtering. T172030 (duration: 21m 23s) [19:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:26] T172030: Integrate Watchlist-management links and page info into the new UX - https://phabricator.wikimedia.org/T172030 [19:30:27] RainbowSprinkles, I did a full scap, still seeing a missing message at https://www.mediawiki.org/wiki/Special:Watchlist?rcfilters=1&hidepreviousrevisions=1&hidecategorization=1&hideWikibase=1&limit=250&days=30&urlversion=2&tagfilter=visualeditor . [19:30:43] RainbowSprinkles, some kind of caching issue? [19:30:57] Shouldn't be [19:31:15] 10Operations, 10fundraising-tech-ops, 10netops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3571236 (10Jgreen) We started testing on mintaka, here's what changed: 1) @ayounsi enabled ge-0/0/11 and added description mintaka:eth1 2) @jgreen did (by... [19:31:51] It's dark-launched, but still want to resolve this. [19:33:01] https://www.mediawiki.org/w/load.php?debug=false&lang=en&modules=mediawiki.rcfilters.filters.ui shows it, but I don't know why. [19:33:12] Shows it wrong [19:33:15] "rcfilters-watchlist-editWatchlist-button" [19:33:20] 10Operations, 10fundraising-tech-ops, 10netops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3571237 (10Jgreen) @ayounsi also enabled and labeled frdb2001/eth1's switch port. 
[19:33:31] https://www.mediawiki.org/wiki/MediaWiki:Rcfilters-watchlist-editWatchlist-button [19:33:35] MW knows it [19:34:05] well if it's in MW, it probably has to do with https://wikitech.wikimedia.org/wiki/How_to_deploy_code#ResourceLoader_and_l10n_messages [19:34:07] Yeah, could be server-side RL cache pollution from me viewing the wrong version on mwdebug1002 (before scap sync) [19:34:24] Thanks, thcipriani [19:34:42] yw, hopefully that's helpful :) [19:35:29] "wait" [19:36:14] Now it's right at https://www.mediawiki.org/w/load.php?debug=false&lang=en&modules=mediawiki.rcfilters.filters.ui , but not in Chrome, despite Empty cache and hard reload [19:36:47] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3571239 (10GWicke) I updated https://gerrit.wikimedia.org/r/#/c/295027/ to apply on current master. This removes CDN purges from HTMLCacheUpdate, and onl... [19:37:08] Reedy, okay, cleared localStorage, good now. [19:37:54] It works in one tab of Chromium but not the other? [19:38:09] computers suck [19:38:15] :) [19:38:31] Reedy, okay, you're good to go. I'm sorry for the delay. [19:38:37] It's working both tabs now [19:39:30] (03CR) 10MaxSem: Migrate AbuseFilter config off wmg variables, part 1 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374651 (owner: 10MaxSem) [19:40:42] thcipriani, I need to do it on every wiki? I guess that makes sense, these two should be the only ones messed up, since I didn't visit it anywhere else. [19:42:07] hrm, not sure if you need to clear multiple wikis... [19:44:04] thcipriani, it still doesn't work for me even after I run it specifically on meta. 
:( [19:44:07] thcipriani, https://meta.wikimedia.org/w/load.php?debug=false&lang=en&modules=mediawiki.rcfilters.filters.ui [19:46:08] (03CR) 10Dmaza: [C: 031] Migrate AbuseFilter config off wmg variables, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374651 (owner: 10MaxSem) [19:46:29] stephanebisson, it's working everywhere except Meta AFAICT, not on most Wikipedias yet due to train. But working on Catalan Wikipedia [19:46:59] 10Operations, 10Traffic: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3571269 (10TheDJ) I have now tested this and the preview does indeed implement origin-when-cross-origin as directed by our meta referrer header. [19:47:24] matt_flaschen: ok, thanks. I have to wait for the train, en.wp is where its gonna be interesting [19:47:46] Okay, I guess it was Varnish. Working everywhere now (except for later wikis in train). ^ stephanebisson [19:48:31] stephanebisson, see also thcipriani's note about ResourceLoader l10n caching: https://wikitech.wikimedia.org/wiki/How_to_deploy_code#ResourceLoader_and_l10n_messages . I think you ran into this a little while ago. [19:49:33] I see that once in a while locally, it may be the same problem, reboot is what fixes it for me [19:49:51] Hmm, in that case it might be enough to clear localStorage and sessionStorage. [19:50:05] stephanebisson, sometimes copying to Beta Cluster is also helpful (avoid waiting for prod). Have you seen where gadgets are defined (to know what to copy)? [19:50:22] stephanebisson, and do you have admin on Beta Cluster? [19:50:43] matt_flaschen: I'm not admin [19:51:21] stephanebisson, do you want me to make you one? 
[19:51:48] matt_flaschen: and I'm not sure how to track where the gadgets in enwp > Preferences > Gadgets > Watchlist are being defined [19:51:52] (03PS1) 10Cmjohnson: Adding mac addresses for mw1307-1328 T165519 [puppet] - 10https://gerrit.wikimedia.org/r/375069 [19:51:56] matt_flaschen: sure [19:52:06] stephanebisson, https://en.wikipedia.org/wiki/MediaWiki:Gadgets-definition . [19:52:08] can't hurt, unless I break everything, of course [19:52:36] And the names there are pages in the gadgets ns? [19:53:11] Can I train yet? [19:53:56] RainbowSprinkles, yeah. Oops, I pinged Reedy instead of you by mistake above. [19:54:09] (03PS3) 10Chad: Migrate AbuseFilter config off wmg variables, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374651 (owner: 10MaxSem) [19:54:23] (03CR) 10Chad: "Crap sorry, did not mean to rebase. Was looking at the wrong tab :(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374651 (owner: 10MaxSem) [19:55:27] stephanebisson, Gadget namespace is not used in prod yet, though I just saw it exists (and there is even a redirect to an article there), which is not good. [19:55:30] stephanebisson, the current scheme is: [19:55:35] MediaWiki:Gadget- [19:55:40] and then the filename given at that page. [19:55:51] (03CR) 10Chad: [C: 032] Group2 to wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375059 (owner: 10Chad) [19:55:56] stephanebisson, e.g. watchlist-notice.js => https://en.wikipedia.org/wiki/MediaWiki:Gadget-watchlist-notice.js [19:56:17] And before the first bracket ([) is the module name. [19:56:44] RainbowSprinkles, sorry to hold up the train.
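The naming scheme matt_flaschen describes above (module name before the first bracket, then each listed filename living at MediaWiki:Gadget-<filename>) can be sketched in a few lines of Python. This is only an illustration, not how MediaWiki itself resolves gadgets: the function name gadget_pages and the sample definition line are hypothetical, and the entry layout "* name[options]|file|file" is an assumption about what a line on the Gadgets-definition page looks like.

```python
def gadget_pages(definition_line):
    """Map one Gadgets-definition entry to (module_name, page_titles).

    Assumed entry layout (not confirmed by the log): "* name[options]|file1|file2".
    """
    # Drop the leading list marker and surrounding whitespace.
    entry = definition_line.strip().lstrip("*").strip()

    bracket = entry.find("[")
    pipe = entry.find("|")
    if bracket != -1 and (pipe == -1 or bracket < pipe):
        # The module name is everything before the first "[".
        name, rest = entry.split("[", 1)
        # Skip the bracketed options; the files follow the closing "]".
        _, _, rest = rest.partition("]")
    else:
        # No options block: the name runs up to the first "|".
        name, _, rest = entry.partition("|")

    files = [part.strip() for part in rest.split("|") if part.strip()]
    # Each file lives at MediaWiki:Gadget-<filename>, per the scheme above.
    return name.strip(), ["MediaWiki:Gadget-" + f for f in files]


# Hypothetical sample entry, shaped after the watchlist-notice.js example above.
module, pages = gadget_pages(
    "* watchlist-notice[ResourceLoader|default]|watchlist-notice.js"
)
print(module, pages)
```

Prefixing each returned title with https://en.wikipedia.org/wiki/ gives URLs of the kind linked in the conversation.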
[19:57:22] (03Merged) 10jenkins-bot: Group2 to wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375059 (owner: 10Chad) [19:57:31] (03CR) 10jenkins-bot: Group2 to wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375059 (owner: 10Chad) [19:58:59] (03PS2) 10Cmjohnson: Adding mac addresses for mw1307-1328 T165519 [puppet] - 10https://gerrit.wikimedia.org/r/375069 [19:59:42] 10Operations: letsencrypt::cert::integrated and non-http servers - https://phabricator.wikimedia.org/T174720#3571294 (10herron) [20:00:16] (03CR) 10Cmjohnson: [C: 032] Adding mac addresses for mw1307-1328 T165519 [puppet] - 10https://gerrit.wikimedia.org/r/375069 (owner: 10Cmjohnson) [20:01:53] RECOVERY - Unmerged changes on repository puppet on puppetmaster2002 is OK: No changes to merge. [20:01:54] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3571312 (10Cmjohnson) [20:02:24] 10Operations: letsencrypt::cert::integrated and non-http servers - https://phabricator.wikimedia.org/T174720#3571294 (10Reedy) I'm not sure exactly how we do LE.. But some client implementations they can spin up a standalone webserver to make the request, and then shut itself down... [20:03:16] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: (no justification provided) [20:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:08] 10Operations, 10Mail: mail.wikimedia.org SSL cert expiring Mon 23 Oct 2017 - https://phabricator.wikimedia.org/T174081#3571331 (10herron) [20:04:10] 10Operations: letsencrypt::cert::integrated and non-http servers - https://phabricator.wikimedia.org/T174720#3571332 (10herron) [20:04:11] stephanebisson, what username should I add admin to? [20:04:32] Or I can create one. [20:05:24] Interesting. Bunch of commonswiki entries complaining about pcache entries being too large for memcached.
[20:05:54] LangSwitch comes to mind. [20:08:27] langswitch presumably wouldn't cause a large pcache entry, because its only large pre-parse. I would guess very large galleries are the culprit. some of them can be huge [20:09:03] matt_flaschen: Sbisson-beta [20:11:27] bawolff: https://logstash.wikimedia.org/goto/105ec328787aa4a5519db1c04314388c [20:12:19] stephanebisson, added. [20:13:29] /wiki/Emoji/Table [20:13:52] betting that's a table containing all 5 billion emoji [20:14:04] (03PS1) 10Herron: WIP: Add standalone letsencrypt nginx template [puppet] - 10https://gerrit.wikimedia.org/r/375071 (https://phabricator.wikimedia.org/T174720) [20:14:25] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add standalone letsencrypt nginx template [puppet] - 10https://gerrit.wikimedia.org/r/375071 (https://phabricator.wikimedia.org/T174720) (owner: 10Herron) [20:14:42] and /wiki/User:Bernd_Schwabe_in_Hannover/gallery [20:15:28] bawolff: I wonder if we can detect the too-large condition prior to trying to save it [20:15:40] How big is too large anyways? [20:15:42] (03PS1) 10Dmaza: Enable AbuseFilter runtime profile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375072 (https://phabricator.wikimedia.org/T161059) [20:16:05] bawolff: I suppose when $size > $max_size [20:16:06] ;-) [20:16:29] Did we not have some sort of limit to wiki page sizes? [20:16:32] google says 1 mb is the limit [20:16:38] Niharika: We do :) [20:16:52] But the parser cache entry might...be smaller than that size? [20:16:52] but its on raw wikitext size, not the size of the resulting html [20:16:54] Right [20:17:00] Ah. [20:17:23] Er....the max size might be smaller than resulting parser cache entry size [20:17:34] although there are other parser limits, and I don't know what all of them do. 
Some of them might relate to html size [20:20:42] $wgMaxArticleSize is also currently 2 MB, and the memcache limit is 1 mb, so there is also that [20:22:20] (03PS2) 10Dmaza: Enable AbuseFilter runtime profile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375072 (https://phabricator.wikimedia.org/T161059) [20:24:09] according to google, the limit in memcached is also now configurable [20:24:21] (03PS2) 10Madhuvishy: new *.wmflabs.org certificate for cert expiry on 2017-10-16 [puppet] - 10https://gerrit.wikimedia.org/r/374873 (https://phabricator.wikimedia.org/T174053) (owner: 10RobH) [20:24:49] bawolff: Wonder if we do configure it, or if we use defaults [20:25:17] Also: is raising the limit correct? Stuffing giant stuff in memcached sounds like a recipe for sadness. [20:25:18] (03CR) 10Madhuvishy: [C: 032] new *.wmflabs.org certificate for cert expiry on 2017-10-16 [puppet] - 10https://gerrit.wikimedia.org/r/374873 (https://phabricator.wikimedia.org/T174053) (owner: 10RobH) [20:26:00] yeah, I have no idea. Does seem like a bad solution beyond a certain point [20:27:54] 10Operations, 10Patch-For-Review: letsencrypt::cert::integrated and non-http servers - https://phabricator.wikimedia.org/T174720#3571393 (10herron) https://gerrit.wikimedia.org/r/375071 hopefully gives the gist of an approach where systems without an existing webserver use a simple nginx site that 403s everyth... [20:58:02] 10Operations, 10Patch-For-Review: letsencrypt::cert::integrated and non-http servers - https://phabricator.wikimedia.org/T174720#3571491 (10herron) @Reedy interesting, I wonder if the client we currently use supports this? [21:00:35] 10Operations, 10Patch-For-Review: letsencrypt::cert::integrated and non-http servers - https://phabricator.wikimedia.org/T174720#3571507 (10Reedy) What client do we use? 
https://certbot.eff.org/docs/using.html#standalone [21:05:36] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3571512 (10aaron) >>! In T173710#3571046, @EBernhardson wrote: >>>! In T173710#3571009, @Legoktm wrote: >> Could we always bump page_touched, but only se... [21:08:52] (03PS1) 10RobH: labstore100[67] install params [puppet] - 10https://gerrit.wikimedia.org/r/375079 (https://phabricator.wikimedia.org/T167984) [21:09:18] (03CR) 10jerkins-bot: [V: 04-1] labstore100[67] install params [puppet] - 10https://gerrit.wikimedia.org/r/375079 (https://phabricator.wikimedia.org/T167984) (owner: 10RobH) [21:10:53] (03PS2) 10RobH: labstore100[67] install params [puppet] - 10https://gerrit.wikimedia.org/r/375079 (https://phabricator.wikimedia.org/T167984) [21:11:16] (03CR) 10jerkins-bot: [V: 04-1] labstore100[67] install params [puppet] - 10https://gerrit.wikimedia.org/r/375079 (https://phabricator.wikimedia.org/T167984) (owner: 10RobH) [21:12:12] haha, stupid typo deletion =P [21:12:44] (03PS3) 10RobH: labstore100[67] install params [puppet] - 10https://gerrit.wikimedia.org/r/375079 (https://phabricator.wikimedia.org/T167984) [21:13:38] (03CR) 10RobH: [C: 032] labstore100[67] install params [puppet] - 10https://gerrit.wikimedia.org/r/375079 (https://phabricator.wikimedia.org/T167984) (owner: 10RobH) [21:20:10] 10Operations, 10cloud-services-team (Kanban): update *.wmflabs.org by 2017-10-16 - https://phabricator.wikimedia.org/T174611#3571544 (10madhuvishy) 05Open>03Resolved a:03madhuvishy This is all done, new private key committed in ops/private. New certs are showing up okay! 
[21:24:09] 10Operations, 10Patch-For-Review: letsencrypt::cert::integrated and non-http servers - https://phabricator.wikimedia.org/T174720#3571294 (10Krenair) acme_tiny - modules/letsencrypt/files/acme_tiny.py in puppet [21:26:35] (03CR) 10Alex Monk: WIP: Add standalone letsencrypt nginx template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/375071 (https://phabricator.wikimedia.org/T174720) (owner: 10Herron) [21:35:14] (03PS2) 10Madhuvishy: Add me back to deployment-prep shinken contacts [puppet] - 10https://gerrit.wikimedia.org/r/374866 (owner: 10Alex Monk) [21:35:35] (03CR) 10Madhuvishy: [C: 032] Add me back to deployment-prep shinken contacts [puppet] - 10https://gerrit.wikimedia.org/r/374866 (owner: 10Alex Monk) [21:47:17] (03PS1) 10Rush: labpuppetmaster: add back the wmcs-roots group [puppet] - 10https://gerrit.wikimedia.org/r/375084 [21:51:35] (03PS1) 10BryanDavis: planet: add Wikimedia Readers blog [puppet] - 10https://gerrit.wikimedia.org/r/375085 [21:52:56] (03PS2) 10Madhuvishy: Add centralnotice tables to maintain-views.yaml [puppet] - 10https://gerrit.wikimedia.org/r/374875 (https://phabricator.wikimedia.org/T135405) (owner: 10Reedy) [21:53:05] (03CR) 10Madhuvishy: [C: 032] Add centralnotice tables to maintain-views.yaml [puppet] - 10https://gerrit.wikimedia.org/r/374875 (https://phabricator.wikimedia.org/T135405) (owner: 10Reedy) [21:59:03] AndyRussG: Hi! If you are around can you confirm T135405 is good to go? :) [21:59:03] T135405: Replicate CentralNotice tables to Labs - https://phabricator.wikimedia.org/T135405 [22:09:27] madhuvishy: hi!! I think Reedy and ejegg beat me to it? Or is there more? thx!!!! [22:09:40] AndyRussG: they did :) thank you! 
[22:22:27] (03PS5) 10Ottomata: [WIP] Initial commit of certpy [software/certpy] - 10https://gerrit.wikimedia.org/r/359960 (https://phabricator.wikimedia.org/T166167) [22:22:27] RECOVERY - Host lvs1007 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [22:33:28] PROBLEM - salt-minion processes on thorium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [22:34:17] PROBLEM - Check systemd state on thorium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:34:29] (03PS1) 10Alex Monk: [WIP] keystone: Create top-level domain for each new project [puppet] - 10https://gerrit.wikimedia.org/r/375089 (https://phabricator.wikimedia.org/T162977) [22:34:37] PROBLEM - pivot on thorium is CRITICAL: connect to address 10.64.53.26 and port 9090: Connection refused [22:34:54] (03CR) 10jerkins-bot: [V: 04-1] [WIP] keystone: Create top-level domain for each new project [puppet] - 10https://gerrit.wikimedia.org/r/375089 (https://phabricator.wikimedia.org/T162977) (owner: 10Alex Monk) [22:36:37] RECOVERY - pivot on thorium is OK: TCP OK - 0.000 second response time on 10.64.53.26 port 9090 [22:36:47] PROBLEM - Disk space on ms-be2023 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdc1 is not accessible: Input/output error [22:38:09] (03PS2) 10Alex Monk: [WIP] keystone: Create top-level domain for each new project [puppet] - 10https://gerrit.wikimedia.org/r/375089 (https://phabricator.wikimedia.org/T162977) [22:39:04] (03CR) 10Mattflaschen: "Followup fix is https://gerrit.wikimedia.org/r/#/c/375051/ ." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/374922 (https://phabricator.wikimedia.org/T69931) (owner: 10Mattflaschen) [22:40:18] RECOVERY - Check systemd state on thorium is OK: OK - running: The system is fully operational [22:40:38] RECOVERY - salt-minion processes on thorium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:41:00] (03PS1) 10RobH: labstore100[67] partman recipe tweak [puppet] - 10https://gerrit.wikimedia.org/r/375090 (https://phabricator.wikimedia.org/T167984) [22:42:01] (03PS2) 10RobH: labstore100[67] partman recipe tweak [puppet] - 10https://gerrit.wikimedia.org/r/375090 (https://phabricator.wikimedia.org/T167984) [22:42:41] (03CR) 10RobH: [C: 032] labstore100[67] partman recipe tweak [puppet] - 10https://gerrit.wikimedia.org/r/375090 (https://phabricator.wikimedia.org/T167984) (owner: 10RobH) [22:45:17] PROBLEM - puppet last run on ms-be2023 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[mountpoint-/srv/swift-storage/sdc1] [22:46:07] PROBLEM - HP RAID on ms-be2023 is CRITICAL: CRITICAL: Slot 3: Failed: 1I:1:5 - OK: 2I:4:1, 2I:4:2, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [22:47:44] (03CR) 10Alex Monk: "It actually conflicts with that change being merged, leaving puppet on deployment-ms-be0[34] broken for a week due to the git conflict mar" [puppet] - 10https://gerrit.wikimedia.org/r/371582 (owner: 10Filippo Giunchedi) [22:51:44] (03PS1) 10RobH: forgotten sda/b/c change for recipe [puppet] - 10https://gerrit.wikimedia.org/r/375092 [22:52:05] (03CR) 10RobH: [C: 032] forgotten sda/b/c change for recipe [puppet] - 10https://gerrit.wikimedia.org/r/375092 (owner: 10RobH) [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170831T2300). Please do the needful. [23:00:32] bd808: Merge https://gerrit.wikimedia.org/r/#/c/375046/? :P [23:01:16] RECOVERY - Disk space on ms-be2023 is OK: DISK OK [23:04:25] (03CR) 10Alex Monk: "Your cherry-pick also included some submodule changes:" [puppet] - 10https://gerrit.wikimedia.org/r/371582 (owner: 10Filippo Giunchedi) [23:26:09] (03PS1) 10RobH: grub install tweak [puppet] - 10https://gerrit.wikimedia.org/r/375093 [23:26:34] (03CR) 10RobH: [C: 032] grub install tweak [puppet] - 10https://gerrit.wikimedia.org/r/375093 (owner: 10RobH) [23:40:02] Niharika: nice. I'll get "soon" [23:40:29] 👍 [23:41:36] Niharika: you wouldn't happen to want to become the maintainer of that bot would you? [23:42:04] bd808: I'd be happy to. :) [23:42:28] cool! 
I'll give you rights and you can do the needful :) [23:42:40] * greg-g awards Niharika wikilove [23:42:40] Tip for future: You should phrase that like "you would want to become the maintainer for that bot, won't you?" :P [23:43:04] :) [23:44:21] Niharika: you are in the maintainers list now. If you are logged into tools-login you'll need to log out and back in for it to notice [23:44:31] Docs at https://wikitech.wikimedia.org/wiki/Tool:Jouncebot [23:44:37] Gotcha. [23:44:57] and you should already have +2 from core on the repo [23:45:09] We need to train it to tell goat jokes at random, to begin with. [23:45:16] * Niharika nods [23:45:19] +1 [23:45:38] the "funny" messages are pretty stale at this point too [23:45:56] Yep. I shall fix that. [23:47:17] (03CR) 10BryanDavis: Avoid pinging deployers unless there are patches to be deployed (031 comment) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/375046 (owner: 10Niharika29) [23:53:40] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3572019 (10RobH) Ok, so these detect with the internal raid1 as sdb, the internal raid10 array as sdc, and the external disk array as sda. d-i grub-i... [23:53:54] "A database query error has occurred. This may indicate a bug in the software" [23:54:13] "Waig1QpAMFcAAL1S3h0AAAAV] 2017-08-31 23:51:45: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"" [23:54:15] Hmm [23:54:29] Bsadowski1: Where? [23:54:59] On Simple Wikipedia (simple.wikipedia.org) under https://simple.wikipedia.org/w/index.php?title=Special:Contributions/newbies [23:55:14] Bsadowski1: Works for me. [23:59:12] Bsadowski1: I looked it up in logstash. looks like it was just a database read delay blip. the query timed out [23:59:29] "Error: 2062 Read timeout is reached"