[01:55:25] operations, Traffic: Wikimedia servers won't serve images - https://phabricator.wikimedia.org/T122453#1905677 (Tgr)
[02:23:50] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 09m 55s)
[02:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:30:47] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Dec 28 02:30:47 UTC 2015 (duration 6m 58s)
[02:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:38:52] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: puppet fail
[05:04:17] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:11:33] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: puppet fail
[06:26:32] PROBLEM - puppet last run on ms-be2016 is CRITICAL: CRITICAL: puppet fail
[06:31:03] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: puppet fail
[06:32:38] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:17] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:18] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:27] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:17] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:37:19] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:39:37] PROBLEM - puppet last run on mw2035 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:52:37] RECOVERY - puppet last run on ms-be2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:55:18] PROBLEM - puppet last run on db1024 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:48] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with
0 failures
[06:56:58] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:57:18] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:57:57] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:58:18] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:48] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:08] RECOVERY - puppet last run on mw2035 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[07:13:06] PROBLEM - puppet last run on mw2060 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:17:10] operations, Performance-Team, Patch-For-Review: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1905775 (Nemo_bis)
[07:17:27] operations, Performance-Team, Wikimedia-General-or-Unknown, Patch-For-Review: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1895546 (Nemo_bis)
[07:20:37] RECOVERY - puppet last run on db1024 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[07:38:07] (CR) Giuseppe Lavagetto: [C: 2] Use system-wide etcd configurations for the etcd driver [software/conftool] - https://gerrit.wikimedia.org/r/256480 (owner: Giuseppe Lavagetto)
[07:38:52] RECOVERY - puppet last run on mw2060 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:39:37] (CR) Giuseppe Lavagetto: [C: 2] Add confctl the ability to find all instances of an entity [software/conftool] - https://gerrit.wikimedia.org/r/258428 (owner: Giuseppe Lavagetto)
[07:40:19] (CR) Giuseppe Lavagetto: [C: 2] Version bump [software/conftool] - https://gerrit.wikimedia.org/r/258981 (owner: Giuseppe
Lavagetto)
[07:41:50] (CR) Giuseppe Lavagetto: [C: 2] Made locking optional as it might slow down syncing significantly [software/conftool] - https://gerrit.wikimedia.org/r/259492 (owner: Giuseppe Lavagetto)
[08:25:51] _joe_: it seems we need someone from ops to use https://phabricator.wikimedia.org/p/admin/ and disable a spammer on phabricator
[08:27:01] <_joe_> Nemo_bis: I guess I don't have the credentials, but let me check
[08:27:07] PROBLEM - puppet last run on mw1053 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:27:10] <_joe_> ops != phab admin
[08:27:41] _joe_: I know but the admins are offline and that admin account mentions ops
[08:28:07] I'll document the result of your investigation in the wiki so that you're not asked again in the future :)
[08:28:26] <_joe_> well, my investigation might very well be wrong :P
[08:28:44] nah
[08:31:56] <_joe_> I can't find any credential for that account anywhere, so I guess it's not an ops-controlled account at all
[08:52:51] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:02:13] _joe_: ok thanks
[09:09:55] operations, Continuous-Integration-Scaling, Nodepool: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#1905868 (Aklapper) a:hashar
[10:30:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 632
[10:38:56] (PS1) Giuseppe Lavagetto: Re-think the alerts instrumentation [debs/pybal] - https://gerrit.wikimedia.org/r/261183
[10:50:16] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 699
[10:55:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 999
[11:00:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind
Master: 952
[11:05:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1251
[11:10:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1552
[11:15:14] RECOVERY - check_mysql on db1008 is OK: Uptime: 585503 Threads: 152 Questions: 23428850 Slow queries: 7000 Opens: 52775 Flush tables: 2 Open tables: 416 Queries per second avg: 40.014 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[12:45:25] (PS1) Jcrespo: Emergency depool of db1050 for cloning to to db1022 [mediawiki-config] - https://gerrit.wikimedia.org/r/261190 (https://phabricator.wikimedia.org/T105879)
[12:51:46] (PS1) Jcrespo: Reconfiguring db1050 after cloning [puppet] - https://gerrit.wikimedia.org/r/261191 (https://phabricator.wikimedia.org/T105879)
[12:57:02] (CR) Jcrespo: [C: -2] "Until depooled and checked." [puppet] - https://gerrit.wikimedia.org/r/261191 (https://phabricator.wikimedia.org/T105879) (owner: Jcrespo)
[13:07:31] PROBLEM - salt-minion processes on bohrium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:22:02] RECOVERY - salt-minion processes on bohrium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:39:52] (PS1) Giuseppe Lavagetto: Add warning for badly configured pools.
[debs/pybal] - https://gerrit.wikimedia.org/r/261193
[14:12:19] (CR) Jcrespo: [C: 2] Emergency depool of db1050 for cloning to to db1022 [mediawiki-config] - https://gerrit.wikimedia.org/r/261190 (https://phabricator.wikimedia.org/T105879) (owner: Jcrespo)
[14:15:03] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Emergency depool of db1050 (duration: 00m 31s)
[14:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:20:49] (PS1) Faidon Liambotis: admin: add faidon's ed25519 key + Yubikey RSA key [puppet] - https://gerrit.wikimedia.org/r/261194
[14:22:03] (CR) Faidon Liambotis: [C: 2] admin: add faidon's ed25519 key + Yubikey RSA key [puppet] - https://gerrit.wikimedia.org/r/261194 (owner: Faidon Liambotis)
[14:22:44] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:33:52] PROBLEM - puppet last run on ms-be1005 is CRITICAL: CRITICAL: Puppet has 1 failures
[14:36:17] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 966199 bytes in 6.296 second response time
[14:37:32] PROBLEM - puppet last run on hooft is CRITICAL: CRITICAL: Puppet has 1 failures
[14:38:19] uhm
[14:38:24] anyone doing anything with ganglia?
[14:38:39] <_joe_> not me
[14:38:50] Dec 28 14:30:35 hooft puppet-agent[31111]: (/Stage[main]/Ganglia::Monitor::Packages/Package[ganglia-monitor]/ensure) change from 3.5.0-wm1 to 3.3.5-2~wmf2 failed: Could not update: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install ganglia-monitor' returned 100: Reading package lists...
[14:38:55] what the f*
[14:42:34] ok fixed
[14:51:20] (PS1) ArielGlenn: on salt master increase network settings for packet backlog and memory [puppet] - https://gerrit.wikimedia.org/r/261195
[14:53:09] (CR) ArielGlenn: [C: 2] on salt master increase network settings for packet backlog and memory [puppet] - https://gerrit.wikimedia.org/r/261195 (owner: ArielGlenn)
[14:55:11] !log cloning db1050's mysql data to db1022
[14:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:57:12] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 17.39% of data above the critical threshold [100000000.0]
[14:59:11] RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:59:59] operations, Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1906145 (ArielGlenn) Upped two more network settings, the packet queue length and some memory limits: https://gerrit.wikimedia.org/r/#/c/261195/ We now consistently see returns from salt '*'...
[15:03:16] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[15:16:31] (PS1) Cmjohnson: Adding mgmt dns for pc1004-pc1006 bug: task# T121888 [dns] - https://gerrit.wikimedia.org/r/261197
[15:19:19] (CR) Cmjohnson: [C: 2] Adding mgmt dns for pc1004-pc1006 bug: task# T121888 [dns] - https://gerrit.wikimedia.org/r/261197 (owner: Cmjohnson)
[15:24:16] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[15:27:26] PROBLEM - puppet last run on lvs4002 is CRITICAL: CRITICAL: puppet fail
[15:27:28] Coren, hi
[15:34:21] o/
[15:34:28] Hola, Krenair
[15:47:11] Coren, so when do we want to do that wiki creation?
[15:47:37] I suppose I could do it in the SWAT window as no one else has anything
[15:47:39] Krenair: Whenever the next window is, I suppose.
[15:47:49] Heh. GMTA
[15:54:47] RECOVERY - puppet last run on lvs4002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151228T1600).
[16:00:04] Krenair: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[16:00:20] ok
[16:01:04] new wiki?
[16:01:28] ugh, wtf? there's an issue
[16:01:31] yes jynus
[16:01:50] :-), my real question is, when?
[16:02:02] I'm trying to do it now jynus
[16:02:07] thanks
[16:09:16] a bunch of wikidata stuff is broken
[16:10:15] (CR) Alex Monk: [C: 2] Add wikimania2017.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/260521 (https://phabricator.wikimedia.org/T122062) (owner: Dzahn)
[16:11:01] (Merged) jenkins-bot: Add wikimania2017.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/260521 (https://phabricator.wikimedia.org/T122062) (owner: Dzahn)
[16:11:54] hm, need wikiversions too
[16:12:07] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/260521/ (duration: 00m 30s)
[16:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:12:42] !log krenair@tin Synchronized w/static/images/project-logos/wikimania2017wiki.png: (no message) (duration: 00m 31s)
[16:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:13:33] !log krenair@tin rebuilt wikiversions.php and synchronized wikiversions files: (no message)
[16:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:14:08] !log krenair@tin
Synchronized dblists: (no message) (duration: 00m 29s)
[16:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:14:26] I will apply the filters now that I can see it
[16:14:58] some database table creation did not work
[16:15:55] same issue as last time? I thought it was fixed
[16:16:20] oh...I missed a step
[16:17:21] do not worry, the process is not precisely streamlined, you are already a saint in my eyes for doing it
[16:17:37] okay, wiki is up
[16:18:01] had to do extensions/Translate/sql/revtag.sql
[16:18:07] revtag, was the issue... I see
[16:18:25] I see it in the logs, did not know if it was that
[16:18:36] thank you, Krenair!
[16:18:42] there's still stuff to clean up
[16:18:49] wikidata broke some things, like https://wikimania2017.wikimedia.org/wiki/MediaWiki:Sitesupport-url is corrupted
[16:19:25] I do not even know what that is!
[16:20:22] ah, it is the bar link, isn't it?
[16:21:19] let me first handle the db things, as that is something I need to do myself
[16:22:24] Krenair: How do you normally handle the original 'crats? The only wiki I created in the past I handled that with my staff account but it feels a bit funny for me to use mine to give bits to my volunteer account.
[16:23:33] seeing as this is a normal public wiki, I'd ask the stewards
[16:23:51] * Coren blushes.
[16:24:00] I... forgot about the stewards. Shame on me. :-)
[16:24:00] yes, please don't use your staff account to grant your volunteer account rights
[16:24:43] Krenair: Clearly. I meant "funny" in the "not right" sense not in the "ha-ha" sense. :-)
[16:24:54] !log krenair@tin Synchronized wmf-config/interwiki.cdb: Updating interwiki cache (duration: 00m 30s)
[16:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:25:02] Krenair: Also, last time was a fishbowl which is a different kettle of fisk.
[16:25:08] Yes
[16:25:19] Coren, I remember 2 tickets with new columns available on labs, can you remember which those were
[16:25:30] I want to check those are allowed
[16:25:46] Because the first account creation on those fishbowl wikis has to be by createAndPromote.php, IIRC
[16:26:20] jynus: Not offhand, but I should be able to find them quickly. Gimme a sec.
[16:26:38] I found one, the one open: https://phabricator.wikimedia.org/T59617
[16:26:57] there was one recently (maybe some months ago) closed
[16:27:13] jynus: https://phabricator.wikimedia.org/T60196 is the one you're thinking of
[16:27:56] yes, the child of that to be exact, https://phabricator.wikimedia.org/T92841 Thank you very much!
[16:31:53] unsurprisingly, the closed task is done and the open one is pending :-)
[16:33:58] https://wikimania2017.wikimedia.org/wiki/MediaWiki:Sitesupport-url looks broken
[16:34:37] see above ;p
[16:34:55] they prob already know ^^
[16:35:16] and that is a non-trivial change due to replication filters, so I will have to delay it, I will apply the current filters for now
[16:35:32] Yes, I know Glaisher
[16:35:35] I am busy fixing it
[16:35:59] oh ok
[16:37:57] I filed tickets against wikidata
[16:38:57] !log applying production-side replication filters for wikimania2017wiki on labs
[16:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:40:31] operations, Traffic: Varnish leaks memory - https://phabricator.wikimedia.org/T122455#1906280 (BBlack) I peeked around at this as well this morning. Surveying all the cache_upload machines, some of them have notably more memory consumed by the varnish frontend than others, and it seems to be related to o...
[16:48:12] Nemo_bis: I'm on the clock atm, so I don't really have time to work on the wiki but if you want to give a hand, I'll gladly accept.
:-)
[16:49:10] I'm not accepting new commitments for the next few months
[16:49:30] Nemo_bis: :-)
[16:50:28] operations, Wikimedia-Site-Requests, Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1906287 (jcrespo) Production-side filters of private info added and manually checked, this can be added to labs at any time.
[16:58:54] Okay
[16:58:58] So I reinserted the text entry
[16:59:28] (the actual blob on ES cluster24)
[16:59:36] and updated the text table entry to point towards it
[16:59:52] after purging cache, the revision seems to have been fixed: https://wikimania2017.wikimedia.org/w/index.php?title=MediaWiki:Sitesupport-url&oldid=2
[16:59:59] Ops-Access-Requests, operations, Analytics-Backlog: add mforns, milimetric, nuria,ottomata, madhuvishy and joal to piwik-roots - https://phabricator.wikimedia.org/T122325#1906314 (Nuria) Excellent, that should be it for the time coming.
[17:00:18] Glaisher, ^
[17:01:25] Krenair: Do you know why it was broken in the first place?
[17:01:27] I don't remember this happening for previous creations.
[17:01:32] yes
[17:01:36] page edit failed
[17:01:41] because wikidata required a new table
[17:01:51] That someone forgot to add to addWiki, I guess
[17:02:09] ah
[17:03:13] wait.. it still doesn't look right https://wikimania2017.wikimedia.org/wiki/MediaWiki:Sitesupport-url
[17:03:45] Is it pointing to the right revision?
[17:04:01] (PS1) Faidon Liambotis: varnish: (temporarily?) disable TBF [puppet] - https://gerrit.wikimedia.org/r/261204
[17:04:35] hmm, purging fixed it
[17:04:40] alright then. thanks
[17:04:45] (CR) Faidon Liambotis: [C: 2 V: 2] varnish: (temporarily?)
disable TBF [puppet] - https://gerrit.wikimedia.org/r/261204 (owner: Faidon Liambotis)
[17:05:33] (I got logged out for some reason)
[17:05:59] (PS1) Alex Monk: Commit wikiversions.json and interwiki.cdb changes from wikimania2017wiki creation [mediawiki-config] - https://gerrit.wikimedia.org/r/261205 (https://phabricator.wikimedia.org/T122062)
[17:06:17] (CR) Alex Monk: [C: 2] Commit wikiversions.json and interwiki.cdb changes from wikimania2017wiki creation [mediawiki-config] - https://gerrit.wikimedia.org/r/261205 (https://phabricator.wikimedia.org/T122062) (owner: Alex Monk)
[17:06:37] also need to do restbase addition, probably some parsoid update too
[17:06:49] (Merged) jenkins-bot: Commit wikiversions.json and interwiki.cdb changes from wikimania2017wiki creation [mediawiki-config] - https://gerrit.wikimedia.org/r/261205 (https://phabricator.wikimedia.org/T122062) (owner: Alex Monk)
[17:11:19] (PS1) Alex Monk: Add RESTBase and labs DNS config for wikimania2017wiki [puppet] - https://gerrit.wikimedia.org/r/261206 (https://phabricator.wikimedia.org/T122062)
[17:15:11] PROBLEM - check_mysql on lutetium is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null)
[17:16:07] !log disabled varnish TBF and force-ran puppet on all cp* hosts (I12ea52165e125aaf4ed779399f34cff16d5cd140)
[17:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:17:32] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [100000000.0]
[17:20:09] Changes newer than 26 seconds may not be shown in this list. replag?
[17:20:11] PROBLEM - check_mysql on lutetium is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null)
[17:25:11] PROBLEM - check_mysql on lutetium is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null)
[17:26:21] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/).
[17:27:03] ACKNOWLEDGEMENT - check_mysql on lutetium is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) Jeff_Green dealing
[17:39:38] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[17:42:24] operations, Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: schedule a daily run of ContentTranslation analytics scripts on terbium - https://phabricator.wikimedia.org/T122479#1906420 (Nuria) Amire80: Are you aware of the report updater code running on 1002, you...
[17:55:25] Ops-Access-Requests, operations, Analytics-Backlog: add mforns, milimetric, nuria,ottomata, madhuvishy and joal to piwik-roots - https://phabricator.wikimedia.org/T122325#1906442 (jcrespo) This has been commented on the weekly meeting, and while there was positivity about providing higher privileges fo...
[17:56:02] operations, Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: access for amire80 to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T122524#1906445 (Amire80) NEW
[17:57:49] operations, Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: access for amire80 to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T122524#1906454 (Krenair)
[17:59:47] (PS2) Jcrespo: Reconfiguring db1050 after cloning [puppet] - https://gerrit.wikimedia.org/r/261191 (https://phabricator.wikimedia.org/T105879)
[18:00:07] RECOVERY - check_mysql on lutetium is OK: Uptime: 966046 Threads: 3 Questions: 44160530 Slow queries: 6785 Opens: 23841 Flush tables: 2 Open tables: 64 Queries per second avg: 45.712 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[18:01:59] operations, Analytics-Backlog, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: schedule a daily run of ContentTranslation analytics scripts on terbium - https://phabricator.wikimedia.org/T122479#1906472 (Nuria)
[18:02:46] operations, Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: access for amire80 to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T122524#1906474 (Nuria) Some example of the types of queries you need to run will be great (no need of those to file for acces...
[18:03:43] operations, Analytics-Backlog, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: schedule a daily run of ContentTranslation analytics scripts on terbium - https://phabricator.wikimedia.org/T122479#1906476 (jcrespo) Do this query mysql?
[18:09:35] operations, Analytics-Backlog, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: schedule a daily run of ContentTranslation analytics scripts on terbium - https://phabricator.wikimedia.org/T122479#1906497 (Nuria) An example of the queries you are running will be most helpful.
[18:14:40] (CR) Jcrespo: [C: 2] Reconfiguring db1050 after cloning [puppet] - https://gerrit.wikimedia.org/r/261191 (https://phabricator.wikimedia.org/T105879) (owner: Jcrespo)
[18:21:37] operations, Analytics-Backlog, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: schedule a daily run of ContentTranslation analytics scripts on terbium - https://phabricator.wikimedia.org/T122479#1906530 (jcrespo) The idea is that on production, only *mediawiki code* runs- it...
[18:28:33] !log restarting and upgrading db1050, using the fact that it is depooled
[18:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:30:19] operations, Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: access for amire80 to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T122524#1906540 (Amire80) Most of these will be for the wikishared database. When I run them on terbium, they take a few minu...
[18:31:00] operations, Analytics-Backlog, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: schedule a daily run of ContentTranslation analytics scripts on terbium - https://phabricator.wikimedia.org/T122479#1906542 (Amire80) Most of these will be for the wikishared database. When I run t...
[18:37:15] Ops-Access-Requests, operations, Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: access for amire80 to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T122524#1906557 (Dzahn)
[18:44:44] "localname": "Low Saxon"
[18:44:50] "localname": "تۆرکجه"
[18:45:02] Krenair: ^ from sitematrix, there must be something wrong
[18:45:48] the localname is either in the language itself or in English, it can't be both
[18:45:52] mutante, ?
[18:46:02] ah
[18:46:04] what is "localname" really supposed to be?
[18:46:12] !log importing wikishared from x1-master into analytics-slave and setting up replication
[18:46:14] the name of the language in the language itself?
[18:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:46:36] then it's not "Low Saxon" for nds
[18:47:05] i happened to look at https://gerrit.wikimedia.org/r/#/c/261207/1/lib/config/sitematrix.json for context
[18:47:26] actually I think it's in the language of the viewer
[18:48:07] hmm
[18:48:30] https://meta.wikimedia.org/w/api.php?action=sitematrix&uselang=en vs.
[18:48:34] https://meta.wikimedia.org/w/api.php?action=sitematrix&uselang=de
[18:50:34] hmm, so if the assumed viewer for sitematrix is English
[18:50:50] then this change was wrong:
[18:50:53] "localname": "تۆرکجه"
[18:50:58] ?
[18:51:17] or it needs a local name for every single language?
[18:51:31] From "South Azerbaijani" to that, yeah
[18:52:00] someone should file a bug for i18n people to look into
[18:53:38] right, yea
[18:55:37] (CR) Dzahn: [C: 1] Add RESTBase and labs DNS config for wikimania2017wiki [puppet] - https://gerrit.wikimedia.org/r/261206 (https://phabricator.wikimedia.org/T122062) (owner: Alex Monk)
[18:56:55] gwicke: can we merge restbase config change for that anytime (now)?
[18:57:07] (PS1) Jcrespo: Add wikishared to the list of replicated dbs on dbstore1002 [puppet] - https://gerrit.wikimedia.org/r/261214
[18:59:26] (CR) Jcrespo: [C: 2] Add wikishared to the list of replicated dbs on dbstore1002 [puppet] - https://gerrit.wikimedia.org/r/261214 (owner: Jcrespo)
[19:03:04] (CR) Dzahn: [C: -1] Add mgmt DNS entries for pc200[4-6] Add production DNS entries for 200[4-6] Bug:T121879 (3 comments) [dns] - https://gerrit.wikimedia.org/r/260942 (https://phabricator.wikimedia.org/T121879) (owner: Papaul)
[19:03:17] (PS6) Ori.livneh: Puppet compiler for Tim's redirects.dat DSL [puppet] - https://gerrit.wikimedia.org/r/138292
[19:04:26] (CR) jenkins-bot: [V: -1] Puppet compiler for Tim's redirects.dat DSL [puppet] - https://gerrit.wikimedia.org/r/138292 (owner: Ori.livneh)
[19:05:21] operations, Analytics-Backlog, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: schedule a daily run of ContentTranslation analytics scripts on terbium - https://phabricator.wikimedia.org/T122479#1906626 (jcrespo) I've just added wikishared to the analytics slave as a replicat...
[19:05:57] operations, Analytics-Backlog, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: schedule a daily run of ContentTranslation analytics scripts on terbium - https://phabricator.wikimedia.org/T122479#1906627 (jcrespo)
[19:05:57] (PS2) Dzahn: puppetmaster: add facts export script [puppet] - https://gerrit.wikimedia.org/r/260910 (owner: Filippo Giunchedi)
[19:06:02] Ops-Access-Requests, operations, Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: access for amire80 to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T122524#1906629 (jcrespo)
[19:06:17] (CR) Dzahn: [C: 2] "used it already" [puppet] - https://gerrit.wikimedia.org/r/260910 (owner: Filippo Giunchedi)
[19:17:07] (PS7) Ori.livneh: Puppet compiler for Tim's redirects.dat DSL [puppet] - https://gerrit.wikimedia.org/r/138292
[19:17:26] (PS2) Papaul: Add mgmt DNS entries for pc200[4-6] Add production DNS entries for 200[4-6] Bug:T121879 [dns] - https://gerrit.wikimedia.org/r/260942 (https://phabricator.wikimedia.org/T121879)
[19:17:57] (CR) Ori.livneh: "Rubocop was introduced at some point, hence the -1. Updated to satisfy Rubocop."
[puppet] - https://gerrit.wikimedia.org/r/138292 (owner: Ori.livneh)
[19:21:29] (PS3) Papaul: Add mgmt DNS entries for pc200[4-6] Add production DNS entries for 200[4-6] Bug:T121879 [dns] - https://gerrit.wikimedia.org/r/260942 (https://phabricator.wikimedia.org/T121879)
[19:26:31] (CR) GWicke: [C: 1] Add RESTBase and labs DNS config for wikimania2017wiki [puppet] - https://gerrit.wikimedia.org/r/261206 (https://phabricator.wikimedia.org/T122062) (owner: Alex Monk)
[19:27:00] mutante, Krenair: to apply this, we'll have to do a rolling restart after puppet has run
[19:27:33] Ops-Access-Requests, operations, Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: access for amire80 to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T122524#1906665 (jcrespo) There are 3 things that you need to get access: - a signed NDA, that you...
[19:29:13] gwicke: ok, i can merge it now
[19:29:25] Ops-Access-Requests, operations, Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: access for amire80 to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T122524#1906673 (Nuria) I think Amire80 might already have access to 1002, i seem to remember him hav...
[19:32:36] Ops-Access-Requests, operations, Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: access for amire80 to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T122524#1906680 (jcrespo) @Nuria, as far as I can see, unless I am misinterpreting the permissions,...
[19:32:59] (PS2) Dzahn: Add RESTBase and labs DNS config for wikimania2017wiki [puppet] - https://gerrit.wikimedia.org/r/261206 (https://phabricator.wikimedia.org/T122062) (owner: Alex Monk)
[19:33:36] (CR) Dzahn: [C: 2] Add RESTBase and labs DNS config for wikimania2017wiki [puppet] - https://gerrit.wikimedia.org/r/261206 (https://phabricator.wikimedia.org/T122062) (owner: Alex Monk)
[19:37:08] Ops-Access-Requests, operations, Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: access for amire80 to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T122524#1906686 (Amire80) Adding @Arrbee (Runa) as supervisor. I only remember running stuff on ter...
[19:40:10] (PS1) Jcrespo: Add amire80 to statistics-users for quering mysql analytics-slave [puppet] - https://gerrit.wikimedia.org/r/261217 (https://phabricator.wikimedia.org/T122524)
[19:40:45] (CR) Jcrespo: [C: -2] "Do not merge until the access request on the ticket is approved." [puppet] - https://gerrit.wikimedia.org/r/261217 (https://phabricator.wikimedia.org/T122524) (owner: Jcrespo)
[19:42:26] mutante: I'll restart after lunch
[19:43:50] Ops-Access-Requests, operations, Analytics, ContentTranslation-Analytics, and 2 others: access for amire80 to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T122524#1906697 (jcrespo) Aside from that patch, assuming it is granted I *may* have to provide extra mysql grants to the stats user.
[19:47:46] gwicke: no rush, thanks
[19:54:30] operations, Wikimedia-Site-Requests, Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1906728 (Krenair) Open>Resolved Steward @Barras dealt with the bureaucrat rights - https://meta.wikimedia.org/w/index.php?title=Steward_requests/Permissions&oldid=15197...
[19:56:58] operations, Wikimedia-Site-Requests, Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1906734 (Krenair) Also, created {T122512}, {T122513}, and {T122520}
[20:00:22] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: puppet fail
[20:07:23] (PS1) Yuvipanda: labstore: Increase wikitech timeout by a bit [puppet] - https://gerrit.wikimedia.org/r/261266
[20:07:32] chasemp: ^
[20:08:59] (CR) Yuvipanda: [C: 2 V: 2] labstore: Increase wikitech timeout by a bit [puppet] - https://gerrit.wikimedia.org/r/261266 (owner: Yuvipanda)
[20:09:04] (CR) Rush: "seems to happen on large projects (or so we suspect)" [puppet] - https://gerrit.wikimedia.org/r/261266 (owner: Yuvipanda)
[20:25:02] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 34.62% of data above the critical threshold [100000000.0]
[20:28:21] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:28:43] ^seems like dumps-1.dumps.eqiad.wmflabs is dropping off for labstore1003 a lot of stuff unknown why to me, but I don't think it's generally errant atm
[20:35:00] !log yurik@tin Synchronized php-1.27.0-wmf.9/extensions/Graph/modules/graph2.js: https://gerrit.wikimedia.org/r/#/c/261200/ (duration: 00m 31s)
[20:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:36:44] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[20:49:04] PROBLEM - salt-minion processes on cygnus is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[20:50:53] Dec 28 20:44:28 cygnus salt-minion[5904]: [ERROR ] Error while bringing up minion for multi-master. Is master at neodymium.eqiad.wmnet responding?
[20:51:06] Dec 28 20:44:28 cygnus salt-minion[5904]: [ERROR ] Error while bringing up minion for multi-master. All configured masters [palladium.eqiad.wmnet, neodym...ponding!!!
[20:51:11] they are ... ponding
[20:52:24] !log cygnus - starting salt-minion
[20:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:52:34] pond as a verb
[20:53:06] i guess the full message was "..not responding" :p
[20:53:14] RECOVERY - salt-minion processes on cygnus is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[20:53:30] odd though how this happens randomly on this one machine
[20:53:33] (which is a VM)
[20:55:31] (PS1) Jcrespo: Repool db1050 & db1022 after emergency maintenance with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/261271 (https://phabricator.wikimedia.org/T105879)
[20:56:58] (CR) Jcrespo: [C: 2] Repool db1050 & db1022 after emergency maintenance with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/261271 (https://phabricator.wikimedia.org/T105879) (owner: Jcrespo)
[20:57:05] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[20:59:45] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1050 & db1022 after emergency fix (duration: 00m 31s)
[20:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:12:21] PROBLEM - salt-minion processes on tin is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[21:20:28] !log aaron@tin Synchronized wmf-config/PrivateSettings.php: $wmfSwiftConfig convenience variable (duration: 00m 30s)
[21:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:22:24] Don't you need to sync-dir private, AaronSchulz?
[21:22:33] PrivateSettings is just a symlink..
[21:22:51] PROBLEM - salt-minion processes on mira is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[21:24:24] so.. more salt minions keep dying
[21:24:45] one random VM that does nothing
[21:24:55] and then the 2 deployment hosts right when something gets synced.. hmmm
[21:25:10] !log aaron@tin Synchronized private/PrivateSettings.php: (no message) (duration: 00m 30s)
[21:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:25:34] tin:~# service salt-minion status
[21:25:34] salt-minion stop/waiting
[21:26:25] !log tin & mira: started salt minions that were in status stop/waiting
[21:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:27:00] RECOVERY - salt-minion processes on tin is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[21:27:02] RECOVERY - salt-minion processes on mira is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[21:32:05] (PS5) Aaron Schulz: filebackend: add configuration for codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/197499 (https://phabricator.wikimedia.org/T91754) (owner: Giuseppe Lavagetto)
[21:34:54] (PS1) Yuvipanda: tools: Do not need NFS for redis [puppet] - https://gerrit.wikimedia.org/r/261277
[21:36:26] (CR) Yuvipanda: [C: 2] tools: Do not need NFS for redis [puppet] - https://gerrit.wikimedia.org/r/261277 (owner: Yuvipanda)
[21:40:12] (PS4) Yuvipanda: toollabs: migrate to redis::instance [puppet] - https://gerrit.wikimedia.org/r/257534 (owner: Ori.livneh)
[21:40:59] operations, Deployment-Systems: salt-minion processes terminate on deployment sync - https://phabricator.wikimedia.org/T122544#1907185 (Dzahn) NEW
[21:41:46] (CR) Yuvipanda: [C: 2] "Rebased to take active_redis from hiera. I've also disabled puppet on the old redises, and am going to apply this on new redises (Which ar" [puppet] - https://gerrit.wikimedia.org/r/257534 (owner: Ori.livneh)
[21:44:58] operations, Deployment-Systems: salt-minion processes terminate on deployment sync - https://phabricator.wikimedia.org/T122544#1907205 (Dzahn)
[21:45:44] !log restbase: rolling restart to apply https://gerrit.wikimedia.org/r/261206
[21:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:47:18] (PS1) Yuvipanda: tools: Truly liberate Redis from NFS [puppet] - https://gerrit.wikimedia.org/r/261278
[21:48:45] (CR) Yuvipanda: [C: 2] tools: Truly liberate Redis from NFS [puppet] - https://gerrit.wikimedia.org/r/261278 (owner: Yuvipanda)
[21:52:55] RECOVERY - Restbase root url on restbase1004 is OK: HTTP OK: HTTP/1.1 200 - 15214 bytes in 0.008 second response time
[21:54:45] RECOVERY - restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[21:59:43] Krenair: https://wikimania2017.wikimedia.org/api/rest_v1/?doc
[22:00:44] gwicke, cool, now we need to get parsoid working there
[22:00:57] which I think means a backport of the SiteMatrix update commit I made earlier
[22:01:18] ori: around? with your patch merged I can't get redis to start and there are no logs
[22:01:20] * YuviPanda digs in
[22:01:57] Krenair: yeah, I forget if Parsoid fetches that from the API, or uses a private copy of the sitematrix response
[22:02:24] it uses a static copy of the sitematrix response
[22:03:01] gwicke: Krenair , oh by chance i just got told that
[22:03:02] YuviPanda: hey
[22:03:19] hey
[22:03:29] i reported https://phabricator.wikimedia.org/T122527 and ssastry told me:
[22:03:33] YuviPanda: what's the current status?
[22:03:43] "We pull it into the parsoid codebase (effectively a cached copy) via https://github.com/wikimedia/parsoid/blob/master/tools/fetch-sitematrix.js .. Can you redirect your questions to enwiki sitematrix updates"
[22:03:52] ori: so I started up a new jessie instance and applied the redis role, and redis fails to start without anything in the logs
[22:03:56] that's current status
[22:03:59] which host?
[22:04:04] ori: tools-redis-1001
[22:04:22] there isn't any user facing impact since I haven't fully switched over yet
[22:05:12] YuviPanda: mind if i take a look?
[22:05:15] so "fetch-sitematrix.js" needs to run i suppos
[22:05:16] e
[22:05:27] ori: sure, feel free to!
[22:05:31] ori: I was hoping you would, even :)
[22:05:53] mutante: a task for the parsoid folks, I think
[22:06:04] can't ssh in
[22:06:26] ori: try root@
[22:06:38] ori: I'll add you to the group that can ssh in anywhere too, just a sec
[22:06:47] thanks
[22:07:31] ori: done
[22:10:06] (CR) EBernhardson: [C: 1] "willing to merge and deploy, but i still think if we don't know what is going to be good we should test in labs on the estest100{1,2,3,4}." [mediawiki-config] - https://gerrit.wikimedia.org/r/257607 (https://phabricator.wikimedia.org/T110648) (owner: DCausse)
[22:12:21] Coren, do you want to use VE on wikimania2017wiki before the next parsoid deployment?
[22:14:17] (PS1) Ori.livneh: Fix-up for Ica80cdbb4e [puppet] - https://gerrit.wikimedia.org/r/261283
[22:14:23] YuviPanda: ^
[22:15:29] (CR) Ori.livneh: [C: 2 V: 2] Fix-up for Ica80cdbb4e [puppet] - https://gerrit.wikimedia.org/r/261283 (owner: Ori.livneh)
[22:17:28] ori: haha ok!
[22:18:15] YuviPanda: re-ran puppet, works now
[22:18:34] there is still a puppet failure, but it is unrelated (Error: /usr/local/sbin/grain-ensure set trebuchet_master tools-deploy.eqiad.wmflabs returned 1 instead of one of [0])
[22:19:02] lol wat
[22:19:40] ori: I suppose that's related to what mutante is investigating
[22:20:07] ori: hmm ran successfully now
[22:23:48] Coren, also OCG, Flow and CX
[22:26:32] ori: can you cd off /home so I can unmount it? :D
[22:27:00] operations, Parsoid, Wikimedia-Site-Requests: please run fetch-sitematrix update - https://phabricator.wikimedia.org/T122548#1907298 (Dzahn) NEW a:Krenair
[22:28:02] operations, Parsoid, Wikimedia-Site-Requests: please run fetch-sitematrix update - https://phabricator.wikimedia.org/T122548#1907313 (Krenair) I did run it and made https://gerrit.wikimedia.org/r/#/c/261207/ - we just need that commit deployed to wikimedia production now AFAIK?
[22:30:17] operations, Parsoid, Wikimedia-Site-Requests: please run fetch-sitematrix update - https://phabricator.wikimedia.org/T122548#1907326 (ssastry) That is correct. Given that it is part of the main parsoid repo, we'll have to do a cherry-pick deploy .. which I can do .. but I was asking if this can wait t...
[22:32:02] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0]
[22:35:41] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[22:36:58] ori: hmm, it fails on new instance too
[22:37:12] on tools-redis-1002
[22:38:28] #wikipedia-en-help
[22:39:06] Well, on Christmas Eve, we ate dinner at my house. Christmas Day, we ate dinner at my great aunt's and great uncle's house. We also ate at my great aunt's and great uncle's house on Thanksgiving Day. This Thanksgiving we didn't because they were in Texas. Last Christmas we didn't because they were in California. This Thanksgiving and Last Christmas Day, we ate at my house. Last Christmas Eve we went out for dinner.
[22:41:48] #wikipedia-en-abuse
[22:42:20] #wikipedia-en-afc
[22:44:16] hmm, -abuse
[22:44:18] reminds me of https://youtu.be/kQFKtI6gn9Y?t=41
[22:44:31] sorry
[22:44:40] i was using this channel to enter others
[22:45:08] operations, Traffic: Wikimedia servers won't serve images - https://phabricator.wikimedia.org/T122453#1907350 (Aklapper) Two days later: Does this problem still happen?
[22:45:10] you shouldn't need to do that emplty654
[22:45:22] ok
[22:49:11] ori: where did you find logs the last time?
[22:50:47] YuviPanda this channel has stopped keeping logs
[22:51:32] nah, I'm talking about redis logs emplty654 :)
[22:56:11] PROBLEM - DPKG on multatuli is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[22:57:40] ori: interesting
[22:57:42] operations, Parsoid, Wikimedia-Site-Requests: please run fetch-sitematrix update - https://phabricator.wikimedia.org/T122548#1907384 (Krenair) I think that depends on whether they want VE, OCG (PDF rendering), etc. (I think CX and Flow also depend on it, but I'm not sure they get anything enabled by d...
[22:57:42] ori: >>> 'slaveof tools-redis-1001.tools.eqiad.wmflabs'
[23:00:56] ori: where did you find logs the last time?
[23:01:23] 'journalctl -xn' showed Dec 28 22:09:23 tools-redis-1001 redis-server[14120]: Bad directive or wrong number of arguments
[23:01:32] hmm I looked at that and didn't spot it
[23:01:35] * YuviPanda gives self vial of patience
[23:01:38] 'systemctl status redis-instance-tcp_6379.service' did too
[23:01:41] I've a patch for the current issue coming up
[23:01:43] Dec 28 22:09:23 tools-redis-1001 redis-server[14120]: Bad directive or wrong number of arguments
[23:01:51] I saw the 'refusing to start because too many times'
[23:02:06] so then i just ran the command in the unit file's exec directive
[23:02:07] /usr/bin/redis-server /etc/redis/tcp_6379.conf
[23:02:13] and got:
[23:02:14] *** FATAL CONFIG FILE ERROR ***
[23:02:14] Reading the configuration file, at line 30
[23:02:15] >>> 'require'
[23:02:15] ori: (string) 'require'
[23:02:17] Bad directive or wrong number of arguments
[23:02:20] heh
[23:02:22] (PS1) Yuvipanda: tools: Fix redis slaveof to mention port number [puppet] - https://gerrit.wikimedia.org/r/261286
[23:02:26] yeah that's what I ended up doing
[23:02:50] (CR) Ori.livneh: [C: 1] tools: Fix redis slaveof to mention port number [puppet] - https://gerrit.wikimedia.org/r/261286 (owner: Yuvipanda)
[23:03:18] (CR) Yuvipanda: [C: 2] tools: Fix redis slaveof to mention port number [puppet] - https://gerrit.wikimedia.org/r/261286 (owner: Yuvipanda)
[23:06:05] (PS1) EBernhardson: Adjust cirrus titlesuggest index shard counts [mediawiki-config] - https://gerrit.wikimedia.org/r/261287
[23:06:05] RECOVERY - DPKG on multatuli is OK: All packages OK
[23:07:44] (Draft2) Ori.livneh: remove redis::legacy [puppet] - https://gerrit.wikimedia.org/r/257548
[23:09:39] (PS1) Yuvipanda: tools: Redises should bind to all addresses [puppet] - https://gerrit.wikimedia.org/r/261289
[23:10:29] (CR) Yuvipanda: [C: 2 V: 2] tools: Redises should bind to all addresses [puppet] - https://gerrit.wikimedia.org/r/261289 (owner: Yuvipanda)
[23:10:43] (PS3) Ori.livneh: remove redis::legacy [puppet] - https://gerrit.wikimedia.org/r/257548
[23:11:05] YuviPanda: is it ok to remove redis::legacy now, or would you like to keep it around in case you need to roll back the change?
[23:11:45] ori: I haven't switched the primaries yet, so maybe another 10-15mins?
[23:11:52] oh yeah sure
[23:12:06] yeah then we can kill it
[23:14:57] (PS1) Yuvipanda: tools: Switch what tools-redis hostname refers to [puppet] - https://gerrit.wikimedia.org/r/261291
[23:24:50] (PS1) Yuvipanda: toollabs: Stop using aof for redis [puppet] - https://gerrit.wikimedia.org/r/261292
[23:25:37] (PS2) Yuvipanda: toollabs: Stop using aof for redis [puppet] - https://gerrit.wikimedia.org/r/261292
[23:25:39] (PS2) Yuvipanda: tools: Switch what tools-redis hostname refers to [puppet] - https://gerrit.wikimedia.org/r/261291
[23:25:58] (CR) Yuvipanda: [C: 2 V: 2] toollabs: Stop using aof for redis [puppet] - https://gerrit.wikimedia.org/r/261292 (owner: Yuvipanda)
[23:26:48] ori: btw, just curious, what's the failover process for rdbs?
[23:27:23] YuviPanda: what do you mean?
[23:27:44] ori: like, if rdb1001 has a hardware failure
[23:27:46] and dies
[23:27:48] what happens?
[23:28:35] it gets ejected from the hash ring by twemproxy, so the keys that were previously mapped to it are now mapped to the remaining servers
[23:29:28] ah, I see. what happens to the things that were in the queue there at that point?
[23:29:31] if it goes down hard then the users whose session was on that server get logged out
[23:29:44] errr that's with mc*
[23:30:03] the rdbs all have a slave
[23:30:11] *each
[23:30:29] right, so how does that slave become master?
[23:30:38] redis.php in mediawiki-config has to be edited
[23:30:44] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[23:30:56] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[23:31:02] the config lines for replacing the master for each group are there but commented out
[23:31:25] ah
[23:31:29] is that in puppet or mw-config?
[23:31:32] puppet
[23:31:33] puppet I suppose?
[23:31:36] errr
[23:31:39] mw-config
[23:31:46] but puppet has to change too
[23:31:49] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[23:32:12] ok, it is apparent that my attempt to answer your question while also doing something else isn't working
[23:32:17] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[23:32:27] how important is it to you to know? if very, i'll stop what i'm doing and walk you through the process
[23:33:08] /hieradata/eqiad/mediawiki/jobrunner.yaml has to change and redis.php in mediawiki-config.git has to change
[23:33:37] ori: np, not important atm :)
[23:41:30] (CR) Yuvipanda: [C: 2] tools: Switch what tools-redis hostname refers to [puppet] - https://gerrit.wikimedia.org/r/261291 (owner: Yuvipanda)
[23:44:33] (PS1) Yuvipanda: tools: Fix typo [puppet] - https://gerrit.wikimedia.org/r/261295
[23:44:48] (PS1) EBernhardson: [test only] Stricter avro schema tests [mediawiki-config] - https://gerrit.wikimedia.org/r/261296
[23:44:50] (CR) Yuvipanda: [C: 2 V: 2] tools: Fix typo [puppet] - https://gerrit.wikimedia.org/r/261295 (owner: Yuvipanda)
[23:45:07] (PS2) EBernhardson: [test only] Stricter avro schema tests [mediawiki-config] - https://gerrit.wikimedia.org/r/261296
[23:45:56] comeone gerrit
[23:46:00] before my puppet fail of shame
[23:46:14] (PS1) Yuvipanda: tools: Typofix for typofix [puppet] - https://gerrit.wikimedia.org/r/261297
[23:46:14] wtf
[23:46:26] (CR) Yuvipanda: [C: 2 V: 2] tools: Typofix for typofix [puppet] - https://gerrit.wikimedia.org/r/261297 (owner: Yuvipanda)
[23:46:38] there
[23:47:02] Now press the revert button a few times
[23:47:48] * YuviPanda presses Reedy a few times
[23:49:27] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: puppet fail
[23:49:34] do I need to give you the pamphlet?
[23:50:33] apergos: you know RoanKattouw actually has a copy
[23:51:05] from when? he can't have The Pamphlet, only a knock-off
[23:51:50] apergos: I think it was sourced from Trevor
[23:52:01] hm ok that's possible
[23:52:33] and on that weird note, time for bed
[23:52:45] that means time for you to go home, YuviPanda
[23:53:10] or does itmean time for you to go into the office?
[23:53:18] anyways... night!
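Note on the failover exchange at [23:28:35] above: ori describes a failed server being ejected from the hash ring by twemproxy so that its keys are remapped to the remaining servers, and then clarifies that this applies to the mc* hosts rather than the rdbs. The following is a minimal Python sketch of that remapping idea only. It is not twemproxy's actual hashing code: the HashRing class, the SHA-256 placement, the points_per_server value, and the host names mc-a, mc-b and mc-c are all assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring (illustrative; not twemproxy's implementation)."""

    def __init__(self, servers, points_per_server=100):
        self.points_per_server = points_per_server
        self._ring = []  # sorted list of (point, server) tuples
        for server in servers:
            self.add(server)

    @staticmethod
    def _point(label):
        # Map any string to a large integer position on the ring.
        return int(hashlib.sha256(label.encode("utf-8")).hexdigest(), 16)

    def add(self, server):
        # Place several points per server so load spreads more evenly.
        for i in range(self.points_per_server):
            bisect.insort(self._ring, (self._point(f"{server}#{i}"), server))

    def eject(self, server):
        # Drop every point owned by the failed server.
        self._ring = [(p, s) for p, s in self._ring if s != server]

    def lookup(self, key):
        if not self._ring:
            raise LookupError("no servers left in the ring")
        # First point at or after the key's position, wrapping to the start.
        idx = bisect.bisect_left(self._ring, (self._point(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["mc-a", "mc-b", "mc-c"])  # hypothetical host names
keys = [f"session:{n}" for n in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.eject("mc-a")  # simulate the hard failure
after = {k: ring.lookup(k) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys changed server after the ejection")
```

In this sketch only the keys that lived on the ejected server move, which matches the session case ori mentions: users whose session was on the failed server lose it, while keys on the other servers stay where they were.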
[23:53:31] (PS2) Ori.livneh: Lower redis connection timeout from 1s to 0.5s [mediawiki-config] - https://gerrit.wikimedia.org/r/255560
[23:53:48] apergos: night
[23:56:21] (CR) Ori.livneh: [C: 2] Lower redis connection timeout from 1s to 0.5s [mediawiki-config] - https://gerrit.wikimedia.org/r/255560 (owner: Ori.livneh)
[23:57:03] (Merged) jenkins-bot: Lower redis connection timeout from 1s to 0.5s [mediawiki-config] - https://gerrit.wikimedia.org/r/255560 (owner: Ori.livneh)
[23:57:12] operations, Traffic: Wikimedia servers won't serve images - https://phabricator.wikimedia.org/T122453#1907514 (faidon) Open>Resolved a:faidon @Aklapper this should be fixed now, yeah. There are a couple of other bugs here, one is tracked with T122455, the other one is for a feature that got remov...
[23:58:37] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: puppet fail
[23:58:39] operations, Traffic: Varnish leaks memory - https://phabricator.wikimedia.org/T122455#1907517 (faidon) p:Unbreak!>High I reverted the TBF entirely with 4c07fac36de29eca061cb1d99d5a48464623a8d4 because of the restarts above triggering T122453 again, plus the memory leak attributed to how we're operat...